Albert-Ludwigs-Universität Freiburg
Technische Fakultät, Institut für Informatik

Query Workflows over Web Data Sources

Dissertation for the attainment of the doctoral degree (Dr.-Ing.)
by Thomas Daniel Hornung, born December 2, 1977 in Tübingen
Dean: Prof. Dr. Bernd Becker
Reviewers: Prof. Dr. Georg Lausen, Prof. Dr. Wolfgang May
Date of the doctoral examination: October 21, 2010

Freiburg im Breisgau, October 22, 2010
Thomas Hornung
Zusammenfassung

The amount of information available on the Web grows steadily. Put in relation to the speed at which humans can read Web pages and process the information they contain, the gap between the available information and the information that can actually be processed widens continuously. To nevertheless find the desired information in acceptable time, many users resort to search engines, which return an ordered list of results for one (or more) search terms. These results are the Web pages that an often secret search and ranking algorithm deems relevant. The drawback of this approach is that the searcher still has to read and process the top-k Web pages to find the desired information. If the information can only be found by combining data contained in several search results, the acquisition becomes considerably harder, since the search engine cannot support the user in this task. Such search patterns typically arise in queries against Deep Web sources, which are particularly well suited for finding specific data (such as the cheapest price for a connection from Freiburg, Germany, to St. Malo, France). Search engine vendors are making efforts to index such data sources as well, but even in the ideal case this works only for the part of the Deep Web whose underlying data is relatively static.

The focus of this dissertation is this data available over the Web (which may also be embedded in Web pages), with the goal of automating its acquisition and processing. The user should be able to formalize his internal acquisition and processing workflow, thereby combining the advantages of modern computers – which can process data efficiently – with the knowledge of domain experts, who often know the existing data sources and their meaningful combination very well. An essential aspect is thus that the user can actively intervene in the processing, which increases the transparency of the results. Towards this goal, this dissertation makes the following three contributions:

• Uniform query model for Web data sources: with Semantic Web technologies such as RDF and OWL, it is easy to publish machine-understandable data (i.e. data with an unambiguous semantics) on the Web. De facto, however, such data makes up only a fraction of the data available on the Web. Some of it is already machine-queryable via Web Services (in different formalisms, e.g. REST vs. SOAP/WSDL), but the bulk of the data is either only accessible by filling in Web forms, or is simply published directly as tables on various Web pages. Especially for the latter two variants, already the automatic access is a challenge. The first contribution of this dissertation addresses this problem area and introduces a uniform query model for Web data sources that covers both the technical details of the actual access and the semantic meaning of the data, thereby enabling automatic processing of the data.

• Declarative combination of Web data sources: realizing sophisticated information needs requires an expressive formalism. Human problem-solving strategies are typically set-oriented, with intermediate results usually "stored" on paper. A further observation is that in these strategies, process fragments (i.e. interactions involving several Web data sources) are often executed recursively. The second contribution builds on this process-oriented abstraction and extends the process algebra CCS with relational dataflow for the declarative specification of the combination of Web data sources.

• Configurable data type for graph-shaped search spaces: the search space induced by the chosen problem-solving strategy often has the shape of a graph. The desired final results are then obtained by computing (parts of) the transitive closure of this graph. In the context of Web queries, this search-space graph is neither known in advance nor completely available for running suitable algorithms on it; instead, it is built up dynamically at runtime from the results of Web queries. As a consequence, the graph itself is dynamic, and a materialization or long-term storage of the graph contents is not sensible. These conditions call for approaches with a special focus on the exploration and expansion strategy of the graph. The third contribution of this dissertation is the Configurable Graph Data Type (CGDT), which comprises an ontology and an API for configurable graphs. The CGDT combines generic aspects of graphs, such as the insertion of edges, the extension of paths, and the support for different exploration algorithms, with application-specific aspects.

To illustrate the practical applicability of the presented approach, a scenario with which researchers in particular are well acquainted is implemented as an example: travel planning. It turns out that a "purist" algorithmic approach yields better results in this case than a typical travel agency with respect to finding non-intuitive – yet relevant – connections.
Abstract

While the amount of available information on the Web grows steadily, the information processing capabilities of humans remain the same, resulting in a growing gap between the available information and the processable information. A common resort for many users is to use search engines, which operate mainly on the level of whole Web sites; this is also the granularity of the returned results. The task of sifting through and evaluating the found content is still left to the user, who is forced to trust a ranking algorithm unknown to him. The situation gets even more dire when information from several of these sources is to be combined in a non-trivial way. Yet, this is a typical interaction pattern when searching for data on the Deep or Hidden Web (e.g. for finding the cheapest price for a connection from Freiburg, Germany, to St. Malo, France). Although search engines try to index this data, often the data itself is of such volatile nature that this approach falls short.

This thesis addresses this problem by focusing on the structured data that can be gleaned from Web-accessible data sources (referred to as Web data sources) and its processing. The proposed idea is to automate the information acquisition and processing humans follow by declaratively specifying the encompassed Web data sources and their combination as query workflows. This way, the benefits of the vast information processing capabilities of modern computers can be pooled with the expertise of people working in the target domain, allowing for a powerful and user-defined exploration of the search space. To achieve this goal, this thesis makes three contributions:

• Uniform access to Web data sources: the preferred way to publish data on the Web is by using Semantic Web technologies like RDF and OWL, which are already machine-understandable. The de facto situation is different: (structured) information is published via Web Services (again in different formalisms, e.g. REST vs. SOAP/WSDL – already machine-accessible), is geared only towards human visitors as regular Web sites using forms for providing inputs, or is just scattered on arbitrary Web pages in tables. For the latter two alone, automatic access is challenging and non-trivial. The first contribution of this thesis is the introduction of a uniform model for accessing such Web data sources, covering the lower technical details as well as the semantic level, paving the way for data processing.

• Declarative combination of Web data sources: real-world information needs require a powerful and expressive formalism for the analysis and reconciliation of distributed Web data sources. Human problem-solving often follows an intrinsic set-oriented data model, where different intermediate solutions are "stored" with pencil and paper. Another observation is that process fragments (i.e. interaction patterns spanning several Web data sources) are often executed recursively. The second contribution builds on this process-oriented notion of human problem-solving and extends the process algebra CCS with relational dataflow for the specification of query workflows, allowing users to declaratively express information needs.

• Configurable data type for graph-structured search spaces: the explored search space is often graph-structured. Thus, a recurring motif when designing query workflows is the computation of (parts of) transitive closures in graphs. In the context of the Web, this graph is neither known nor materialized a priori to run algorithms on it, but is explored only at runtime, using one or more Web data sources. Often, even the graph data itself is dynamic, which does not allow for long-term materialization or caching. These characteristics require completely different algorithms where the exploration and expansion strategy for the graph itself is the central issue. The third contribution of this thesis is the Configurable Graph Data Type (CGDT), which provides an ontology and an API for configurable graphs. The design of the CGDT combines generic graph behavior (e.g. insertion of edges, instantiation of paths, and support for exploration algorithms) with application-specific configurability.

Finally, the applicability of the approach is demonstrated with an application scenario that is well known to researchers: travel planning. In this demanding setting, it is shown that a fully algorithmic approach "outperforms" common travel agencies in terms of finding non-intuitive – yet relevant – connections.
To Monika, my beloved wife.
Contents

Part I. Preface

1. Introduction
   1.1. Motivation
   1.2. Thesis Outline
        1.2.1. How to read this thesis
        1.2.2. Contributions and Published Work

Part II. Preliminaries

2. The Semantic Web
   2.1. Introduction
        2.1.1. The Semantic Web Layer Cake in a Nutshell
   2.2. XML & Related Technologies
        2.2.1. Extensible Markup Language (XML)
        2.2.2. Accessing and Querying XML: XPath & XQuery
   2.3. RDF, RDFS & SPARQL
        2.3.1. Resource Description Framework
        2.3.2. RDF Schema
        2.3.3. SPARQL Query Language for RDF

3. The MARS Framework
   3.1. Introduction
   3.2. Overview: The MARS Framework
        3.2.1. Hierarchical Language Structure
        3.2.2. State, Communication, and Data Flow
   3.3. The Action Component
        3.3.1. Atomic Constituents
        3.3.2. Operators
   3.4. Technical Realization

Part III. Accessing Web Data Sources

4. The Annotation Layer
   4.1. Introduction
   4.2. Preliminary Considerations
   4.3. Annotation of Web Data Sources
   4.4. Web Data Source Description Language
        4.4.1. Base Views
        4.4.2. Derived Views
   4.5. Use Case: Annotation of Web Service Sources
        4.5.1. SOAP/WSDL
        4.5.2. Representational State Transfer (REST)
   4.6. Annotation of Query Workflows
   4.7. Outlook: Query Workflows over Web Data Sources
   4.8. Related Work

5. The Query Layer
   5.1. Introduction
   5.2. From Semantic Annotation to Query Language
   5.3. Query Engine Overview
        5.3.1. Micro View
        5.3.2. Macro View

6. The Source Layer
   6.1. Introduction
   6.2. Static Web Data Sources
   6.3. Dynamic Deep Web Sources
        6.3.1. Form Analysis
        6.3.2. Deep Web Navigation
        6.3.3. Web Data Extraction and Data Record Labeling
        6.3.4. DWQL Engine Architecture
   6.4. Related Work

Part IV. Combining Web Data Sources

7. Tailoring CCS to Query Workflows
   7.1. Introduction
   7.2. Preliminaries: MARS & CCS
        7.2.1. The Process Model: Processes and their Constituents
        7.2.2. State, Communication, and Data Flow via Variable Bindings
   7.3. RelCCS: The Relational Data Flow Process Language
        7.3.1. Syntax and Semantics of RelCCS
        7.3.2. Recursive Processes in RelCCS
        7.3.3. Data-Oriented RelCCS Operators
        7.3.4. Technical Realization
   7.4. Use Cases
        7.4.1. Bacon Numbers
        7.4.2. Travel Planning
   7.5. Related Work

8. Support for Graph-Structured Domains
   8.1. Introduction
   8.2. Conceptual Overview
   8.3. Signature and Operations
        8.3.1. Data Definition Language (DDL)
        8.3.2. Data Manipulation Language (DML)
   8.4. Configurability of the Exploration Process
   8.5. Technical Realization
        8.5.1. Implementation Details
        8.5.2. CGDT as MARS Action and Query Language
   8.6. Use Case: Bacon Numbers + CGDT
   8.7. Related Work

9. Application Scenario: Travel Planning
   9.1. Introduction
   9.2. Web Data Source Overview
   9.3. The Graph Configuration
   9.4. The Exploration Process
   9.5. The Exploration Process (Revisited)
   9.6. Experimental Evaluation

Part V. Discussion

10. Conclusion
    10.1. Summary
    10.2. Perspectives

A. Configurable Graph Data Type
   A.1. CGDT RDFS Ontology

Bibliography
List of Figures

1.1. System overview
1.2. Reading graph
2.1. Semantic Web layer cake
2.2. Flight plan of the carrier "Lufthansa" (excerpt)
2.3. The Air DB as RDF graph (excerpt)
3.1. Evaluation phases of ECA rules
3.2. MARS components and associated languages
3.3. Use of variables in an ECA rule
3.4. Shipping of variable bindings to query engine and back
3.5. CCS transition rules
3.6. MARS CCS Architecture and Communication
4.1. Web Data Source Access Layers
4.2. Consecutive computation of the overall result
4.3. Web Service "layer cake"
4.4. Web Service life cycle
4.5. Overview of REST-style (read-only) Web Service architecture
5.1. Parallel processing of input bindings
5.2. Caching in DWQL and WSQL
5.3. Generic analysis of view signature
5.4. Generic communication during query evaluation
6.1. Characterization of Web data sources
6.2. Acquisition and query pattern for static Web data sources
6.3. Accessing a Deep Web site
6.4. Web form with HTML representation
6.5. Labeling of input elements
6.6. Relation tree for static input elements
6.7. Relation tree for dynamic input elements
6.8. Deep Web navigation process
6.9. Intermediate page for the German railways portal
6.10. KAPath expression allowing optional HTML elements
6.11. HTML result page layouts
6.12. Labeling of the HTML result page
6.13. DWQL engine architecture and cooperation with MARS
7.1. CCS transition rules
8.1. Basic Notions of the CGDT Ontology
9.1. Overview of possible connections between Göttingen and St. Malo
9.2. System configuration (overview)
List of Tables

1.1. Published Work
4.1. Mapping from WSDL to WDSDL
7.1. Extension of the characteristic predicates movie2actor′ and actor2movie′
7.2. #(Tuples) before evaluation of the Concurrent[minus] operator = 16
7.3. #(Tuples) after evaluation of the Concurrent[minus] operator = 6
8.1. CGDT Implementation: Schematic Overview
8.2. CGDT contents after 1st round, edge table (top) – path table (bottom)
8.3. CGDT contents after 2nd round, edge table (top) – path table (bottom)
8.4. CGDT HTML result table
9.1. CGDT contents, vertex table (top) – edge table (bottom)
9.2. CGDT contents, path table
9.3. Evaluation results (overview)
9.4. Evaluation results (DWQL engine)
9.5. Evaluation results (WSQL engine)
9.6. Evaluation results (CGDT engine) – vertex (top), edge (middle), path (bottom)
9.7. Top-35 connections from Göttingen to St. Malo (excerpt)
Part I. Preface
Chapter 1. Introduction

To furnish the means of acquiring knowledge is the greatest benefit that can be conferred upon mankind. It prolongs life itself and enlarges the sphere of existence.
– John Quincy Adams (July 11, 1767 – February 23, 1848)
Contents

1.1. Motivation
1.2. Thesis Outline
     1.2.1. How to read this thesis
     1.2.2. Contributions and Published Work
1.1. Motivation

While the amount of information that is available on the World Wide Web¹ is growing steadily, extracting and combining this information² appropriately can be considered an art.

¹ In the remainder of the thesis simply referred to as WWW or Web.
² The terms information and data will be used interchangeably in the remainder of the thesis.

In addition to the sheer mass of data, efficient handling is aggravated by new technologies that have changed the shape of the Web dramatically since its inception by Tim Berners-Lee at the European Organization for Nuclear Research (Conseil Européen pour la Recherche Nucléaire – CERN). For one, it has deepened: a recent study by He et al. [HPZC07] found an exponential growth and great subject diversity of so-called Deep or Hidden Web [RGM01] sites. These sites offer access to an underlying database via a Web form and are geared towards human visitors. As a consequence, typical search engines are not able to provide the necessary inputs and cannot crawl and index the contents. Often, it even makes no sense to crawl and index the contents, as is e.g. the case for travel portals: what use would it be to index all possible flights between all airports for the next 30 days? Besides, the information in these Deep Web resources is in continuous flux, e.g. the availability
of a specific flight is changing constantly. Accessing this data in an automated fashion is a highly non-trivial task, and there is a large body of work dealing with the related "wrapping" issues (cf. [RGM01, ZHC04, HMYW04, CHZ05, DYM06, HM07, MPR+09b, MPR+09a, LNL+10]). For Deep Web result pages, the extraction precision of state-of-the-art tools is usually above 90% [LRNdST02, SL05], which is made possible because these pages usually are highly structured. Typically, the results are rendered as an HTML table, where the table header contains the description of the contents of each column. One prominent application of this research is to combine several of these Web resources and offer a single point of access by a vertical search engine that delegates the requests to the underlying Web sites and afterwards merges the results [JYL+09].

Another source for gleaning data are Web Services, which come in two different flavors: REST [Fie00] and SOAP/WSDL [Cer02, CCMW01, CMRW07, CHL+07]. These resources are already machine-accessible and operate on a syntactical level of data types known from programming languages. The advantage over Deep Web sources is that the access to these sources is comparatively trivial, and the returned data is always correct in the sense that there is no confusion about the validity of a returned value. Research questions for Web Services are mainly concerned with the (semi-)automatic composition of several Web Services to realize a more involved business process [RS04, dVNP08].

Finally, the focus of the Semantic Web [BLHL01] is on the interaction of machines. Also proposed and pioneered by Tim Berners-Lee, data is given a clear and precise semantics by using description languages such as the Resource Description Framework (RDF) [MM04] and the Web Ontology Language (OWL) [MvH04]. The main idea is to describe a set of concepts and their relationships, denoted as an ontology, and then use the concepts defined therein to assert facts about (virtually) any resource. This forms the basis for autonomous software agents to act on behalf and in the interest of a human user. Access to RDF data is possible with the query language SPARQL [PS08], which is similar in style to SQL.

The common theme of these technologies is that they provide access to structured data on the Web by different technical means and on varying semantic levels. One of the contributions of this thesis is to lift such Web data sources onto a common semantic level, including a generic description and access interface, to incorporate them as data sources in the MARS (Modular Active Rules for the Semantic Web) framework [MAA05a]. MARS provides an open framework for ECA (Event-Condition-Action) rules and for processes. The core of the MARS approach is a model and an architecture for ECA rules that use heterogeneous event, query, and action languages. In this thesis, two such languages are presented: DWQL (Deep Web Query Language) and WSQL (Web Service Query Language). The former allows posing queries against Deep Web sources, and the latter is geared towards virtually any data source that is already machine-accessible, including but not limited to Web Services (cf. Figure 1.1). The evaluation of a query with DWQL or WSQL results in a set of tuples of variable bindings of the form t = {v1/x1, ..., vn/xn} with v1, ..., vn variables and x1, ..., xn elements of the underlying domain (which is here the set of strings, numbers, and XML literals). Thus, for a query,
the answers can be seen as a relation whose attributes are the names of the variables.

Figure 1.1 Embedding of DWQL and WSQL queries in query workflows (within the MARS framework, a query workflow dispatches queries to the DWQL engine, which accesses the Deep Web, and to the WSQL engine, which accesses Web Services, SPARQL endpoints, etc.)
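As a purely hypothetical illustration of this data model (source and variable names are invented), a query for train connections leaving Freiburg might yield the tuples {From/"Freiburg", To/"Karlsruhe", Price/"23.50"} and {From/"Freiburg", To/"Basel", Price/"19.00"}. Viewed as a relation, these answers read:

    From       To          Price
    -------------------------------
    Freiburg   Karlsruhe   23.50
    Freiburg   Basel       19.00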
Now that the issues concerning the access to a single Web data source are solved, the question is how to efficiently combine information from different, heterogeneous Web data sources. For this, query workflows [HML09] based on an extension of CCS (Calculus of Communicating Systems) [Mil83] with relational dataflow are proposed in this thesis. These workflows are based on a generic data model and an extensible set of functional modules, including the ability to access Web data sources (cf. Figure 1.1). Important basic functionality includes appropriate mechanisms to deal with information acquisition and target-driven information processing on a high level, like using design patterns for acting on graph-structured domains [HM09b].

A prototypical application for query workflows, which is used as a running example throughout the thesis, is to automate travel planning enquiries. Even employees of travel agencies usually process such enquiries manually, which requires a lot of time and is potentially incomplete and suboptimal. Although the manual process follows a small number of common patterns (e.g. searching for paths in a transitive relationship distributed over several sources, like flight schedules and train schedules, with heuristics for bridging long distances vs. shorter distances, making prereservations, doing backtracking), it is hard to automate, since the sources are not integrated and the underlying formalism needs to be expressive enough to cover all the abovementioned patterns. Thus, a recurring theme in this thesis is that for many real-world tasks it is easier to design the process of how to solve such a problem than to state a single query. An analogous observation was made in [ORS+08] for Pig Latin, a language for analyzing large amounts of data using the MapReduce framework [DG04].

The main focus in this thesis is not on performance, but on the qualitative ability to express and execute complex workflows and decision processes in a reasonable time
– i.e., to replace hours of interactive human Web search by an unsupervised automated process that also may take hours but finally results in one or more proposals, including the optimal one. Also, the process design/programming is not expected to be done by casual users, but by skilled process designers in cooperation with domain experts – analogously to application database design.
1.2. Thesis Outline

The thesis is divided into four different parts (excluding Part I, which is the Preface).

Part II lays the theoretical and technical foundations upon which this thesis builds. Chapter 2 gives an introduction to the Semantic Web with a special focus on the Extensible Markup Language (XML) [BPSM+08], RDF, RDF Schema (RDFS) [BG04], and SPARQL [PS08]. These technologies serve as building blocks for several contributions in this thesis and in particular for the MARS framework, which is presented in Chapter 3.

Part III is devoted to the generic handling of a single Web data source. Chapter 4 gives a more detailed characterization of the notion of a Web data source, which delimits the type of data sources that are captured by this approach. Also, the uniform generic high-level description of these sources is defined. Chapter 5 describes the connection and technical realization of this generic description as concrete MARS-style query languages, namely the Web Service Query Language (WSQL) and the Deep Web Query Language (DWQL). As hinted by this distinction, Deep Web sources require special consideration to make them amenable to automatic processing, which is the major focus of Chapter 6.

Part IV operates solely on the level of Web data sources and proposes an approach for the efficient combination of several such sources in a non-trivial way. In this respect, Chapter 7 extends the Calculus of Communicating Systems (CCS) with relational dataflow, which is the data model of the MARS framework. Additionally, new operators are introduced into the language that are geared towards the requirements of query workflows. To clearly separate the algorithmic details of the process from the handling and efficient administration of intermediate results, Chapter 8 leverages the RDF/S language for the definition of a configurable graph data type. Capabilities include but are not limited to the on-demand expansion of paths according to extension conditions and the inductive definition of properties. This greatly eases the development of query workflows in graph-structured domains. Chapter 9 concludes Part IV with an implementation of the running travel planning application scenario. In this challenging setting, most of the introduced components are tested and the real-world applicability of the approach is shown.

Part V wraps up the thesis with a summary and a discussion of future work in Chapter 10. Related work is generally discussed in each chapter separately to increase readability.
1.2.1. How to read this thesis

Ideally, the thesis is read front to back, each chapter in sequence. However, the thesis is structured to allow for different reading orders as well: if the reader feels sufficiently familiar with Semantic Web related technologies, Chapter 2 can be skipped. It is, however, recommended to read Chapter 3, because the later chapters assume some knowledge of the intricacies of the MARS framework. The fundamental generic interface description of Web data sources is introduced in Chapter 4. Chapter 5 is also essential for the understanding, since it presents the connection of the interface description with MARS-style query languages. If the reader is not interested in the more technical details related to accessing Web data sources, Chapter 6 can be skipped. Chapters 7 and 8 are recommended. The reader may again skip Chapter 9, and finally go on to Chapter 10, which draws a conclusion and presents directions for further research. For the quick reader, the gist of the contributions of the thesis can be learned by reading Chapter 1 and then skipping directly to Chapter 10. An overview of meaningful reading sequences is shown in Figure 1.2.
1.2.2. Contributions and Published Work

The main contributions of this thesis are contained in Parts III and IV. Several papers have been published in the course of this Ph.D. thesis at international workshops and conferences. An early approach for integrating data from different Deep Web sources, similar to a vertical search engine, was presented in [HSL06]. Here, the data integration part was accomplished with the F-Logic [KLW95] reasoner Florid [FHK+97]. Improvements in wrapping and accessing Deep Web sources have been reported in [SHL06, WH08a, WH08b], where [SHL06] presented a rule-based approach to the labeling of extracted data records, and [WH08a, WH08b] proposed a new approach to Deep Web navigation. On a higher, more declarative level, a unique formalism for the annotation of (not only) Deep Web sources was discussed in [HM09a, HM09c]. A more dataflow-oriented approach for the integration of several Deep Web sources that supports the piping of outputs from Deep Web sources to inputs of other Deep Web sources was reported in [HSL08a] and [HSL08b], respectively. These results were generalized in [HML09] to the notion of query workflows that build upon an extension of the process algebra CCS [Mil83]. Finally, in [HM09b, HMS10] a configurable graph data type was proposed that is geared towards situations where (parts of) the transitive closure of graph-structured search spaces needs to be explored. The overall relationship between chapters and publications is illustrated in Table 1.1.
Chapter   Related Publications
4         [HM09a, HM09c]
6         [SHL06, WH08a, WH08b, HSL08a, HSL08b]
7         [HSL08a, HSL08b, HML09]
8         [HM09b, HMS10]

Table 1.1.: Published Work
Figure 1.2 Reading graph (nodes: Chapters 1 through 10)
Part II. Preliminaries
Chapter 2. The Semantic Web

The Star Trek computer doesn't seem that interesting. They ask it random questions, it thinks for a while. I think we can do better than that.
– Larry Page (born March 26, 1973)
Contents

2.1. Introduction
     2.1.1. The Semantic Web Layer Cake in a Nutshell
2.2. XML & Related Technologies
     2.2.1. Extensible Markup Language (XML)
     2.2.2. Accessing and Querying XML: XPath & XQuery
2.3. RDF, RDFS & SPARQL
     2.3.1. Resource Description Framework
     2.3.2. RDF Schema
     2.3.3. SPARQL Query Language for RDF
2.1. Introduction

The relationship between the Semantic Web [BLHL01] and the Web is similar to the relationship between a database management system and a file system. While a file system offers read and write access to files, the machine itself has no knowledge of the contents of a file and thus operates on buckets of opaque data. In a database system, there is a well-defined schema, and it is possible for a computer to automatically execute fine-granular operations on the data level, including actions on behalf of the user such as triggers [AHV95]. On the Web, the access granularity is the Web page; without further pre-processing such as Natural Language Processing [MS99] or Web data extraction [LRNdST02] techniques, it is just a tree-shaped data structure. The information contained therein, such as flight prices, cannot be easily queried and combined with other information on the Web.
The key idea of the Semantic Web is to improve this situation by creating a Web of data, where machines can find and make use of such information, by giving this data a precise semantics (akin to a database schema), thus creating a global, distributed, decentralized database of knowledge. This is realized in layers of different technologies, which (more or less) build upon one another as shown in Figure 2.1.

Chapter Structure Section 2.1.1 gives a brief overview of the lower layers, while Sections 2.2 and 2.3 discuss the layers highlighted in gray in more detail, which are fundamental to many contributions of this thesis.

Figure 2.1 Semantic Web layer cake, adapted from [BL07] (bottom to top: URI/IRI, XML, RDF, RDFS with SPARQL, RIF, and OWL above it, Unifying Logic, Proof, Trust, and User Interface & Applications, with Crypto alongside)
2.1.1. The Semantic Web Layer Cake in a Nutshell

In recent years, different versions of the Semantic Web layer cake have been proposed, each reflecting the current state of standardization efforts. As a general rule, the stabilization of this layered architecture took a bottom-up approach, where the lower levels up to the Resource Description Framework (RDF) [MM04] and the RDF Schema (RDFS) [BG04] reached maturity early. Above the RDF/S level, open research questions
regarding the integration of RDF/S semantics and different versions of the Description Logic [BCM+03] based Web Ontology Language (OWL) [MvH04, HPS04], especially in conjunction with rule-based languages such as Datalog [AHV95], were the topic of research efforts [HPPSH05, PH07]. Additionally, OWL itself has been studied in different contexts [GHM+08, MHS09, PUHM09]. In a recent talk, James A. Hendler [Hen09] gives an overview of the evolution of the layer cake and argues that the current state of the art in Semantic Web research is not reflected in the current version of the layer cake. Thus, it is to be expected that the Semantic Web layer cake might be adapted further in the future. Fortunately, although the contributions in this thesis build upon the technologies depicted in the Semantic Web layer cake, they do not rely on a strict layering, and a further change of the layer cake should have only marginal impact on the presented approach. In the remainder of this section, a brief overview of the lower layers is given.

Uniform Resource Identifier (URI)/Internationalized Resource Identifier (IRI) Uniform Resource Identifiers (URIs) [BLFM05] are at the core of the Semantic Web, and of the Web in general. They make it possible to uniquely address a Web resource, e.g. a Web page, but can also be utilized to identify virtually anything, such as people or concepts. Syntactically, many URIs in this thesis are of the form bla://...[#...], where bla is the scheme part of the URI, and the text in brackets is an optional fragment identifier. The scheme part bla was chosen to stress the fact that the URIs are only used as IDs and cannot actually be retrieved (as might be implied by using http or ftp). Internationalized Resource Identifiers (IRIs) [DS05] are an extension of URIs in the sense that they allow users to create URIs in their native language by extending the set of allowed characters from US-ASCII [Ame97] to Unicode [UC06]; in the remainder, only URIs are used.

Extensible Markup Language (XML) XML [BPSM+08] is the lingua franca of the (Semantic) Web for data serialization and exchange. Similar to HTML [RHJ99], it is a derivative of SGML [Gol91] and is based on a tree-structured data model, which is serialized to a well-known textual representation. The schema of an XML document can be specified with a Document Type Definition (DTD) [Lau05]. Section 2.2 continues the presentation of XML and also introduces the access and query languages XPath [CD99, BBC+07] and XQuery [LS04].

Resource Description Framework (RDF) RDF [MM04] is the data model underlying the Semantic Web. In RDF, information is encoded as a labeled directed graph where nodes represent resources, identified by URIs, and literals, and the labeled edges identify the relationships between them. The edge labels are also URIs, which makes up the salient feature of the Semantic Web: properties and classes can also have properties, which allows for expressing meta data information. The directed labeled edges in the graph can be interpreted as triples of knowledge of the form (subject, predicate, object), where the
subject and the object are the source and destination nodes and the predicate is the label. RDF can be serialized for data exchange, e.g. using a specific XML-based representation. Section 2.3.1 discusses the RDF data model and introduces the Turtle [BBL08] notation, which is used throughout this thesis for representing RDF data.

RDF Schema (RDFS) RDFS [BG04] is a simple ontology (or vocabulary) definition language for the RDF data model, which makes it possible to capture the schema of an application domain, e.g. a biological taxonomy. By associating an RDFS vocabulary with a collection of RDF triples, additional triples can be derived, which are implicitly defined by the RDF schema. Consider for example the triple (lassie, is a, dog), which intuitively captures the fact that Lassie is a dog. If the RDFS vocabulary now additionally defines that the concept dog is a subclass of the concept animal, a reasoner can infer that Lassie is also an animal. Note that RDF Schema is not to be understood as constraints in the database sense, but rather as derivation rules. This means that RDFS only implies additional triples; it was not designed to detect inconsistencies with respect to a specific RDFS vocabulary (e.g. additional information that is not covered in the respective RDFS vocabulary, such as (lassie, stars in, TV series), can also be added to the RDF graph). Section 2.3.2 further describes the RDFS language.

SPARQL Query Language for RDF (SPARQL) SPARQL [PS08] is a query language for RDF data. Its design is based on SQL-like clauses using triple patterns, where positions can optionally be replaced by logical variables (of the forms "?vari" or "$vari"). Additionally, filter conditions (e.g. of the form ?vari ≤ ?varj) can be stated. Variables occurring twice or more act as join variables. A subset of the variables can be declared in the select clause to be the answer variables, whose matches with resources and literals are the answers. The result of a query is thus a set of tuples of variable bindings. Section 2.3.3 provides a more comprehensive introduction to the query language, including the supported query operators, such as union and optional.

Web Ontology Language (OWL) OWL [MvH04] is built on top of RDFS and supports a greater variety of language constructs for specifying more intricate ontology vocabularies. The original specification defined three different flavors of OWL: OWL Lite, which is geared towards users who want to create classification hierarchies and only need simple constraints; OWL DL¹, which is more expressive and still guarantees decidability; and finally OWL Full, which allows the full syntactic freedom of RDF(S), at the cost that inference is undecidable [HST99]. Recently, OWL Version 2.0 has been released, which is based on the Description Logic SROIQ(D) [HKS06]. One notable innovation is the introduction of profiles to improve scalability for typical applications [MGH+09, PUHM09]. In terms of data representation, RDFS and OWL ontologies can both be represented as a set of RDF triples, which allows SPARQL to be used as the query language for both the data level (RDF itself) and the schema level (RDFS and OWL), thus facilitating declarative self-describing systems.

¹ DL is a reference to the close relationship of OWL DL to the Description Logic SHOIN(D) [HS99].

This concludes the overview of the Semantic Web layers that are relevant in this thesis. The next section deals with the most important aspects of XML and its related technologies, such as XPath and XQuery.
2.2. XML & Related Technologies

One of the main use cases of XML [BPSM+08] is the exchange or – more generally – the serialization of domain-specific languages (cf. the Business Process Execution Language for Web Services (WS-BPEL) [AAA+07], or the serialization of MARS language fragments described in Chapter 3). Another area where it shines is the encoding of any kind of meta data. Section 2.2.1 recapitulates the basic concepts of the XML language. Section 2.2.2 presents the official W3C² standards for accessing and querying XML documents: XPath and XQuery.

² http://www.w3.org/
2.2.1. Extensible Markup Language (XML)

Some familiarity of the reader with the XML data model is assumed, and thus only the main concepts that are relevant for the understanding of this thesis are highlighted. For a more comprehensive introduction to XML, the interested reader is referred to [Lau05].

An XML document has a tree structure, which is depicted in Figure 2.2 for an excerpt of the fictional flight plan of the carrier "Lufthansa". The structure of the tree is made up of so-called element nodes, i.e. all inner nodes in the tree are element nodes, e.g. the nodes named lufthansa or airport. Each element node may have further attributes (also called attribute nodes in the remainder), which are essentially name="value" pairs and are depicted next to the respective element node (e.g. airport has three attributes, of which one has the name code and the value FRA). Among the element nodes, the single topmost element node is denoted as the root node. Finally, text nodes can only appear as leaf nodes of the tree and are surrounded by dashed lines (e.g. the text node Rhein-Main-Flughafen Frankfurt). Leaf nodes of type element are called empty elements, but may still have attributes (e.g. the node named arrival).

There is a mapping from the XML tree representation to a serialization that is used for data exchange and visualization. For this, the tree has to be traversed in preorder, depth-first (designated as document order). Each encountered element node during this traversal is wrapped into an opening
Figure 2.2 Flight plan of the carrier "Lufthansa" (excerpt): an XML tree with root lufthansa, airport elements (attributes code="FRA", country="D", timezone="+1"; child element name with the text node Rhein-Main-Flughafen Frankfurt), and flight elements (attributes nr="LH123", from="FRA", to="LIS") with empty departure and arrival child elements carrying hours and minutes attributes
tag (e.g. <airport>) and a closing tag (e.g. </airport>). The subtree below this element is then serialized recursively between the opening and closing tag of the parent node, resulting in a nested structure of tags that mimics the tree structure. Attribute nodes are serialized as whitespace-separated name="value" pairs following the element name in the opening tag (e.g. <airport code="FRA" country="D" timezone="+1">). For text nodes, the content is serialized directly without further embellishments. The complete flight plan underlying Figure 2.2 would thus be serialized like this (element and attribute names as in Figure 2.2; attribute values not shown in the figure are representative):

<lufthansa>
  <airport code="FRA" country="D" timezone="+1">
    <name>Rhein-Main-Flughafen Frankfurt</name>
  </airport>
  <airport code="LHR" country="GB" timezone="0">
    <name>London Heathrow</name>
  </airport>
  <airport code="LIS" country="P" timezone="0">
    <name>Lisboa Portela</name>
  </airport>
  <airport code="OPO" country="P" timezone="0">
    <name>Porto - Francisco Sa Carneiro</name>
  </airport>
  <airport code="JFK" country="USA" timezone="-5">
    <name>John F. Kennedy Airport New York</name>
  </airport>
  <airport code="ALB" country="USA" timezone="-5">
    <name>Albany International Airport</name>
  </airport>
  <flight nr="LH123" from="FRA" to="LIS">
    <departure hours="14" minutes="30"/>
    <arrival hours="16" minutes="40"/>
  </flight>
  <!-- ... further airports (e.g. MUC) and flights (FRA to OPO, MUC to OPO, ...) ... -->
</lufthansa>
Document Type Definition The schema of an XML document (i.e. the correct nesting of elements and their allowed attributes) can be specified by a Document Type Definition (DTD). DTDs are based on an intuitive syntax: elements are defined according to the scheme <!ELEMENT element-name element-definition>. The element-definition is given by a content model, which can be either EMPTY (for elements with no child nodes), ANY (for elements that allow arbitrary content), or an expression of the form "(regex-pattern)". For child nodes of type text, the regex-pattern is simply #PCDATA (= "parsed character data"), while for child nodes of type element two aspects can be discerned: how often a specific element can occur, and which elements may occur in what order. That a specific element may occur zero times or once is indicated by suffixing
the element name with a "?", zero or more times by suffixing the element name with a "*", and one or many times by suffixing the element name with a "+", as usual. Elements that have to occur in a certain order below element-name appear separated by ",", e.g. a flight plan may contain information about airports and flights, but if both are present, the airport information needs to occur before the flight information in the XML document. An exclusive choice between elements is defined by element-name1 | ... | element-namen. Generally, the nesting of elements in a DTD can also be recursive.

The definition of attributes adheres to the scheme <!ATTLIST element-name attribute-name attribute-name-definition>. The attribute-name-definition first specifies the type of content, e.g. strings as indicated by CDATA ("character data"). A special type are keys (indicated by ID) and references to keys (indicated by IDREF). Keys impose the restriction that for each occurrence of a key attribute in the XML document, the associated attribute values are mutually different (i.e. there are no two occurrences of the same key attribute with the same value). Consequently, in the case of the flight plan example, the from and to attributes of flights uniquely identify the respective airports. After the definition of the type of the attribute, default constraints can be specified, where #REQUIRED means that the attribute needs to be present for each element of the type. If the attribute is #IMPLIED, it can be left out. Constant attribute values can be defined by specifying them to be #FIXED.

Sometimes it can be beneficial to associate aliases with repetitive text fragments. This can be done with the syntax <!ENTITY reference "text fragment">. This way, so-called entities are defined that can later be referenced by &reference; and result in a textual replacement of the reference with the associated text fragment. The main purpose is to keep DTD descriptions concise and to cater for special characters. The DTD for the flight plan example is shown here:
<!ELEMENT lufthansa (airport*, flight*)>
<!ELEMENT airport (name)>
<!ATTLIST airport code     ID    #REQUIRED
                  country  CDATA #REQUIRED
                  timezone CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT flight (departure, arrival)>
<!ATTLIST flight nr   CDATA #REQUIRED
                 from IDREF #REQUIRED
                 to   IDREF #REQUIRED>
<!ELEMENT departure EMPTY>
<!ATTLIST departure hours   CDATA #REQUIRED
                    minutes CDATA #REQUIRED>
<!ELEMENT arrival EMPTY>
<!ATTLIST arrival hours   CDATA #REQUIRED
                  minutes CDATA #REQUIRED>
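A document is tied to its DTD by a document type declaration at its top; a validating parser then checks the document against this schema. A minimal sketch (the file name flightplan.dtd is a placeholder):

<?xml version="1.0"?>
<!DOCTYPE lufthansa SYSTEM "flightplan.dtd">
<lufthansa>
  <!-- document content as serialized above -->
</lufthansa>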
Namespaces The purpose of namespaces is to define XML vocabularies, where each element and attribute has a well-understood real-world "semantics" [BHL+09]. Elements in this vocabulary are then identified by so-called expanded names (i.e. pairs consisting of a namespace name and a local name). This way, exchange and reuse of XML documents that use different vocabularies on a Web-wide scale is possible. Syntactically, namespaces can be defined by using the predefined attribute xmlns:ns like this: <elem xmlns:ns="foo://my/namespace/">. Now, in this element node and all of its child nodes, the namespace "foo://my/namespace/" can be referenced by its associated name ns. For instance, the element name ns:child-elem corresponds to foo://my/namespace/child-elem. RDF(S) also uses namespaces to represent URIs more concisely, and to (informally) reference the related RDFS vocabulary (e.g. the Friend of a Friend RDFS vocabulary uses the namespace "http://xmlns.com/foaf/0.1/" with the associated name foaf). Note that namespace information cannot be captured in regular DTDs, which is why in the remainder a nearly-DTD syntax is used. A further application area for namespaces is discussed in Chapter 3, where they are used to identify different language fragments in a single XML tree.

XML Schema Datatypes To resolve the shortcomings of DTDs, especially with respect to namespaces and the missing support for data types, the W3C has standardized XML Schema [SMT00, GSMT09, PGM+09], which uses XML documents to specify the schema of XML markup languages and supports built-in data types as well as the creation of new data types. Here, especially the built-in data types of XML Schema are relevant, since they are part of both the XQuery typing system and the RDFS typing system, and will also be used in this thesis. These basic data types reside in the namespace http://www.w3.org/2001/XMLSchema (in the remainder abbreviated with the prefix xs) and comprise the typical data types, such as xs:integer or xs:double, but also data types pertaining to time. The most relevant ones in this context are xs:time, which encodes an instant of time that recurs every day, with the format "hh:mm:ss" (e.g. "12:00:00"); xs:dateTime, which is used for representing specific time points and has the format "-?YYYY-MM-dd'T'hh:mm:ss" (e.g. 2010-07-11T20:30:00); and xs:dayTimeDuration for durations of time with the format "PnD?'T'nH?nM?nS?" (nD = number of days, etc.; the more general xs:duration additionally allows year and month components), e.g. "PT1H10M" would represent one hour and ten minutes. To be able to uniquely distinguish between months and minutes in general duration values, the token "T" functions as a divider between the date and the time components.
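These data types come with arithmetic in XQuery; for instance, subtracting two xs:dateTime values yields an xs:dayTimeDuration (a minimal sketch with made-up values; this kind of arithmetic reappears in Example 3 below):

xs:dateTime("2010-07-11T16:40:00") - xs:dateTime("2010-07-11T14:30:00")
(: evaluates to the xs:dayTimeDuration "PT2H10M", i.e. two hours and ten minutes :)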
2.2.2. Accessing and Querying XML: XPath & XQuery

Accessing XML XPath [CD99, BBC+07] builds on the abstract representation of XML documents as trees and uses so-called location paths to navigate the XML tree structure in order to select the sequence of desired nodes³.

³ XPath 1.0 [CD99] was defined over a set of nodes, but beginning with XPath 2.0 [BBC+07] the result is a sequence of nodes in document order. In the remainder, the term nodes includes element nodes, text nodes, and attribute nodes.
A location path (lp) consists of several path steps, i.e. lp := (step1, ..., stepn), and the intuition behind the evaluation of location paths is that the sequence of nodes selected by path stepi is used as input (referred to as context nodes) for the evaluation of path stepi+1. Each of these path steps consists of a mandatory navigation axis, an optional node test, and an optional predicate, which is syntactically represented like this: "axisname::nodetest[predicate]". Two types of location paths can be discerned, relative location paths and absolute location paths, similar to relative and absolute paths in the UNIX file system [Tan92]. In the context of this thesis, only absolute paths are considered, i.e. all location paths use the root node of the XML tree as initial context node.

The only relevant navigation axes in this thesis are the "child" (abbreviated as "/") and "descendant" (abbreviated as "//") axes. The former selects the sequence of all child nodes of all context nodes and the latter the sequence of all descendant nodes, i.e. the XPath "descendant::flight" (abbreviated "//flight") would select all flight nodes in the flight plan XML tree from above. In this example, "flight" is the node test, restraining the selected sequence to those element nodes whose name is flight. To select attributes, the attribute name has to be prefixed by the "@" character, e.g. the XPath "//@from" would select all attributes in the flight plan example with the name from. Example 1 highlights the syntactic varieties of location paths (including predicates) wrt. the excerpt of the flight plan shown above, which are relevant for the understanding of this thesis. From this point on, only the abbreviated syntax for location paths is used, which allows for a more concise presentation.

Example 1 The XPath "/lufthansa/flight[@from="FRA"]" returns all flights that leave from the airport with the IATA code "FRA"; the XPath "//flight[@from="FRA"]" would return the same results. To access the name of the airport with this IATA code, the XPath "//airport[@code="FRA"]/name/text()" can be used. This location path first retrieves the sequence (in document order) of all descendant element nodes with the name airport, starting from the root node. Then, the sequence is restricted to those where the value of the attribute code is equal to "FRA" (which is just one node, since the attribute code is a key), and for this single element the text "content" of the name child node is returned.

The syntax presented herein covers only a part of the available language constructs; for a more comprehensive introduction to the syntax and semantics of XPath, the interested reader is referred to [CD99, GKP03, GKP05, BBC+07, BK08].

Querying XML: XQuery For many real-world queries the expressive power of XPath is too limited, e.g. it does not support full joins or aggregation. Therefore, the W3C standardized a general-purpose query language for XML called XQuery [BCF+01], which is Turing-complete [Kep04] and supports XPath natively as part of its syntax.
XQuery is both a query language and a full-fledged functional programming language that operates on XML data. Consequently, the variables occurring in XQuery expressions are immutable, and the invocation of functions has no side effects. Furthermore, XQuery is strongly typed, which means that many errors can be detected by static analysis before execution. The type system is based on sequences of nodes and atomic values defined by XML Schema [PGM+09]. Another interesting consequence of the close ties to functional programming is that XQuery only has expressions, and thus the only possibility to store the result of an XQuery expression for later use is offered by the let clause, which binds the result of an expression to a temporary variable. The formal specification of XQuery, including the typing system, can be found at [DFF+07].

The body of an XQuery expression is composed of single expressions. The most frequently used single expressions are flwor (pronounced flower) expressions that take an ordered sequence of items (e.g. nodes selected by an XPath expression, or atomic values such as numbers, strings, etc.) as input. The (abstract) syntax of flwor expressions looks as follows:

    (for-clause | let-clause)+
    where-clause?
    order by-clause?
    return-clause
The for $var1 in exp1, ..., for $varn in expn clauses bind the variables $var1, . . . , $varn to the results of evaluating the corresponding expressions exp1, . . . , expn. Here, each expression expi+1 may reference the variables bound by the expressions exp1, . . . , expi. The result is a stream of tuples, which contains all matching variable combinations as illustrated in Example 2. In contrast, the let $var := exp clause does not iterate over the items in the result of exp, but binds the variable $var to the whole sequence. The where clause is used to prefilter the generated tuples, the order by clause sorts the tuples in ascending or descending order, and the return clause is evaluated once for each tuple. It generates for each tuple a result that is appended to an (initially empty) sequence, which is the overall result of the flwor expression.

Example 2 Consider this XQuery:

1   <connections> {
2     for $ap in //airport
3     for $from in //flight[@to=$ap/@code]/@from
4     let $vFrom := string($from)
5     let $vCode := string($ap/@code)
6     where $ap/@country = "P"
7     return <connection>
8              <from>{$vFrom}</from>
9              <to>{$vCode}</to>
10           </connection>
11  } </connections>
It iterates over the sequence of ($ap, $from) tuples, where at each step $ap is bound to an airport element (which represents a Portuguese airport), and subsequently $from is bound to the from attribute of a flight element. Thus, with n1 the number of results returned by the XPath in Line 2 and n2 the number of results returned by the XPath in Line 3, the loop is executed n1 ∗ n2 times. For each of these tuples, an XML pattern is filled in the return clause (the string() function used in Lines 4 and 5 returns the value of the referenced attributes, which is necessary since the XPath expression returns the whole attribute node as a name="value" pair):

    <connections>
      <connection>
        <from>FRA</from>
        <to>LIS</to>
      </connection>
      <connection>
        <from>FRA</from>
        <to>OPO</to>
      </connection>
      <connection>
        <from>MUC</from>
        <to>OPO</to>
      </connection>
    </connections>

Note that the result of an XQuery expression is not required to be an XML document.

Example 3 At first glance it seems that the flight from “FRA” to “LIS” only takes two hours and ten minutes. This is due to the different timezones of the airports; to determine the correct flight duration, this XQuery expression can be used:
1   declare namespace fn = "http://www.w3.org/2005/xpath-functions";
2   declare namespace xs = "http://www.w3.org/2001/XMLSchema";
3
4   (: Adjust departure and arrival time :)
5   let $deptEl := //flight[@from="FRA" and @to="LIS"]/departure
6   let $dept   := xs:time(fn:concat($deptEl/@hours, ":",
7                                    $deptEl/@minutes, ":00"))
8   let $arrEl  := //flight[@from="FRA" and @to="LIS"]/arrival
9   let $arr    := xs:time(fn:concat($arrEl/@hours, ":",
10                                   $arrEl/@minutes, ":00"))
11
12  (: Adjust timezones :)
13  let $fraTZ := //airport[@code="FRA"]/@timezone
14  let $lisTZ := //airport[@code="LIS"]/@timezone
15  let $fromTZ := if ($fraTZ < 0)
16                 then xs:dayTimeDuration(
17                        fn:concat("-PT", fn:abs($fraTZ), "H"))
18                 else xs:dayTimeDuration(
19                        fn:concat("PT", fn:abs($fraTZ), "H"))
20  let $toTZ   := if ($lisTZ < 0)
21                 then xs:dayTimeDuration(
22                        fn:concat("-PT", fn:abs($lisTZ), "H"))
23                 else xs:dayTimeDuration(
24                        fn:concat("PT", fn:abs($lisTZ), "H"))
25
26  return $arr - $dept + $fromTZ - $toTZ
In Lines 5 to 10 the departure and arrival times of the flight are extracted and converted to the XML Schema datatype xs:time (the fn:concat function provides string concatenation), and in Lines 13 to 24 the respective timezones of the originating and destination airport are extracted and converted to the xs:dayTimeDuration datatype. The computation of the overall duration in Line 26 is then automatically done with respect to the types of the bound variables, and the result is:

    PT3H10M
Note that the result itself is also of type xs:dayTimeDuration.

Being also a functional programming language, XQuery allows users to define their own functions. These functions need to be defined before they can be used, as in the programming language C [KR88]. The syntax for defining new functions is straightforward and makes use of the fact that XQuery expressions always have a return value (for XQuery functions whose body consists only of an XPath expression, i.e. does not have a return clause, the result of the expression in the function body is used as return value):

    declare function ns:function-name
        [($var1 [as xs:dataType], ..., $varn [as xs:dataType])]
        [as xs:dataType]
    {
        single expression using $var1, ..., $varn
    };
Example 4 Factorial is a well-known recursive function with a concise definition:

    factorial(n) = 1                       if n = 0
                   n ∗ factorial(n − 1)    otherwise

The mathematical definition directly translates to:

    declare function local:factorial($n as xs:integer) as xs:integer {
      if ($n = 0) then 1
      else $n * local:factorial($n - 1)
    };
    local:factorial(5)
Here, the whole function body (if (condition) then single-expression1 else single-expression2) is an XPath 2.0 expression and has no explicit return clause. As expected, the invocation local:factorial(5) returns 120. The next section moves up the Semantic Web layer cake one level with a brief introduction of the Resource Description Framework (RDF) and its related concepts.
2.3. RDF, RDFS & SPARQL

XML has its strength in the exchange of (almost) arbitrary languages. As such, it is also possible to assert facts about arbitrary concepts. However, the downside of this generality is that there is no unique way of representing meta data, i.e. different XML trees might convey the same intended meaning. Another problem is that XML documents represent trees, which is in contrast to the way information is typically interconnected, or spread over several different locations⁴. For this, the W3C standardized the Resource Description Framework (RDF), which allows one to state facts about arbitrary concepts in a unique (graph-based) formalism. The basic units are triples of knowledge that can be spread over several locations.

4 Compare http://www.w3.org/DesignIssues/RDF-XML.html for a more elaborate differentiation of RDF and XML following a similar train of thought.

Section 2.3.1 introduces the basic notions of RDF. The focus of its associated schema language RDFS, which is presented in Section 2.3.2, is to define vocabularies for specific domains, i.e. to define the relevant concepts (denoted as classes) and their relationships (realized as properties) in the domain. Section 2.3.3 concludes the presentation of this “layer” with a discussion of SPARQL, the query language for the RDF data model.
2.3.1. Resource Description Framework

RDF is based on the abstract concept of resources, which are identified by URIs. Here, the term resource is to be understood in its broadest possible meaning. It can denote a Web resource, i.e. an HTML page or a picture, but also an object in the “real” world, e.g. a person, or an abstract idea, such as philosophy. The idea is that RDF can be used to assert meta data about these resources, or – more generally speaking – to express knowledge about certain things.

Figure 2.3 The Air DB as RDF graph (excerpt): the airports airdb:BCI ("Barcaldine"), airdb:LRE, and airdb:RMA, connected by travel:flightTo edges, with travel:iataCode and travel:airportName literals, and – via travel:nearestRailwayStation and a blank node :b1 – the nearest railway station ("Alice Springs, Railway", code "XHW", distance "49")
Figure 2.3 shows an RDF graph that describes information about airports gleaned from the Air DB⁵. Each single directed, labeled edge in this graph corresponds to an RDF statement (also called a triple of knowledge) of the form (subject, predicate, object). The subject is the source of the edge, i.e. the resource about which something is stated; the predicate is the label of the edge and captures the relationship between subject and object; and the object is the destination of the edge. The notion of an RDF graph as a set of correlated triples plays an important role for the SPARQL query language (cf. Section 2.3.3), where the from clause identifies the relevant context graphs that are accessed in the query.

5 http://theairdb.com

Example 5 Consider the triple (airdb:BCI, travel:flightTo, airdb:LRE)⁶, which states that there is a flight from the resource airdb:BCI to the resource airdb:LRE. Without an associated RDFS vocabulary, the type of the subject and the object in this triple remains unclear (since it has not been explicitly stated in the graph).

More formally, the set of legal RDF statements is defined as

    stmtRDF ⊆ (U ∪ B) × U × (U ∪ B ∪ L),

where U is the set of all valid globally unique URIs, B the set of locally unique (i.e. local to the document) identifiers (called blank nodes), and L is the set of literals ranging over the XML Schema data types⁷. Note that predicates are also resources (i.e. are identified by a URI), which makes it possible to further describe them in the RDF formalism itself. One use case for blank nodes is to model complex data (e.g. a postal address), such as the railway information shown in Figure 2.3, where the properties of the railway station are all associated with the same blank node :b1. In contrast, the two airports that can be reached via flights (airdb:LRE and airdb:RMA) have been assigned globally unique URIs, with the consequence that if they are referred to in some other RDF document (also called an RDF graph) the two subgraphs can easily be combined into one graph. The idea is that all available RDF graphs together can be seen as one huge database of knowledge.

RDF Serialization
There exist different formalisms for serializing RDF graphs. In the remainder, the Turtle [BBL08] serialization will be used, because it allows for a concise presentation and is also used in the SPARQL query language (cf. Section 2.3.3)⁸. The following Turtle snippet sketches the syntax varieties:
    @prefix xs: <http://www.w3.org/2001/XMLSchema#> .

    res1 a cl ; prop1 res2 ; prop2 res3 , res4 .
    res4 prop3 [ prop4 "foo"^^xs:string ] .
The resource (or concept) identified by res1 is an instance of class cl (“a” is a shorthand for rdf:type), is related via prop1 to res2 and via prop2 to the resources res3 and res4; res4 in turn is related by property prop3 to an unnamed resource (i.e. a blank node) that has a property prop4 with the value "foo" of datatype xs:string.

6 Instead of writing URIs completely, they are usually abbreviated using namespaces similar to XML.
7 By convention, URIs are shown in ellipses and literals are enclosed in rectangles in the graph representation, cf. Figure 2.3.
8 There is also a serialization based on XML, called RDF/XML [Bec04].

The part of the Air DB graph with information pertaining to the three airports depicted in Figure 2.3 would thus be serialized like this:
    @prefix airdb:  <…> .
    @prefix travel: <…> .
    @prefix xs:     <http://www.w3.org/2001/XMLSchema#> .

    airdb:BCI travel:iataCode "BCI"^^xs:string ;
        travel:airportName "Barcaldine"^^xs:string ;
        travel:nearestRailwayStation [
            travel:name "Alice Springs, Railway"^^xs:string ;
            travel:railwayStationCode "XHW"^^xs:string ;
            travel:distanceToRailwayStation "49"^^xs:integer ] ;
        travel:flightTo airdb:LRE, airdb:RMA .

    airdb:LRE travel:iataCode "LRE"^^xs:string ;
        travel:airportName "Longreach"^^xs:string ;
        travel:flightTo airdb:BCI, airdb:BKQ, airdb:BNE .

    airdb:RMA travel:iataCode "RMA"^^xs:string ;
        travel:airportName "Roma"^^xs:string ;
        travel:flightTo airdb:BCI, airdb:BNE, airdb:CTL .
2.3.2. RDF Schema

In line with XML Schema, which is itself expressed as an XML document, RDF Schema (RDFS) [BG04] leverages RDF statements to describe vocabularies for specific application domains by specifying the relevant entities of the domain as classes and their attributes as properties. RDF graphs using such an RDF vocabulary can then specify individuals as instances of these classes, and by using an RDF reasoner, additional knowledge about the individuals can be inferred. Actually, RDFS is itself an RDF vocabulary that is associated with a fixed semantics based on model theory [Hay04], which resolves the chicken-or-egg dilemma and allows new vocabularies to be built using the properties and classes of the RDFS vocabulary. Yet, where the purpose of an XML schema is to validate an XML tree against a set of allowed elements and attributes with a correct nesting, RDFS behaves differently: the former imposes constraints on XML trees that need to be satisfied to be an instance of a specific XML markup language, while the latter defines the set of entities and their relationships of a specific vocabulary.

This is a list of the most prominent RDFS properties with a short description of their “built-in” semantics (the official RDFS reference can be found in [BG04]):

1. rdf:type (shortcut a): (i, rdf:type, cl) asserts that i is an instance of class cl,
2. rdfs:domain: (prop, rdfs:domain, cl) asserts that for all triples (ri, prop, rj) the subject ri is of type cl, i.e. the additional triples (ri, a, cl) are implied,
3. rdfs:range: (prop, rdfs:range, cl) asserts that for all triples (ri, prop, rj) the object rj is of type cl, i.e. the additional triples (rj, a, cl) are implied,
4. rdfs:subClassOf: Like in the object-oriented world, (cl1, rdfs:subClassOf, cl2) asserts that all instances of cl1 are also instances of cl2, e.g. if the graph contains the triple (s, a, cl1), then the triple (s, a, cl2) is implied,
5. rdfs:subPropertyOf: Analogously, (p1, rdfs:subPropertyOf, p2) asserts that all resources that are related by p1 are also related by p2, e.g. if the graph contains the triple (s, p1, o), then the triple (s, p2, o) is implied.

The most notable RDFS classes are the class rdfs:Class, which is the class of all classes, and the class rdf:Property, which is the class of RDF properties. It holds that rdf:Property is an instance of rdfs:Class, i.e. (rdf:Property, a, rdfs:Class). This is the RDFS vocabulary that accompanies the airport graph in Figure 2.3:
    @prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix travel: <…> .
    @prefix xs:     <http://www.w3.org/2001/XMLSchema#> .

    travel:flightTo rdfs:domain travel:Airport ;
                    rdfs:range  travel:Airport .
    travel:iataCode rdfs:domain travel:Airport ;
                    rdfs:range  xs:string .
    travel:airportName rdfs:domain travel:Airport ;
                    rdfs:range  xs:string .
    travel:nearestRailwayStation rdfs:domain travel:Airport ;
                    rdfs:range  travel:RailwayStation .
    travel:name rdfs:domain travel:RailwayStation ;
                    rdfs:range  xs:string .
    travel:railwayCode rdfs:domain travel:RailwayStation ;
                    rdfs:range  xs:string .
    travel:distance rdfs:domain travel:RailwayStation ;
                    rdfs:range  xs:string .
Example 6 Reconsider the triple (airdb:BCI, travel:flightTo, airdb:LRE) from Example 5. Given the above RDFS vocabulary, a reasoner can infer the additional triples

• (airdb:BCI, a, travel:Airport), and
• (airdb:LRE, a, travel:Airport)

by applying the inference rules (2) and (3) from above.

Concluding Remarks
Although not relevant in this thesis, RDF(S) also defines a default vocabulary for lists, sequences, and bags (cf. [MM04, BG04]). A vital characteristic of the languages in the Semantic Web layer cake that are associated with the Semantic layer is that they can all (RDF, RDFS, and OWL) be represented as sets of triples. This facilitates the creation of a comprehensive query language, called SPARQL, that covers all of these Semantic Web languages and is the focus of the next section.
2.3.3. SPARQL Query Language for RDF

SPARQL is the official query language for RDF graphs, which has been standardized by the W3C⁹. It has close ties to SQL [CB74, EMK+04] with respect to the division of a query into a select, a from, and a where clause:

    select ?var1 ... ?varm
    from graph1, ..., graphn
    where query-body

9 http://www.w3.org
Like in SQL, the first clause to be evaluated is from, which addresses the set of relevant input graphs, e.g. filenames or HTTP URLs where a Turtle or RDF/XML file is provided. By resolving all references to graphs listed in the from clause, the queried graph is computed as the merge of all referenced graphs. If the graphs to be merged contain blank nodes, they are standardized apart, i.e. new labels are generated for each blank node (cf. [Hay04]). All examples in this section use the Air DB example from Section 2.3.1 as graph, and the from clause is omitted. The where clause consists of graph patterns that bind variables, and the select clause performs a final projection on the result variable bindings.

SPARQL Syntax
The specification of the (preliminary) result variable bindings is done via graph patterns in the where clause, similar to QBE (Query by Example) [Zlo77], a query language for the relational model, where queries are specified by providing sample instances of the result relations, using shared variables to express joins over the relational schemata. The simplest graph patterns, basic graph patterns (bgp), are of the form

    bgpSPARQL ⊆ (V ∪ U ∪ B) × (V ∪ U) × (V ∪ U ∪ B ∪ L),

defined similar to stmtRDF with the addition that a fresh set of variables V (V ∩ (U ∪ B ∪ L) = ∅) is introduced (indicated by a leading “?” or “$” followed by a textual identifier) that can appear at every position in the RDF statement. The set of valid graph patterns is then defined inductively as follows:
• Every bgp is a graph pattern,
• if X and Y are graph patterns, then X and Y, X optional Y, and X union Y are graph patterns, and
• if X is a graph pattern and R is a predicate (usually, a comparison of the form ?X op ?Y, or ?X op const, or built by disjunction and conjunction over such expressions), then X filter R is a graph pattern.

The bgps are stated in the Turtle [BBL08] notation, which was discussed in Section 2.3.1 as a human-readable serialization format for RDF graphs (cf. Example 7).

SPARQL Semantics
SPARQL relies on graph pattern matching to extract variable bindings from the graph. Leveraging the simple triple representation, bgps are used to identify the desired edges in the RDF graph, where variables are bound to each matching occurrence in the graph. The result of evaluating such a bgp is a set of mappings (also called a mapping set). For instance, the evaluation of the bgp (?A, travel:iataCode, ?C) on the Air DB RDF fragment from above results in the following mapping set (mapping sets are in the remainder always depicted as tables):

    ?A          ?C
    airdb:BCI   "BCI"
    airdb:RMA   "RMA"
    airdb:LRE   "LRE"
The evaluation is then defined inductively on the structure of the graph patterns:

• a bgp is evaluated by determining all matching occurrences of the bgp in the graph and binding the associated variables accordingly,
• if X and Y are graph patterns, then X and Y means computing the join of the results of evaluating X and Y, X optional Y means computing the left outer join of the results of evaluating X and Y, and X union Y means computing the union of the results of evaluating X and Y,
• if X is a graph pattern and R is a built-in condition, then X filter R means restricting the result of evaluating X to the mappings that satisfy R.

In the remainder, SPARQL is further introduced with the help of examples, to provide the reader with the required background for its use in this thesis. For a more rigorous treatment of SPARQL, the publications [PAG06, PAG09, Sch09a] are recommended.

Example 7 Suppose a user wants to query for all airports with their IATA codes and their names. The following SPARQL query yields the desired result:
    prefix airdb:  <…>
    prefix travel: <…>

    select ?C ?N
    where {
      ?A travel:iataCode ?C .
      ?A travel:airportName ?N
    }
The result of the evaluation is the mapping set:

    ?C      ?N
    "RMA"   "Roma"
    "LRE"   "Longreach"
    "BCI"   "Barcaldine"
Note that not all variable bindings are returned, but only those that appear in the select clause. Also, the query introduces “.”, the syntactic shortcut for the and operator. In principle, the full power of the Turtle syntax can be used in the body of the where clause, and variables can appear at subject, predicate, or object position.

Example 8 The following SPARQL query constrains the results of the query in Example 7 to those where the name of the airport is "Barcaldine":
    prefix airdb:  <…>
    prefix travel: <…>

    select ?C ?N
    where {
      ?A travel:iataCode ?C .
      ?A travel:airportName ?N
      filter (?N = "Barcaldine")
    }
The single result is as expected:

    ?C      ?N
    "BCI"   "Barcaldine"
Example 9 The query in Example 7 only returned the IATA codes and names of the airports. But for the "Barcaldine" airport there is also information available about its nearest railway station. This SPARQL query additionally returns the code of the nearest railway station, if one is present:
    prefix airdb:  <…>
    prefix travel: <…>

    select ?C ?N ?RW
    where {
      ?A travel:iataCode ?C .
      ?A travel:airportName ?N
      optional {
        ?A travel:nearestRailwayStation
             [ travel:railwayStationCode ?RW ]
      }
    }
The result is similar to the one in Example 7, with the additional railway code of the station nearest to the airport "Barcaldine":

    ?C      ?N             ?RW
    "RMA"   "Roma"
    "LRE"   "Longreach"
    "BCI"   "Barcaldine"   "XHW"
Note that the variable ?RW is not bound for the first two results.

SPARQL vs. RDFS
Currently, SPARQL operates directly on the RDF level, as indicated in Figure 2.1. The most notable consequence is that a SPARQL query is evaluated only on the explicitly stated RDF triples. The downside of this behavior is that queries might behave unexpectedly in the context of associated RDFS vocabularies, e.g. when querying for implicitly stated knowledge. To achieve the expected behavior, SPARQL evaluation can be combined with an RDFS reasoner; the query sketched below illustrates the difference. SPARQL 1.1 [HS10] addresses the issues concerning RDFS under the concept of an entailment regime.
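As a hedged illustration (the travel: namespace URI below is a placeholder, since the actual URI is not fixed in this excerpt), consider a query for all resources that are explicitly typed as airports:

    prefix travel: <http://example.org/travel#>   # placeholder namespace URI

    select ?A
    where { ?A a travel:Airport }

Evaluated directly on the Air DB graph from Section 2.3.1, the query returns an empty mapping set, since that graph states no rdf:type triples explicitly. Under an RDFS entailment regime with the vocabulary from Section 2.3.2, the rdfs:domain and rdfs:range declarations of travel:flightTo let airdb:BCI, airdb:LRE, airdb:RMA, and the other flight destinations match.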
This completes the presentation of foundational technologies. Chapter 3 introduces the Semantic Web framework MARS, which builds on top of these technologies and functions as middleware for the proof-of-concept implementation.
Chapter 3. The MARS Framework

    He who is ignorant of other languages is ignorant of his own.
    – Johann Wolfgang von Goethe (August 28, 1749 – March 22, 1832)
Contents
3.1. Introduction . . . 35
3.2. Overview: The MARS Framework . . . 36
     3.2.1. Hierarchical Language Structure . . . 37
     3.2.2. State, Communication, and Data Flow . . . 42
3.3. The Action Component . . . 45
     3.3.1. Atomic Constituents . . . 45
     3.3.2. Operators . . . 51
3.4. Technical Realization . . . 53
3.1. Introduction

ECA rules (event-condition-action rules) have been extensively studied in the database community in the context of active databases [ANC96]. They provide a high-level abstraction of active rules of the form “on event if condition do action”, with a modularization into clean concepts with a clear declarative semantics. There is also a close connection between the different parts of an ECA rule and its operational realization, which can be separated into three distinct evaluation phases:

1. Event: In the event phase all events are matched and collected; afterwards,
2. Condition: it is checked whether the events satisfy some condition, and
3. Action: a set of associated actions is executed.

The condition phase can, on closer examination, be divided into two parts: first, additional queries can be evaluated to gather context information, and then a test is evaluated on top of the enriched data, as shown in Figure 3.1¹.

1 Pictures and examples in this chapter are adapted from [MAA05b, BFMS08].
Figure 3.1 Evaluation phases of ECA rules: in the event phase, events are collected; in the condition phase, a (static) query and a (dynamic) test are evaluated; in the action phase, the (dynamic) action is carried out.
In the beginning, reactive behavior was only supported by specialized systems, such as [CCCR+90]. Nowadays, active rules can be found in the disguise of SQL triggers in most off-the-shelf relational database systems. The main problem with SQL triggers is that they only operate on a local view of the data, e.g. only information about the immediate database update can be used in the condition part. As a further limitation, the updates themselves are given procedurally, i.e. reasoning about the effects of the action part of the trigger statement would require introspection of the program code. Lastly, everything has to be encoded in SQL, although there already exist proposals specialized for certain aspects of ECA rules, e.g. event algebras (cf. [CKAK94, CM94]).

The MARS (Modular Active Rules for the Semantic Web) framework has been developed since 2005 as a contribution to the EU Network of Excellence (NoE) “Reasoning on the Web with Rules and Semantics” (REWERSE) [MAA05a, MAA05c, AEM09]. It provides a “global architecture and the general markup principles of a General Framework for Behavior in the Semantic Web that is based on the idea of ECA rules over heterogeneous event, condition, and action formalisms” [BFMS06] and overcomes the above-mentioned limitations of SQL triggers.

Chapter Structure
The next section gives an overview of the MARS framework with a special focus on the embedding of heterogeneous event, condition, and action languages, as well as the representation of state in MARS and the data flow between MARS components and to/from the processors of the constituents via logical variables. Section 3.3 introduces the CCS process algebra as a sample language for the action component, and the embedding of different languages as atomic constituents of the process. The presentation of the technical realization in Section 3.4 wraps up the introduction of the MARS framework.
3.2. Overview: The MARS Framework The strength of MARS is that it allows the interplay of heterogeneous event, condition, and action formalisms. This is achieved by a global architecture that leverages XML’s inherent tree structure for the representation of algebraic component languages, which are associated with a language processor as elaborated in Section 3.2.1. Furthermore, the data flow between processors of these component languages is provided by a high-level data flow model based on logical variables as presented in Section 3.2.2.
3.2.1. Hierarchical Language Structure

MARS builds on the concept of an ECA rule as depicted in the UML diagram in Figure 3.2. The rule model describes the constituents of an ECA rule, where the condition component is subdivided into a query and a test component. The languages model associates with each component a specific (algebraic) language, making the approach parametric in the component languages. Thus, users can write ECA rules using different languages for each component. Each language has a URI that uniquely identifies it and that is used to find the associated language processor as described below.

Figure 3.2 MARS components and associated languages: the rule model (an ECA rule with an event component, an optional condition component consisting of a query component and an optional test component, and an action component) and the languages model (each component uses an event, query, test, or action language with a name, URI, and syntax, implemented by a processor as a service or plugin)
The UML diagram in Figure 3.2 directly translates to an XML markup for ECA rules:

    <eca:Rule xmlns:eca="…">
      <eca:Event>  Event specification  </eca:Event>
      <eca:Query>  Query specification  </eca:Query>
      <eca:Test>   Test specification   </eca:Test>
      <eca:Action> Action specification </eca:Action>
    </eca:Rule>

Here, the event, condition, and action specifications use XML markup elements of their
respective languages’ namespaces. This way, namespaces are used to relate the components given in certain languages to the corresponding language processors; the structure of ECA rules is fixed by an associated nearly-DTD.

Algebraic Languages
Algebraic languages consist of (i) atomic elements, and (ii) composite expressions inductively defined over (i) using operators. They can be represented as an abstract syntax tree, where each inner node is an operator of the target language and the respective child nodes are the associated operands. By using only algebraic languages for the ECA rule components, the inherent tree structure of XML documents, combined with the identification of language fragments by associated namespace URIs, facilitates an (almost) arbitrary interplay of heterogeneous languages at different nesting levels, e.g. each subexpression in the abstract syntax tree could be given in a different language (as indicated by the ANY-other-language content model of the event, query, test, and action components in the nearly-DTD of ECA rules).

The embedding of heterogeneous component languages is further showcased for the event component: event algebras elegantly capture the notion of events and have the benefit of a concise theory with existing algorithmic approaches for the detection of events. Stemming from research in active databases, a powerful and well-known representative is the Snoop event algebra of the Sentinel system [CKAK94, CM94], which underlies the sample implementation of the MARS event component [BFMS08]. The idea of event algebras is to describe composite events by combining atomic events with appropriate operators, e.g. “e1 and/or e2”. These operators can be nested in a tree structure to form event expressions that capture more complex event patterns; the leaf nodes are constituted by the atomic events. The evaluation of an event expression, i.e. the composite event detection, is done incrementally: when a match for the overall event expression is determined, the sequence of all “contributing” events is reported and the (composite) event is detected. Example 10 gives some intuition for the evaluation of composite events.

Example 10 The event expression E1 := and(sequence(e1, e2), sequence(e3, e4)) is detected if the event e2 occurs after e1, and the event e4 occurs after e3. This is intuitively the case for the event sequence e1, . . . , e2, . . . , e3, . . . , e4, but also for the sequences e3, . . . , e1, . . . , e2, . . . , e4, or e1, . . . , e3, . . . , e2, . . . , e4.
The description of events in Example 10 was focused on the detection of composite events. Yet, in reality events can be rather complex, e.g. they might have parameters (like delayed(LH123) to indicate that the flight “LH123” is delayed). In the MARS framework, the Snoop event algebra is thus extended by atomic event specifications that may contain logical variables, which will be presented in Section 3.2.2.

Atomic Event Specification
Each composite event can be represented as an event component tree (cf. Example 12), where the leaves are the atomic events. These events are communicated as XML fragments in MARS (e.g. an element reporting a delayed flight), and for the detection of such atomic events, query languages that state conditions against the XML fragment are used, such as XML-QL [DFF+99], which allows atomic events to be specified by a pattern and fragments of the event to be bound to variables.
1 2 3 4 5 6 7 8 9
On registration of this event, the ECA rule processor registers the event component at an appropriate Atomic Event Matcher (AEM). Now, let an atomic (XML) event reporting the delay of flight “LH123” be raised by an airport traffic monitoring system; it is propagated to the relevant AEM (here, the XML-QL processor), which matches the specification and returns the tuple (flight/“LH123”) to the ECA engine. Afterwards, the action component of the ECA rule is processed (since the event is detected), and the processing starts with the initial tuple (flight/“LH123”).

Example 11 presented how events are detected on an architectural level and also highlighted the evaluation of language fragments in the MARS framework. Example 12 builds on the therein introduced notion of an AEM and demonstrates the interaction of the Snoop algebra operators (in namespace snoop) with the detection of atomic events (handled by the XML-QL processor associated with the namespace xqm).

Example 12 Consider the following fragment of an ECA rule.
It only shows the event part of the rule, which waits for the “arrival” of two events, foo:A and foo:B. The XML tree (where attributes have been omitted) of the language fragment in the snoop namespace looks like this:

    snoop:And
    ├── xqm:Event
    │     └── foo:A
    └── xqm:Event
          └── foo:B
On registration of this composite event, the ECA rule processor registers the event component at the Snoop Composite Event Detection (CED) engine, which in turn registers the atomic event specifications at the XML-QL AEM. When the foo:A event is detected by the XML-QL AEM, it propagates the event to the Snoop engine, which stores the detected event. Some time later, the foo:B event is detected by the XML-QL AEM and also propagated to the Snoop engine. Now, the composite event is detected and reported to the ECA engine. The ECA engine then continues processing the ECA rule with the condition part, if present, or otherwise directly with the action part. (The complete syntax (including the XML markup) of the Snoop event algebra and its semantics, as well as the handling of logical variables and the more intricate details of event detection, are not relevant for the understanding of this thesis and can be found in [BFMS08].)

Opaque Embedding
Sometimes it makes sense to embed language fragments as opaque leaf elements (i.e. for program code, mainly in the case of queries and tests, but also for actions), which is shown in Example 13. This style of embedding is called opaque, since the subtree below the opaque element is not an XML tree adhering to the MARS principles, but can be a code fragment in an arbitrary programming language that needs to be parsed and processed accordingly.

Example 13 For this ECA rule fragment, only the test part is relevant.
It contains an opaque test in XPath, wrapped in an eca:Opaque element, which is satisfied when the value of the input variable distance is smaller than 1000.

Signatures
As indicated by Example 13, it is not sufficient to determine the relevant language processor for a language fragment; in addition, the required input variable bindings need to be sent to the language processor as well. For this, the MARS framework manages the variable usage characteristic – in the remainder called signature – of each embedded language fragment. The signature is a profile that captures which variables are used, which have to be supplied as input (logic programming: negative use), and which can be bound by the evaluation of the language fragment (logic programming: positive use). It can either be provided explicitly by the user, as illustrated next, or can be computed from the fragment as discussed in Section 3.4. The salient aspects of signatures are captured by a nearly-DTD specification of dedicated XML attributes.
The explicit specification of the signature can be provided as part of the XML subtree of the actual language fragment. Here, the eca:name attribute is used to specify the global name of the variable. If the language fragment expects another label for this variable, it can be temporarily relabeled to match the signature of the language fragment with eca:use. Once annotated, the variable bindings are handled automatically by the framework and are relabeled as indicated for each invocation (cf. Example 14). Recall that if a fragment is wrapped in an eca:Opaque element, it is mandatory to provide the respective input and output variable characteristics (cf. Example 13), since no automatic analysis takes place for opaque language fragments. For a more functional style, where only one result is returned, the output can be directly assigned to a variable by adding a dedicated attribute to the respective top-level element of the language fragment.
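For illustration only, the following sketch shows how such an annotation might look for a query whose global input variable p is locally expected as actor, and whose local result movie is to be exported as m; the eca:Variable element and its placement are hypothetical, only the eca:name and eca:use attributes are taken from the description above:

    <ccs:Query>
      <dwql:Query>
        <!-- hypothetical annotation elements; eca:name gives the global
             variable name, eca:use the label expected by the fragment -->
        <eca:Variable eca:name="p" eca:use="actor"/>
        <eca:Variable eca:name="m" eca:use="movie"/>
      </dwql:Query>
    </ccs:Query>

Example 14 below walks through the renaming that such an annotation triggers.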
To sum up, to “marsify” a specific language, it suffices to provide an XML markup² and an associated processor capable of understanding the language fragment and MARS-style variable bindings, which are discussed next.
2 MARS also offers the possibility to specify the respective languages as RDF graphs, which is beyond the scope of this thesis (cf. [MAA05c, Sch09b]).

3.2.2. State, Communication, and Data Flow

For combining heterogeneous component languages, the data model of the MARS framework must be as general as possible. Therefore, the data flow between processors of the component languages is provided by logical variables in the style of deductive rules, production rules, etc.: the state of computation of an ECA rule (and a CCS process) is represented by a set of tuples of variable bindings, i.e. every tuple is of the form
t = {v1/x1, . . . , vn/xn}, with v1, . . . , vn variables and x1, . . . , xn elements of the underlying domain (which is the set of strings, numbers, and XML literals). Note that not every tuple necessarily binds the same variables; each state is associated with a set of active variables w1, . . . , wm (the union of the variables of all tuples), and every tuple is a partial tuple over these, i.e. t = {wi1/x1, . . . , win/xn} where {i1, . . . , in} ⊆ {1, . . . , m}. Thus, for given active variables w1, . . . , wm, such a state can be seen as a relation whose attributes are the names of the variables. In relational terms, the non-bound variables are considered to be null values. The approach only minimally constrains the embedded languages: all paradigms of query languages can be used, following a functional style (such as XPath/XQuery), a logic style (such as Datalog [AHV95] or SPARQL), or both (F-Logic [KL89, KLW95]).

Given this data model, the evaluation of ECA rules can be written in the style of production rules (cf. Figure 3.3):

    action(v1, . . . , vn, . . . , vk) ← event(v1, . . . , vn),
                                       query(v1, . . . , vn, . . . , vk),
                                       test(v1, . . . , vn, . . . , vk)

The successful detection of an event binds the initial set of variables v1, . . . , vn, which are extended by the additional variables vn+1, . . . , vk in the query component; the test component then selects the appropriate set of tuples matching additional constraints. Afterwards, the action component is executed with the variables v1, . . . , vk.
Figure 3.3 Use of variables in an ECA rule: the event component binds v1, . . . , vn; the query component, evaluated over v1, . . . , vn (the join variables), binds vn+1, . . . , vk; the action component uses v1, . . . , vk. The evaluation communicates with a (composite) event detection engine, a query engine, and an action/process engine.
Processing (Preview)
The ECA engine processes the rule components in a homogeneous way by submitting the component language fragment together with the actual variable bindings to an appropriate service. The service returns the result variable bindings, which are combined as described above with the current state of the rule. The details of the handling of variable bindings, with a special focus on the transparent management in a relational database management system, are discussed in [May09].

Consider the case of an external query, where (vij+1, . . . , vk) ← q(vi1, . . . , vij) is the signature of the query and q the unique id of the query. Then, qin := (vi1, . . . , vij) are the input variables (the input signature) and qout := (vij+1, . . . , vk) are the output variables (the output signature). Before execution of q, the framework automatically computes the projection π[qin] and sends the input variable bindings to the query engine. The query engine processes the query and returns as answers tuples over the bound input variables with their associated bound output variables qin ∪ qout. If there is no result for a specific input tuple, the semantics is up to the query processor; typically, the input variables for which no results have been found are not returned to the caller. In any case, the resulting new variable bindings are joined with the variable bindings before the execution, resulting in the set of active variables w := {v1, . . . , vn, . . . , vk}, where not necessarily all active variables are bound in each tuple. Figure 3.4 illustrates the accompanying data flow between the MARS framework and the external query engine. The input and output signature can either be declared explicitly by the designer of an ECA rule/CCS process (including local renaming of variables) in the syntax described above, or can be analyzed automatically (cf. Section 3.4).
Figure 3.4 Shipping of variable bindings to the query engine and back: from the current variable bindings over v1, . . . , vn, the projection π[vi1, . . . , vij] onto the relevant input variables is sent to the query engine; the query answers over vi1, . . . , vij, . . . , vk are joined (⊲⊳) with the current bindings, yielding the new variable bindings over v1, . . . , vn, . . . , vk.
The next section covers the action component of the MARS framework. The current implementation of the MARS framework proposes the CCS process algebra [Mil83] as sample language for this component. Actually, ECA rules and CCS processes are equally expressive; in particular, modeling the former by the latter is trivial, since the atomic constituents of a process can also be events, queries, or tests, and the data flow through the process is analogous to the data flow between the ECA components.
3.3. The Action Component

Normally, the action part in ECA rules only supports simple action patterns, such as sequences of actions. For many more involved real-world problems this falls short, and a more powerful action language is desirable. The MARS framework thus employs the process algebra Calculus of Communicating Systems (CCS) [Mil83, Mil89] as sample language for the action part. Thereby, more intricate action workflows can be specified, including but not limited to conditional execution of alternative branches, processes “receiving” events (cf. the event part of ECA rules), and recursive processes. Note that similar behavior could be achieved by applying multiple, simpler ECA rules, but the modeling would be more cumbersome and also prone to unintended behavior.

In the following, Section 3.3.1 presents the atomic constituents supported by the MARS CCS implementation with an accompanying discussion of the XML markup of each constituent. Afterwards, Section 3.3.2 presents the CCS operators with their XML markup. As mentioned above, the XML markup not only serves as a language serialization format: the namespaces of different XML fragments mark the language borders and associate the described language fragments with their respective processors.
3.3.1. Atomic Constituents

All the ECA components can be used as atomic constituents in MARS CCS processes, i.e. the atomic constituents can either be events (which only play a minor role in this thesis), queries, tests, or actions. In the following, each of the atomic constituents is introduced, before the operators of the MARS CCS variant are discussed in Section 3.3.2.

Events
Events can also be used as atomic constituents in CCS processes, where the evaluation is similar to their use in ECA rules as introduced above. If, during the evaluation of a CCS process, an event specification is encountered (specified by ccs:Event), execution stops until the event is detected, and continues afterwards with the newly bound variables. From this perspective, events are similar to queries, which are presented next. The associated XML markup is defined by a nearly-DTD specification: each event specification is wrapped in a ccs:Event “container” element, and the actual XML fragment that describes the respective event (language) is provided as a child element of the ccs:Event element (similar to the way events are used in ECA rules, as shown in Example 11).
Queries
The focus of this thesis is on combining heterogeneous Web data sources in query workflows. To achieve this goal, a necessary prerequisite is access to these data sources and – from a more abstract point of view – a uniform query interface (in the remainder called a view) to the data sources. To use these views in the query component of CCS processes (and ECA rules) in the MARS framework, they need to be addressable by a marsified query language that has an associated query processor (cf. Figure 3.2). Part III introduces the details of the description and access of Web data sources, and presents for this purpose two MARS-style query languages: the Deep Web Query Language (DWQL) (geared towards Deep Web data sources), and the Web Service Query Language (WSQL) (geared towards already machine-accessible Web data sources). The incorporation of specific query languages as atomic constituents of CCS processes is achieved by the associated (top-level) XML markup, defined by a nearly-DTD specification.
Each query is wrapped in a ccs:Query element. The actual XML fragment that describes the respective query (language) is then provided as child element of the ccs:Query element, as illustrated in Example 14, which also highlights the data flow between the MARS framework and the query processor.

Example 14 The Internet Movie Database (IMDB)³ is a source for information concerning movies and actors. Assume a CCS process that was designed to determine the relationship between Hollywood and US politicians. It sequentially queries different Deep Web data sources, among them the (online) query movie ← actor2movie(actor), which – by accessing the IMDB – maps actors to their movies. The corresponding XML markup of this query is given in the MARS-compatible Deep Web Query Language (DWQL); details about this query language will be provided in Chapter 5.
3 http://www.imdb.com
The surrounding ccs:Query element specifies that the CCS processor has to handle the contained content as a query. By looking at the associated namespace of the child element, the MARS framework determines the associated query processor for the DWQL language and the communication details. Now, let the variables bound so far represent the following three US politicians:

    {(p/“Arnold Schwarzenegger”), (p/“Ronald Reagan”), (p/“Jimmy Carter”)}.

The explicitly given signature of the query actor2movie specifies that the input tuples need to be renamed, and the DWQL XML fragment is sent to the DWQL processor along with the renamed variable bindings

    {(actor/“Arnold Schwarzenegger”), (actor/“Ronald Reagan”), (actor/“Jimmy Carter”)}.

The DWQL processor then executes the query and returns the result variable bindings for the politicians that also had roles in Hollywood movies (i.e. “Arnold Schwarzenegger” and “Ronald Reagan”); no movie results are found for the former US president “Jimmy Carter” (and the processor does not return this additional input tuple):

    {(actor/“Arnold Schwarzenegger”, movie/“Terminator I”),
     (actor/“Arnold Schwarzenegger”, movie/“Terminator II”),
     ...
     (actor/“Ronald Reagan”, movie/“Prisoner of War”)}.

These result variable bindings are automatically renamed by the MARS framework to

    {(p/“Arnold Schwarzenegger”, m/“Terminator I”),
     (p/“Arnold Schwarzenegger”, m/“Terminator II”),
     ...
     (p/“Ronald Reagan”, m/“Prisoner of War”)}

and joined with the original variable bindings

    {(p/“Arnold Schwarzenegger”), (p/“Ronald Reagan”), (p/“Jimmy Carter”)},

resulting in the variable bindings (after the call to the query):

    {(p/“Arnold Schwarzenegger”, m/“Terminator I”),
     (p/“Arnold Schwarzenegger”, m/“Terminator II”),
     ...
     (p/“Ronald Reagan”, m/“Prisoner of War”)}.

The “waiting” tuple (p/“Jimmy Carter”) has been removed by the join with the result variable bindings (cf. Figure 3.4), and the process continues with these new variable bindings.

Tests
Abstractly, tests are queries that only return true or false as results and can be characterized by the signature vout ← t(v1, . . . , vk), with tin := (v1, . . . , vk) as input variables and tout := (vout) as single output variable, which is either true or false. Tests are evaluated for each tuple in the variable bindings; if a tuple satisfies the test – i.e. the result of evaluating the test is true – it is retained, otherwise it is removed from the variable bindings. This is the default behavior (also called tuple mode); alternatively, tests can be specified that check whether at least one tuple in the variable bindings satisfies the test (some/exist mode), whether not a single tuple satisfies the test (none/not exist mode), or whether all tuples satisfy the test (all/every mode). If the mode is different from tuple and the test is satisfied, all tuples in the tested variable bindings survive; otherwise, all tuples are removed. The basic tests (e.g. =, >, <) are provided by operators in the ops namespace; the associated nearly-DTD fixes the involved namespaces and the default quantifier:
"http://www.semwebtech.org/languages/2006/ccs#" xmlns:ops CDATA #FIXED "http://www.semwebtech.org/languages/2006/ops#" ccs:quantifier CDATA "tuple" >
The evaluation mode of tests is specified by the attribute ccs:quantifier (defaulting to tuple). The operator ops:IsNull returns true if a variable is not bound in a tuple, and the operator ops:IsNotNull does the opposite.

Example 15 Reconsider Example 13, which contains an opaque test in XPath that returns true if the input variable distance is smaller than 1000. Let the following set of three tuples be bound before the evaluation of the test:

    {(. . . , distance/“1200”, . . . ), (. . . , distance/“950”, . . . ), (. . . , distance/“200”, . . . )}

The test is evaluated for each tuple, and only the ones satisfying the criterion (i.e. distance < 1000) are retained. Thus, the process continues after the test with the resulting set of two tuples that “survive” the test:

    {(. . . , distance/“950”, . . . ), (. . . , distance/“200”, . . . )}

The effect of the other quantifier modes on the same input is illustrated below.
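To make the quantifier modes concrete, this brief walkthrough applies the same test distance < 1000 to the three tuples above under the non-default modes:

• some/exist: at least one tuple (e.g. the one with distance 200) satisfies the test, so all three tuples survive;
• none/not exist: since some tuple satisfies the test, the condition fails and all tuples are removed;
• all/every: the tuple with distance 1200 violates the test, so all tuples are removed.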
Actions
From the MARS point of view, actions are queries that can be characterized by the signature ∅ ← a(v1, . . . , vk), with ain := (v1, . . . , vk) as input variables. They have an empty output signature and can be either confirmed or fire-and-forget, depending on the implementation of the external action service. In the former case, the CCS process waits for a synchronous or asynchronous confirmation that the action was executed; in the latter case, the associated XML fragment is sent to the processor and the process continues right away. Confirmed actions are useful in cases where actions need to be executed in a certain order (cf. Example 16).
The XML markup is almost identical to the one shown for queries above. Since actions are considered to be the default atomic constituent in CCS, the surrounding ccs:Action element can be omitted.

Example 16 The prototypical example for a sequence of actions is the debit-deposit scenario, where a person transfers money from one account to another. Typically, the account from which the money is transferred first needs to be debited with the amount, and only afterwards is the money deposited to the “receiving” account. The corresponding CCS process is specified as a sequence in which the deposit action directly follows the debit action.
Now, the semantics of the process depends on the banking action service provider. By default, actions are asynchronous and fire-and-forget, i.e. in this case it might happen that the deposit occurs before the debit, if that action can be performed faster. Yet, if the service provider uses confirmed actions, the debit and deposit actions happen in the expected order. (Obviously, in a real-world scenario, the process would be much more involved, including the notion of a transaction.)
3.3.2. Operators

A CCS algebra with a carrier set A (its atomic constituents, as described in Section 3.3.1) is defined as follows (here, the asynchronous variant of CCS that allows for implicit delays is considered), using a set of process variables:

• every a ∈ A is a process expression,
• with X a process variable and P a process, X := P is a process definition and a process, and X is a process expression (equivalently, recursive processes can be described by a fixpoint operator; here, the process definition variant is chosen),
• with a ∈ A and P a process expression, a.P is a process expression (prefixing; sequential composition),
• with P and Q process expressions, Sequence(P, Q) is a process expression (sequential composition),
• with P and Q process expressions, P|Q is a process expression (concurrent composition),
• with I a set of indices and Pi (i ∈ I) process expressions, Σi∈I Pi (binary notation: P1 + P2) is a process expression (alternative composition),
• 0 is a process that stops execution.

The (operational) semantics is defined in [Mil83] by transition rules that immediately induce an implementation strategy. By carrying out an action, a process changes into another process, as shown in Figure 3.5.
Figure 3.5 CCS transition rules

    a.P --a--> P
    if P --a--> P′, then (P, Q) --a--> (P′, Q)
    if Pi --a--> Pi′, then Σi∈I Pi --a--> Pi′    (for i ∈ I)
    if P --a--> P′, then P|Q --a--> P′|Q
    if Q --a--> Q′, then P|Q --a--> P|Q′
    if X := P and P --a--> P′, then X --a--> P′
51
52
1 2 3 4 5 6
    sequence (event (input(n)),
              concurrent (
                sequence (test (n < 0), action (print("forget it"))),
                sequence (test (n ≥ 0), query (m ← sqrt(n)),
                          action (print("sqrt: " + m)))),
              action (beep))
The process will wait for an input event with one argument, e.g., input(4). Once this event is detected, two alternative branches are started in parallel. The first one tests whether n < 0 and, if so, shows its disappointment. The second one tests whether n ≥ 0 and, if so, assigns √n to a new variable m and outputs it. In both cases, the process will then continue with a beep action and finish. In this example, there are two variables that are bound to values, which make up the data flow. The XML serialization of the CCS syntax adheres to an associated nearly-DTD (details omitted); a hedged sketch of how this example might serialize is given below.
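Purely as an illustrative sketch – the operator element names ccs:Sequence and ccs:Concurrent are assumed by analogy to the documented ccs:Event, ccs:Query, ccs:Action, and ccs:CallProcess elements, and the embedded language fragments are elided – the process of Example 17 might serialize along these lines:

    <ccs:Sequence xmlns:ccs="http://www.semwebtech.org/languages/2006/ccs#">
      <ccs:Event> <!-- input event, binds variable n --> </ccs:Event>
      <ccs:Concurrent> <!-- assumed operator element name -->
        <ccs:Sequence>
          <ccs:Test> <!-- opaque test: n < 0 --> </ccs:Test>
          <!-- action: print("forget it"); the ccs:Action wrapper may be omitted -->
        </ccs:Sequence>
        <ccs:Sequence>
          <ccs:Test> <!-- opaque test: n >= 0 --> </ccs:Test>
          <ccs:Query> <!-- query: m <- sqrt(n) --> </ccs:Query>
          <!-- action: print("sqrt: " + m) -->
        </ccs:Sequence>
      </ccs:Concurrent>
      <!-- action: beep -->
    </ccs:Sequence>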
The XML markup for the event, query, action, and test operands has already been discussed in Section 3.3.1. Each operator can start with a process definition, which can later be called with appropriate variable bindings. For defining recursive processes, the ccs:CallProcess element is placed inside a ccs:ProcessDefinition subtree (recursion will be discussed further in the context of query workflows in Chapter 7). Note that opaque language fragments, and generally fragments from other languages, can also function as operands (either directly – interpreted as actions – or as events, queries, or tests – then wrapped in their respective “container” elements).
3.4. Technical Realization

The language borders between the CCS process level and other (component) languages manifest themselves in the change of the namespaces associated with a language fragment. Spoken in RDF terms, the namespace URI identifies the language as an RDF resource. Building on this interpretation, MARS manages the meta data of each language processor in an RDF graph, which is at the core of the Language and Service Registry (LSR) infrastructure service (cf. [FMS08]).

The communication between language processors and the CCS engine is depicted in Figure 3.6: if the CCS engine encounters a leaf expression (i.e. an atomic constituent), the Generic Request Handler (GRH) is provided with the available input variable bindings and the respective XML fragment. The GRH then first queries the LSR for the actual URL of the language processor and its communication details. Then – depending on whether the processor supports the automatic analysis of its own component expressions – the component expression is sent to the language processor, which returns the corresponding input and output signature. Alternatively, the signature can be provided explicitly (cf. Example 14 on page 46). Now, the GRH can perform the projection on the required input variables, which are then sent along with the component expression to the processor; the processor evaluates the component expression and (optionally) returns the result variable bindings (via the GRH) to the CCS engine. The CCS operators are handled directly by the CCS engine, which keeps track of the overall data flow and variable bindings. The relation between the GRH and the LSR, with a special focus on the meta data stored in the LSR, is discussed next.

GRH & LSR

The GRH serves as the communication switch between language processors. To avoid bottlenecks, the GRH is not a separate service, but is provided as a MARS Java class, which can be included in any engine that deals with embedded fragments (i.e.
Figure 3.6 MARS CCS Architecture and Communication
[Diagram: the CCS engine (processing CCS operators and handling leaf expressions via the GRH) passes embedded fragments and input variable bindings to the Generic Request Handler; the GRH determines service URLs and communication parameters from the Languages & Service Registry and forwards component expressions with input bindings to the query engine for DWQL, the query engine for WSQL, and the engine for the generic datatype CGDT, which access Deep Web and Surface Web data sources; resulting variable bindings flow back.]
every language processor usually has its own GRH). As mentioned above, the LSR is realized as an RDF graph describing the available services with their characteristics. The GRH thus is initialized with the current LSR state at startup time so that the communication with the LSR can be realized as local SPARQL queries. This is an excerpt of the description of the CCS process algebra in the LSR:
 1  @prefix lsr:  <…> .
 2  @prefix mars: <…> .
 3
 4  <…> a mars:ProcessAlgebra;
 5     mars:name "CCS" ;
 6     mars:hasRDF <…> ;
 7     mars:hasDTD <…> ;
 8     mars:is-implemented-by <…> .
 9
10  <…> a mars:ActionService;
11     lsr:has-task-description [
12        a lsr:TaskDescription;
13        lsr:describes-task
14           <…> ;
15        lsr:provided-at <…> ;
16        lsr:Reply-To "body" ;
17        lsr:Subject "body" ;
18        lsr:mode "synchronous" ;
19        lsr:input "element execute" ;
20        lsr:variables "*" ];
21
22     lsr:has-task-description [
23        a lsr:TaskDescription;
24        lsr:describes-task <…> ;
25        lsr:provided-at
26           <…> ;
27        lsr:Reply-To "n.a." ;
28        lsr:Subject "n.a." ;
29        lsr:input "item" ;
30        lsr:variables "n.a." ].
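As noted above, the GRH holds a local copy of the LSR state and evaluates such lookups as local SPARQL queries. The following sketch is an added illustration of this idea, not actual MARS code: it uses the Apache Jena API as an example RDF toolkit, and the file name and placeholder namespace are assumptions (the real MARS/LSR namespace URIs are elided in the listing above).

import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

// Hypothetical sketch: look up the processor URL and communication mode
// for the task descriptions stored in a local copy of the LSR.
public final class LsrLookup {
    public static void main(String[] args) {
        Model lsr = RDFDataMgr.loadModel("lsr.ttl");        // assumed local LSR snapshot
        String sparql =
            "PREFIX lsr: <http://example.org/lsr#>\n" +     // placeholder namespace
            "SELECT ?url ?mode WHERE {\n" +
            "  ?svc lsr:has-task-description ?td .\n" +
            "  ?td lsr:provided-at ?url .\n" +
            "  OPTIONAL { ?td lsr:mode ?mode }\n" +         // mode may be absent
            "}";
        Query query = QueryFactory.create(sparql);
        try (QueryExecution qe = QueryExecutionFactory.create(query, lsr)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.nextSolution();
                System.out.println(row.get("url") + " " + row.get("mode"));
            }
        }
    }
}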
Returning to the LSR excerpt: the first observation is that the URI of the CCS namespace is used as an RDF resource with further associated meta data about the language. This is a good example of how to put URIs to use in a Semantic Web setting: linking information from different realms via URIs. The RDF graph specifies that CCS is a process algebra (Line 4) that is implemented by an action service (cf. Lines 8 and 10). Furthermore, the RDF graph contains general information such as the URL of the actual processor (indicated by mars:is-implemented-by, Line 8), and where the RDF description of the language can be found (Line 6).

Each MARS service (or engine) has a set of tasks it can perform that are associated with each type of service, e.g. an action service can execute actions, and a composite event detection engine can (de-)register events and can receive detected events. The LSR specifies for each processor and each relevant task the necessary functional characteristics, such as how to invoke it (lsr:provided-at, Line 15 or 25), whether the service operates synchronously (i.e. it uses the same HTTP connection for the request and for receiving the results) or asynchronously (i.e. the request is sent over one HTTP connection, which is then closed, and the result is sent afterwards over another (new) HTTP connection, where the MARS framework takes care of the correlation between open requests and asynchronously received results; indicated by lsr:mode, Line 18), and whether the service natively supports variable bindings (lsr:variables, Line 20 or 30).

Variable bindings can be supported to different extents: most notably, there is full support ("*"), i.e. the set of tuples of variable bindings is accepted by the service, and serial support ("1"), i.e. only one tuple is accepted at a time and the calling service has to iterate over the set of input bindings, which is the case for a typical Web Service-style service. Lastly, it might be that no variable bindings are accepted ("no"); in this case the calling service iterates over all input bindings, converts them into the proprietary format, and
afterwards "marsifies" the results by wrapping them as variable bindings⁴. A complete reference of all available task options can be found at [AAB+08].

One generic task description that deserves special attention is ...#/analyze-variables. If a service supports this task, the component expression can be sent to the respective language processor, which returns the set of required input variables and also which result variables will be bound by this expression. Afterwards, the GRH only needs to send the relevant input variable bindings.

⁴ Generally, if an option is not applicable for a task, it is indicated by the literal "n.a.".

Concluding Remarks

MARS shares ideas with the Common Object Request Broker Architecture (CORBA) [COR09]: both are high-level infrastructure frameworks that enable interoperation between heterogeneous languages. But while CORBA mainly operates on a lower, more technical level, the MARS framework is a full citizen of the Semantic Web, and interoperability is not achieved by compiling an interface definition language to stubs and skeleton code, but by abstracting the general data flow as variable bindings – which can e.g. be exchanged in an XML serialization or via a database [May09] – and by providing a high-level data flow between the nested language constituents.

It was already argued that CCS has the same expressiveness as ECA rules. The focus in this thesis is on information acquisition from heterogeneous Web data sources, which can naturally be represented as query workflows. Consequently, and without loss of generality, the pursued course is to expose Web data sources via MARS-enabled query engines (DWQL & WSQL, Chapter 5). Then, the combination of different Web data sources can be expressed by an extension of CCS (RelCCS, Chapter 7), where a generic data type manages the exploration of transitive closures in the search space (CGDT, Chapter 8, cf. Figure 3.6). But first, the notion of a Web data source will be clarified and captured for the MARS world in the next chapter.
Part III. Accessing Web Data Sources
Chapter 4. The Annotation Layer

No computer has ever been designed that is ever aware of what it's doing; but most of the time, we aren't either.
– Marvin Minsky (born August 9, 1927)
Contents

4.1. Introduction
4.2. Preliminary Considerations
4.3. Annotation of Web Data Sources
4.4. Web Data Source Description Language
     4.4.1. Base Views
     4.4.2. Derived Views
4.5. Use Case: Annotation of Web Service Sources
     4.5.1. SOAP/WSDL
     4.5.2. Representational State Transfer (REST)
4.6. Annotation of Query Workflows
4.7. Outlook: Query Workflows over Web Data Sources
4.8. Related Work
4.1. Introduction

The Web is today a major source for retrieving data, e.g. information regarding train and flight connections. Via the Web, this data is usually either human-readable (Deep Web) or machine-accessible with heavy use of low-level technologies such as REST [Fie00] or SOAP/WSDL [Cer02, CCMW01, CMRW07, CHL+07] (Web Services). These sources could be very useful for declarative query answering and data workflows, but the task of combining information from different such sources is cumbersome and an actively researched area [BCDM08, CM10].
The term Web data sources covers Web sources of the above kinds that allow access to an underlying database-style repository, and that exhibit at least implicitly a table-like schema. Fortunately, this is the case for virtually all structured, Web-accessible data sources, ranging from Deep Web [FLM98] sources – databases that are accessible via filling out Web forms – to data-centric Web Service APIs. Each Web data source typically offers multiple interaction patterns, e.g. to query train connections by departure or arrival time, which constitute the queryable views of the source.
Figure 4.1 Web Data Source Access Layers
[Diagram: Annotation layer – a Web Data Source (specialized into Deep Web Source and Web Service Source) provides 1..* Views, each either a Base View or a Derived View. Query layer – "logical" access via the Deep Web QL and the Web Service QL, implemented by the DWQL engine and the WSQL engine, respectively. Source layer – "physical" access to the Deep Web sources DW1 ... DWn and the Web Services WS1 ... WSm.]
Figure 4.1 depicts the different abstraction layers with respect to accessing Web data sources. On the annotation layer, the capabilities of Web data sources are described with the Web Data Source Description Language (WDSDL) ontology. As mentioned above, each Web data source provides one or more views, which can be either base views (directly queryable Web data source abstractions), or derived views (higher-level views that are defined transparently on top of base views). Web data sources can be partitioned into two types: those that are not already machine-accessible (Deep Web sources), and those that are already machine-accessible (Web Service sources). The query layer (discussed in Chapter 5) builds upon the notions of the annotation layer and comprises the “logical” access – connecting the ontological definition of a Web
data source and its capabilities to a MARS query language – to Web data sources. Two MARS query languages belong to this layer: the Deep Web Query Language (DWQL), geared towards Deep Web sources, and the Web Service Query Language (WSQL), which covers sources that are already machine-accessible, including but not limited to REST- and SOAP/WSDL-based Web Services.

The ultimate "physical" access to Web data sources is dealt with in the source layer. This is well-understood for Web Service-like sources, but how to query Deep Web sources is still an active research area (cf. [CM10]). Chapter 6 deals with the low-level issues of accessing Web data sources with a special focus on Deep Web sources. Here, the idea is to first automate the navigation to the desired result page containing the hidden data as HTML contents, and then to extract and label the data records contained therein.

Chapter Structure

This chapter starts with an evaluation of the requirements for the description of Web data sources in Section 4.2. Section 4.3 continues this line of thought and introduces the different annotation levels. Section 4.4 presents the WDSDL ontology, which conceptualizes the semantic annotations presented in the previous section, and discusses the distinction between base views and derived views. Afterwards, in Section 4.5 the so far presented concepts are used to annotate the two most prevalent types of Web Service sources (SOAP/WSDL and REST) as Web data sources. Section 4.6 highlights the benefits of annotating Web data sources for query workflows. Finally, Section 4.7 gives an outlook on using Web data sources in query workflows, and the chapter concludes with a presentation of related work in Section 4.8.
4.2. Preliminary Considerations

Handling and combining queries against autonomous Web data sources requires meta data about the values one has to expect and deal with. This is especially important when the result of one source is used as input for another source, or when results from different sources are combined. Consider for example the scenario of finding the cheapest route to a destination, where different national railway portals are queried for pricing information. In most cases, each source will return the prices in the national currency, i.e. for calculating the overall price for a connection, the prices first need to be converted into the desired result currency before they are summed up. Ideally, the required conversions should be handled transparently by the framework. In this section, the foundation for such automatic handling of heterogeneous values is laid by examining the requirements and proposing a comprehensive solution on top of an annotation ontology.

Schema

On the most basic level, each input and return value of a Web data source can be considered on the schematic level, i.e. it has a data type similar to the schema of a database. Here, the same data types as known from common programming languages and database systems are used. For the RDF world, which is used as the meta-language
for describing the semantic annotations, these are provided by the XML Schema data types [PGM+09] (cf. Section 2.2.1). In addition to the basic data types like strings and numbers, XML Schema provides xs:date, xs:time, and xs:dateTime data types similar to SQL. However, in the remainder, not the default formats of the XML Schema data types, but the date and time patterns defined in [Mic08] for the Java programming language are used. This is another example that is specific to the requirements of heterogeneous and autonomous sources: in a closed environment, all dates would be given with respect to a uniform format, but on the Web, each data source might support a proprietary date representation, e.g. "MM/dd/yyyy" for a US Web source.

Semantics

Recall the initial travel example, where prices are given with respect to different currencies. In this example, the annotation with only data types is not sufficient; the notion of dimension is also required. For this, in this thesis, dimensions are additionally considered, which can be either physical dimensions, such as length, duration, or voltage, or non-physical dimensions like distance (which is physically a length), or price. Every dimension is associated with a set of units. The values of dimensions are usually given by (value, unit)-pairs, like (100, km) or (250, €). Here again, in contrast to a closed environment, the units may differ between autonomous sources (e.g. miles vs. kilometers, or $ vs. € or £).

The Annotation Ontology

In this thesis, query workflows and their constituents can be represented and annotated in RDF [MM04] and OWL [MvH04]. This is a fragment of the ontology for dimensions, units, and conversions:
@prefix mars:     <…> .
@prefix dim:      <…> .
@prefix unit:     <…> .
@prefix curr:     <…> .
@prefix xs:       <…> .
@prefix xurrency: <…> .

# RDFS vocabulary layer (excerpt)
mars:Dimension a owl:Class.
mars:Unit a owl:Class.
mars:Conversion a owl:Class.
mars:DynamicConversion rdfs:subClassOf mars:Conversion.
mars:FixedConversion rdfs:subClassOf mars:Conversion.
mars:hasUnits rdfs:domain mars:Dimension;
              rdfs:range mars:Unit.
mars:from rdfs:domain mars:Conversion;
          rdfs:range mars:Unit.
mars:to rdfs:domain mars:Conversion;
        rdfs:range mars:Unit.
mars:factor rdfs:domain mars:Conversion;
            rdfs:range xs:double.
mars:for rdfs:domain mars:DynamicConversion;
         rdfs:range mars:Dimension.
mars:useView rdfs:domain mars:DynamicConversion;
             rdfs:range wdsdl:View.

# RDF instance layer (excerpt)
dim:Length a mars:Dimension;
    mars:hasUnits unit:meter, unit:kilometer, unit:mile, ...
dim:Price a mars:Dimension;
    mars:hasUnits curr:USD, curr:EUR, curr:PLN, ...
dim:MileToKm a mars:FixedConversion;
    mars:from unit:mile;
    mars:to unit:kilometer;
    mars:factor "1.6093"^^xs:double.
dim:PriceConv a mars:DynamicConversion;
    mars:for dim:Price;
    mars:useView <…> .
Some conversions are fixed (e.g., miles to kilometers), and some are dynamic, e.g., $ to €; for the latter, Web data sources can be specified (cf. Section 4.3). The units are represented by fixed URIs, such as curr:EUR for €.
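To give an operational intuition, here is a minimal added sketch – the class and method names are assumptions, not part of the MARS code base – of applying a fixed conversion such as dim:MileToKm; a dynamic conversion like dim:PriceConv would instead query the Web data source view referenced by mars:useView:

import java.util.Objects;

// Illustrative sketch: a fixed conversion multiplies by the annotated
// mars:factor; all identifiers here are hypothetical.
public final class FixedConversion {
    private final String fromUnit;  // e.g. "unit:mile"
    private final String toUnit;    // e.g. "unit:kilometer"
    private final double factor;    // e.g. 1.6093, taken from mars:factor

    public FixedConversion(String fromUnit, String toUnit, double factor) {
        this.fromUnit = Objects.requireNonNull(fromUnit);
        this.toUnit = Objects.requireNonNull(toUnit);
        this.factor = factor;
    }

    /** Converts a value given in fromUnit into toUnit. */
    public double apply(double value) {
        return value * factor;
    }

    public static void main(String[] args) {
        FixedConversion mileToKm =
            new FixedConversion("unit:mile", "unit:kilometer", 1.6093);
        System.out.println(mileToKm.apply(100.0));  // ≈ 160.93 (km)
    }
}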
4.3. Annotation of Web Data Sources

The annotation of Web data sources concerns different levels as well. The technical level describes how to address the source (or its wrapper, respectively). The signature level corresponds to the signature of query services, specifying their input and output parameters. In addition, the MARS ontology of annotations introduced in the last section can be used for the annotation of input and output parameters as well.

Characteristics of Web Data Sources

Conceptually, every Web data source w can be seen as a predicate q0 (its characteristic predicate, which is constituted by the relation that contains all possible input/output combinations) over variables v = {v1, . . . , vn}. The different interaction patterns (i.e., forms or Web Service invocations) with a Web data source can be regarded as predefined views over its characteristic predicate, which in general cannot be queried freely either, but only via a restricted access pattern, i.e. certain input arguments must be given to return the corresponding output values. The modeling associates
each view w with a unique identifying URI (≠ the URL of the corresponding Web form, but "simply" some RDF URI). For each view, some attributes qin = {vin1, . . . , vink} ⊆ v act as inputs, others qout = {vout1, . . . , voutm} ⊆ v \ qin act as outputs. This is called the signature of the view, and is denoted by qout ← w(qin). In the remainder, variable bindings are provided in a slotted notation, e.g. w((vin1/x1, . . . , vink/xk)) means that the view w is invoked with the variable vin1 bound to value x1, . . . , and the variable vink to the value xk (cf. the presentation of MARS variable bindings in Section 3.2.2).

Technical Level

As mentioned in the introduction, Web data sources can be distinguished into two types: Deep Web sources, where the primary interface is given by a Web form, and Web Services that are machine-accessible via protocols like REST or SOAP/WSDL, but do not always provide a declarative interface in any query language. For both types, generic wrapper interfacing languages have been designed in this thesis: the Deep Web Query Language (DWQL), which allows to pose queries against Deep Web sources, and the Web Service Query Language (WSQL), which supports – among others – the generic querying of (REST- and SOAP/WSDL-based) Web Services. Both languages are discussed further in Chapter 5.

To the outside, DWQL and WSQL provide a uniform set-oriented interface of the generic form qout ← w(qin): given a set R of tuples of variable bindings over the input variables qin for that view, and with q0 the characteristic predicate of w, where π denotes relational projection and ⊲⊳ denotes the natural join, the wrapper returns the set of tuples π[qout ∪ qin](q0 ⊲⊳ R). Note that the actual wrappers that map the Web data sources to DWQL or WSQL have to be programmed manually. The annotations start above this level and annotate the wrapped source.

Signature Level

The first step of assigning a signature consists of naming the variables of the characteristic predicate by so-called tags, similar to the way they are used in social bookmarking sites, such as delicious¹.

Example 18 The online train schedules of railway companies are a typical example for Deep Web sources. Users can enter a start and a destination, a date, and a desired departure or arrival time. The answer contains a list of relevant connections, usually together with prices. The German railways Web portal at http://www.bahn.de can be described with the tags start, dest, deptTime, arrTime, desiredDeptTime, desiredArrTime, date, duration, price (note that the chosen tag names are not necessarily the same as the ontology notions – they will be correlated explicitly later). The provided views have the signatures:

(deptTime, arrTime, duration, price) ← germanRailwaysByDept(start, dest, date, desiredDeptTime)

and

(deptTime, arrTime, duration, price) ← germanRailwaysByArr(start, dest, date, desiredArrTime).

¹ http://del.icio.us.com
A result of querying the first view looks as follows:

germanRailwaysByDept((start/"Freiburg", dest/"Göttingen", date/"03.02.2009", desiredDeptTime/"08:00")) =
  { (deptTime/"08:57", arrTime/"13:07", duration/"4:10", price/"95.00"),
    (deptTime/"09:03", arrTime/"14:48", duration/"5:45", price/"85.00"), . . . }
Note that there is e.g. no view to retrieve all cities that can be reached from a given starting point within one hour of traveling. In set-oriented approaches like MARS, the input can consist of multiple tuples. For that, the answer tuples in this chapter are always assumed to also contain the bindings of all input variables. With this, the results can be joined accordingly, as described in Chapter 3.

So far, there are only strings. Annotations are now made on the tag level, since the same annotations hold for each view over the source. According to Section 4.2, each tag is associated with some datatype, and optionally with a syntactical representation and a unit.

Example 19 For the German Railways source, the tags are annotated as follows with datatypes, dimensions, syntactical representation (usually called format), and units.

Tag                                                  Datatype      Format         Unit
start, dest                                          xsd:string    –              –
deptTime, arrTime, desiredDeptTime, desiredArrTime   xsd:time      "HH:mm"        (internal)
duration                                             xsd:time      "HH:mm"        (internal)
date                                                 xsd:date      "dd.MM.yyyy"   (internal)
price                                                xsd:decimal   –              curr:EUR
Web data source descriptions of other railway portals have a similar signature, except probably for the date format and the currency: the source description of the analogue for Polish railways, http://www.pkp.pl, differs only in the last entry – the unit of their price is Zloty, denoted by the URI curr:PLN. The following section introduces the WDSDL ontology for expressing Web data source descriptions themselves in RDF.
4.4. Web Data Source Description Language

As argued before, the natural choice for providing meta data about resources in the Semantic Web is to define an ontology, also called a vocabulary, capturing the abstract notions of the domain of interest in RDF(S) (and OWL). Thus, in a first step, an ontology for the abovementioned features of Web data sources is described in Section 4.4.1. This ontology for base views serves as the basis for defining derived views, which are discussed in Section 4.4.2.
4.4.1. Base Views

The central notion of Web data sources is, as introduced above, the characteristic predicate, upon which the Web data source provides one or more views. The WDSDL ontology first provides notions for describing the characteristic predicate by enumerating the tags used by the source. In RDF terminology, the tags are just resources that have a name and that can be annotated by datatypes, and optionally by dimensions (i.e. "time", "price", etc.), formats (e.g. "HH:mm" for times), and units (meters, miles, $, €, etc.) (cf. Section 4.2 and [HM09a]). The WDSDL vocabulary continues with describing the views provided by the source, and for each view, which tags are used in it as input or output. This is the WDSDL ontology for the technical and signature level in RDF(S):
@prefix mars:  <…> .
@prefix owl:   <…> .
@prefix rdfs:  <…> .
@prefix wdsdl: <…> .
@prefix xs:    <…> .

# Classes
wdsdl:WebDataSource a owl:Class.
wdsdl:View a owl:Class.
wdsdl:Tag a owl:Class.
wdsdl:DeepWebSource rdfs:subClassOf wdsdl:WebDataSource.
wdsdl:WebServiceSource rdfs:subClassOf wdsdl:WebDataSource.

# Properties (base)
wdsdl:baseURL rdfs:domain wdsdl:WebDataSource;
              rdfs:range xs:anyURI.
wdsdl:providesView rdfs:domain wdsdl:WebDataSource;
                   rdfs:range wdsdl:View.
wdsdl:hasTag rdfs:domain wdsdl:WebDataSource;
             rdfs:range wdsdl:Tag.
wdsdl:hasInputVariable rdfs:domain wdsdl:View;
                       rdfs:range wdsdl:Tag.
wdsdl:hasOutputVariable rdfs:domain wdsdl:View;
                        rdfs:range wdsdl:Tag.
wdsdl:name rdfs:domain wdsdl:Tag;
           rdfs:range xs:string.

# Properties (dimensions, measurements, and units)
wdsdl:datatype rdfs:domain wdsdl:Tag;
               rdfs:range xs:anySimpleType.
wdsdl:format rdfs:domain wdsdl:Tag;
             rdfs:range xs:string.
wdsdl:dimension rdfs:domain wdsdl:Tag;
                rdfs:range mars:Dimension.
wdsdl:unit rdfs:domain wdsdl:Tag;
           rdfs:range mars:Unit.
The base part of the above ontology deals with the structural aspects of signatures and introduces the main vocabulary for describing tags and Web data source views. The part dealing with dimensions, measurements, and units is inspired by the preliminary considerations in Section 4.2 and provides the necessary vocabulary for annotating the tags with the respective formats, data types, and so on.

Example 20 Consider the WDSDL annotation for the two views on the railway portal:
@prefix dim:    <…> .
@prefix curr:   <…> .
@prefix xs:     <…> .
@prefix wdsdl:  <…> .
@prefix travel: <…> .

<…> a wdsdl:DeepWebSource;            # the German railways portal
   wdsdl:baseURL <http://www.bahn.de> ;
   wdsdl:providesView <…>, <…> ;      # germanRailwaysByDept, germanRailwaysByArr
   wdsdl:hasTag _:start, _:dest, _:dDeptT, _:dArrT, _:date,
                _:deptT, _:arrT, _:dur, _:price.

_:start a wdsdl:Tag; wdsdl:name "start" ; wdsdl:datatype xs:string.
_:dest  a wdsdl:Tag; wdsdl:name "dest" ;  wdsdl:datatype xs:string.
_:date  a wdsdl:Tag; wdsdl:name "date" ;
        wdsdl:datatype xs:date; wdsdl:format "dd.MM.yyyy" .
_:dDeptT a wdsdl:Tag; wdsdl:name "desiredDeptTime" ;
         wdsdl:datatype xs:time; wdsdl:format "HH:mm" .
_:dArrT a wdsdl:Tag; wdsdl:name "desiredArrTime" ;
        wdsdl:datatype xs:time; wdsdl:format "HH:mm" .
_:deptT a wdsdl:Tag; wdsdl:name "deptTime" ;
        wdsdl:datatype xs:time; wdsdl:format "HH:mm" .
_:arrT  a wdsdl:Tag; wdsdl:name "arrTime" ;
        wdsdl:datatype xs:time; wdsdl:format "HH:mm" .
_:dur   a wdsdl:Tag; wdsdl:name "duration" ;
        wdsdl:datatype xs:time; wdsdl:format "HH:mm" .
_:price a wdsdl:Tag; wdsdl:name "price" ;
        wdsdl:datatype xs:decimal; wdsdl:dimension dim:Price; wdsdl:unit curr:EUR.

<…> a wdsdl:DeepWebView;              # germanRailwaysByDept
   wdsdl:hasInputVariable _:start, _:dest, _:dDeptT, _:date;
   wdsdl:hasOutputVariable _:deptT, _:arrT, _:dur, _:price.
<…> a wdsdl:DeepWebView;              # germanRailwaysByArr
   wdsdl:hasInputVariable _:start, _:dest, _:dArrT, _:date;
   wdsdl:hasOutputVariable _:deptT, _:arrT, _:dur, _:price.
The tag identifiers of the form "_:xxx" are blank nodes and act only internally as identifiers. To the outside, only the tag names are known, as illustrated by the SPARQL query:
prefix wdsdl: <…>

select ?N
where { <…>                                 # the germanRailwaysByDept view
          wdsdl:hasOutputVariable [ wdsdl:name ?N ] }
that can be used to query the names of the tags that are output variables of the view, with the following result:

?N
"price"
"duration"
"arrTime"
"deptTime"
4.4.2. Derived Views

Sometimes the granularity of the native signature of a Web data source is ill-suited for the desired application, or the results of a Web data source cannot be used directly, but an additional data cleaning and correlation phase has to be executed. These are just two examples of typical issues that need to be addressed in a scenario with heterogeneous and autonomous Web data sources. Due to the design principle of separation of concerns, this is handled in a generic fashion in this thesis by introducing derived views.
Projection of Output Variables

The signature of Web data sources is fixed and defined by the service provider. Therefore, the number and selection of output variables can only be constrained internally, inside the MARS realm. The benefit of doing this is twofold: bandwidth is saved during communication between MARS components, since only the relevant variable bindings need to be transferred; additionally, the calling service is not burdened with managing non-essential variable bindings. The latter aspect is especially important because variable bindings can grow in size during the evaluation via joins. Thus, it is paramount to focus on the relevant bound variables.

Given the base view signature qout ← w(qin) and a derived view that projects on the variables qproj ⊂ qout (a projection is only possible on the output variables, since all input variables need to be bound to satisfy the input constraints of the underlying Web data source), the signature of the derived view wproj is defined as qproj ← wproj(qin).
Operationally, first the base view w is evaluated and afterwards the projection on the output variables is performed. This extension of the WDSDL ontology defines the required new classes and properties:
@prefix rdfs:  <…> .
@prefix wdsdl: <…> .

# Classes
wdsdl:DerivedView rdfs:subClassOf wdsdl:View.
wdsdl:BaseView rdfs:subClassOf wdsdl:View.
wdsdl:DeepWebView rdfs:subClassOf wdsdl:BaseView.
wdsdl:WebServiceView rdfs:subClassOf wdsdl:BaseView.

# Properties
wdsdl:hasBaseView rdfs:domain wdsdl:DerivedView;
                  rdfs:range wdsdl:BaseView.
wdsdl:hasOutputVariableProjection rdfs:subPropertyOf wdsdl:hasOutputVariable.
wdsdl:hasOutputVariableProjection rdfs:domain wdsdl:DerivedView.
The extension of the WDSDL ontology is conservative in that all annotations with respect to the base version of the ontology remain "consistent" as well. Note that the tags referenced by the property wdsdl:hasOutputVariableProjection must be used as output variables in the base view.

Example 21 The view germanRailwaysByDept natively has the following output signature: qout = (deptTime, arrTime, duration, price). Suppose a user is preparing her travel expense report, where she is only interested in the price of the connection. She then defines the derived view germanRailwaysByDeptproj with the output signature qproj = (price):
@prefix wdsdl: <…> .

<…> a wdsdl:DerivedView;              # germanRailwaysByDeptproj
   wdsdl:hasBaseView <…> ;            # germanRailwaysByDept
   wdsdl:hasOutputVariableProjection _:price.
The WDSDL fragment needs to appear in the same RDF graph as the one in Example 20, which contains the definition of the base view, since otherwise the local reference to (the blank node) _:price is not resolved correctly. The evaluation proceeds in two steps; first the base view is queried (cf. Example 18):

germanRailwaysByDept((start/"Freiburg", dest/"Göttingen", date/"03.02.2009", desiredDeptTime/"08:00")) =
  { (deptTime/"08:57", arrTime/"13:07", duration/"4:10", price/"95.00"),
    (deptTime/"09:03", arrTime/"14:48", duration/"5:45", price/"85.00"), . . . }
Then the projection is performed and the result of the derived view is obtained as:

germanRailwaysByDeptproj((start/"Freiburg", dest/"Göttingen", date/"03.02.2009", desiredDeptTime/"08:00")) =
  { (price/"95.00"), (price/"85.00"), . . . }
Column Modifiers

There are different situations where it is useful to adapt selected "columns" of each tuple of the result variable bindings. Consider for example the case of the German railways portal that only returns the departure and arrival time in the format "HH:mm", which raises problems when the train arrives the next day (cf. Example 22). For this type of derived view, the signature is identical to the signature of the base view, and it has an intuitive operational representation as a tuple trigger consisting of two parts: a trigger condition that checks whether the trigger should be evaluated or not, and a statement that specifies the low-level operations to be executed. As indicated by the name "trigger", for each tuple in the resulting variable bindings of the base view, the condition is checked, and if it is satisfied, the statement is executed, possibly altering the bound variables in the tuple. The condition and the statement part of the tuple trigger are defined with XQuery, where the communication is realized via initially bound variables and a mapping of the result variables (serialized as an XML fragment) to the associated tags, as illustrated further in Example 22. The extension to the WDSDL ontology is then mostly self-explanatory, with the exception that statements and conditions are defined as opaque strings (= XQuery fragments):
@prefix owl:   <…> .
@prefix rdfs:  <…> .
@prefix wdsdl: <…> .
@prefix xs:    <…> .

# Classes
wdsdl:ResultVarDefinition a owl:Class.
wdsdl:TupleTrigger a owl:Class.

# Properties
wdsdl:hasTupleTrigger rdfs:domain wdsdl:DerivedView;
                      rdfs:range wdsdl:TupleTrigger.
wdsdl:hasCondition rdfs:domain wdsdl:TupleTrigger;
                   rdfs:range xs:string.
wdsdl:hasStatement rdfs:domain wdsdl:TupleTrigger;
                   rdfs:range xs:string.
wdsdl:hasResultVariable rdfs:domain wdsdl:TupleTrigger;
                        rdfs:range wdsdl:ResultVarDefinition.
wdsdl:attName rdfs:domain wdsdl:ResultVarDefinition;
              rdfs:range xs:string.
wdsdl:mapToVar rdfs:domain wdsdl:ResultVarDefinition;
               rdfs:range wdsdl:Tag.
wdsdl:datatype rdfs:domain wdsdl:ResultVarDefinition;
               rdfs:range xs:anySimpleType.
wdsdl:format rdfs:domain wdsdl:ResultVarDefinition;
             rdfs:range xs:string.
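Operationally, a tuple trigger amounts to a simple loop over the result tuples. The following added Java sketch illustrates this semantics; note that in MARS the condition and statement are opaque XQuery fragments evaluated by an XQuery engine, whereas here they are abstracted into plain Java functions, and all names are hypothetical:

import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: a tuple trigger applies a statement to every
// result tuple whose condition is satisfied, possibly altering the
// bound variables of that tuple.
public final class TupleTrigger {
    private final Predicate<Map<String, String>> condition;
    private final Function<Map<String, String>, Map<String, String>> statement;

    public TupleTrigger(Predicate<Map<String, String>> condition,
                        Function<Map<String, String>, Map<String, String>> statement) {
        this.condition = condition;
        this.statement = statement;
    }

    /** Checks the condition for each tuple; if it is satisfied,
     *  the statement is executed on that tuple. */
    public void applyTo(List<Map<String, String>> tuples) {
        for (int i = 0; i < tuples.size(); i++) {
            Map<String, String> tuple = tuples.get(i);
            if (condition.test(tuple)) {
                tuples.set(i, statement.apply(tuple));
            }
        }
    }
}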
Example 22 The German railways portal only returns the departure and arrival time as "HH:mm" for the connection, which raises problems if the train arrives on the next day. The time information can only be correlated correctly with respect to the corresponding input variables, which contain the start date. The derived view definition specifies the appropriate condition and statement:
@prefix wdsdl: <…> .
@prefix xs:    <…> .
...

Figure 4.2 [Diagram: iterated evaluation of a view – starting from a start value, each round binds the base variable to the next value obtained from the chain variable, until the end value is exceeded; overall result = ⋃ⁿᵢ₌₁ sliceᵢ]
Figure 4.2 (and Example 23) further illustrates this approach: the (base) view w is initially queried with the base variable (denoted as vbase, e.g. desiredDeptTime) bound to a start value. Then, one of the result variables (denoted as vchain, e.g. deptTime) is used to obtain new values (denoted as next) for vbase in round 2. Typically, more than one result tuple is returned for a query, and consequently, the maximum value of vchain over all returned tuples is chosen (wrt. the order associated with the underlying data type of the values bound to vchain; e.g. for xs:time, the latest time point is chosen). The computation continues until the new value chosen for next is greater than (again wrt. the order associated with the underlying data type) the user-defined upper bound (denoted as end) for the variable vbase. The overall result is then the union of the results of each round (i.e. all retrieved slices).

The signature of the associated derived view has the same output variables as the base view, but the input signature represents the new access granularity, i.e. vbase is not required as an input variable anymore (since all values for this input variable are determined
automatically). Thus, the input signature of the derived view is qacc = qin \ {vbase}, and the overall signature of the derived view wacc is qout ← wacc(qacc). The operational semantics of the derived view is given by this recursive pattern (in a pseudo-algorithmic notation):

1. query the base view, where vbase is initialized with the start value (and the other variables are initialized analogously to the original query),
2. now, determine the new value for next (as described above), and
3. if next > end then stop, otherwise continue with step (4).
4. query the base view, where vbase is bound to next (and the other variables are initialized to the same values as before), and continue with step (2).

The corresponding WDSDL extension is shown here:
@prefix owl:   <…> .
@prefix rdfs:  <…> .
@prefix wdsdl: <…> .
@prefix xs:    <…> .

# Classes
wdsdl:RestrictionDefinition a owl:Class.

# Properties
wdsdl:hasRestriction rdfs:domain wdsdl:DerivedView;
                     rdfs:range wdsdl:RestrictionDefinition.
wdsdl:onBaseVar rdfs:domain wdsdl:RestrictionDefinition;
                rdfs:range wdsdl:Tag.
wdsdl:chainBy rdfs:domain wdsdl:RestrictionDefinition;
              rdfs:range wdsdl:Tag.
wdsdl:startWith rdfs:domain wdsdl:RestrictionDefinition;
                rdfs:range xs:anySimpleType.
wdsdl:endWith rdfs:domain wdsdl:RestrictionDefinition;
              rdfs:range xs:anySimpleType.
Note that vbase needs to be an input variable, while vchain has to be an output variable. Also, vbase and vchain (and consequently the start and end values) must range over the same simple XML Schema data types [PGM+09].

Example 23 The German railways portal http://www.bahn.de only returns train connections for approximately the next four hours starting from a specific departure time, but it is not possible to query for all train connections on one day between two cities. A straightforward solution would be to design a recursive process fragment that uses the departure time of the latest connection as input for the next query. A cleaner solution is to define a derived view that captures this approach declaratively:
@prefix wdsdl: <…> .
@prefix xs:    <…> .

<…> a wdsdl:DerivedView;              # getConnectionByDate
   wdsdl:hasBaseView <…> ;            # germanRailwaysByDept
   wdsdl:hasRestriction [
      wdsdl:onBaseVar _:dDeptT;
      wdsdl:startWith "00:01:00"^^xs:time;
      wdsdl:chainBy _:deptT;
      wdsdl:endWith "20:00:00"^^xs:time ].
If the derived view getConnectionByDate is now queried for all connections between Freiburg and Göttingen on May 4th, 2010, the evaluation is executed stepwise. Initially, the base view is queried with the departure time set to "00:01" (the start value):

slice1 := germanRailwaysByDept((start/"Freiburg", dest/"Göttingen", date/"04.05.2010", desiredDeptTime/"00:01")) =
  { (deptTime/"00:14", arrTime/"06:54", duration/"06:40", price/"97.00"), . . . ,
    (deptTime/"05:52", arrTime/"10:01", duration/"04:09", price/"97.00") }
In the next round, the same view is queried, but this time using the latest departure time ("05:52") obtained in the previous round:

slice2 := germanRailwaysByDept((start/"Freiburg", dest/"Göttingen", date/"04.05.2010", desiredDeptTime/"05:52")) =
  { (deptTime/"05:52", arrTime/"10:01", duration/"04:09", price/"97.00"), . . . ,
    (deptTime/"07:49", arrTime/"11:41", duration/"03:52", price/"97.00") }
This continues until the last round, where the departure time "19:56" is used as input:

slicen := germanRailwaysByDept((start/"Freiburg", dest/"Göttingen", date/"04.05.2010", desiredDeptTime/"19:56")) =
  { (deptTime/"19:56", arrTime/"05:06", duration/"09:10", price/"115.00"), . . . ,
    (deptTime/"22:57", arrTime/"06:11", duration/"07:14", price/"88.00") }
Now the latest departure time is later than the specified end time, and the output of the derived view is computed as the union of the results of all rounds:

getConnectionByDate((start/"Freiburg", dest/"Göttingen", date/"04.05.2010")) = ⋃ⁿᵢ₌₁ sliceᵢ
Implementation Details

The implementation of derived views is oblivious of the base view implementation. Intuitively, from a programming perspective, there is an interface (or abstract class, depending on the programming language) BaseView that takes a set of tuples of variable bindings as input and returns a set of tuples of variable bindings as output. The implementation of the derived views then directly mimics the operational semantics introduced in this section, as sketched below. Concrete derived views are dynamically instantiated based on their RDF description.
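As a concrete (though hypothetical) rendering of this description – the interface name BaseView is taken from the text above, everything else is an illustrative assumption, not the actual MARS code – a derived view for output projection could look like this:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// A tuple of variable bindings is modeled as a map from tag names to values.
interface BaseView {
    List<Map<String, String>> query(List<Map<String, String>> inputBindings);
}

// Derived view that projects the result of its base view onto q_proj,
// mimicking the operational semantics of Section 4.4.2.
final class ProjectionView implements BaseView {
    private final BaseView baseView;
    private final Set<String> projectionTags;  // q_proj, a subset of q_out

    ProjectionView(BaseView baseView, Set<String> projectionTags) {
        this.baseView = baseView;
        this.projectionTags = projectionTags;
    }

    @Override
    public List<Map<String, String>> query(List<Map<String, String>> inputBindings) {
        // (1) evaluate the base view, (2) project each result tuple
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> tuple : baseView.query(inputBindings)) {
            Map<String, String> projected = new HashMap<>();
            for (String tag : projectionTags) {
                if (tuple.containsKey(tag)) projected.put(tag, tuple.get(tag));
            }
            result.add(projected);
        }
        return result;
    }
}

A restriction-by-chaining view can be realized against the same interface by looping over base view calls as in steps (1)–(4) above.

Now that all the details of the semantic annotation of Web data sources have been presented, the next section showcases how they relate to two selected standards for Web Service sources.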
4.5. Use Case: Annotation of Web Service Sources

The presentation of Web data sources has so far focused mostly on a top-level perspective on Web data sources and their signatures. This section gives an overview of two of the most prevalent types of Web Service sources: SOAP/WSDL and REST. Section 4.5.1 deals with SOAP/WSDL, which enjoys great popularity in industry and was founded by a concerted effort of different companies, while Section 4.5.2 introduces REST, which has its roots in the Ph.D. thesis of Roy Fielding [Fie00] and leverages the HTTP protocol [FGM+99] and XML technologies. The focus here is on the annotation of both kinds of Web Service sources with the WDSDL ontology, thus abstracting away the lower-level details of the actual access to these sources.
4.5.1. SOAP/WSDL

There are two different metaphors in which to think about the industry-driven side of Web Services. One is to view them as yet another "layer cake" of technologies to achieve intercommunication between different programming languages via the Web; the other is to view them from the perspective of the participating parties.

The former view is depicted in Figure 4.3: at the bottom resides the HTTP protocol [FGM+99], which is shown in parentheses to stress that other transport protocols, such as SMTP² [Pos82], could be used as well. This layer is concerned with the transport of messages between two computers. Above this layer, the function call with the according parameters is encoded in a proprietary XML format (either SOAP [ML07, GHM+07a, GHM+07b] or XML-RPC [Win03]). Compared to the MARS technology stack, the SOAP/XML-RPC layer is similar to a combination of the XML serialization of variable bindings with the DWQL or WSQL XML language fragment. The Web Service Description Language (WSDL) [CCMW01, CMRW07, CHL+07] describes the interfaces of function calls as operations and their exchanged messages. It also provides the information where the service is located. In the MARS world, the description of the service is done in WDSDL, while the technical details of the service, such as how to communicate with it and where to find it, are delegated to the LSR. The next layer is the Universal Description Discovery & Integration (UDDI) [CHvRR04] infrastructure, geared towards the description and discovery of Web Service providers, as well as the services they offer and the technical service details (usually in form of a WSDL file). In part this can be compared to the LSR in the MARS framework, which among others can be queried about the language processors. The final layer is the Web Services Business Process Execution Language (WS-BPEL) [AAA+07], which is a block-structured orchestration language that allows to specify a business process as a series of Web Services that have to be called in a specific order. As such, it shares similarities with RelCCS (cf. Chapter 7), the language that is used for defining query workflows. Note that the stack (or layer cake) shown in Figure 4.3 is not a normative way of looking at this kind of Web Services but reflects the view of the author on this topic.

² The Simple Mail Transfer Protocol (SMTP) describes the safe and efficient delivery of e-mails.

Figure 4.3 Web Service "layer cake"

  WS-BPEL
  UDDI
  WSDL
  SOAP/XML-RPC
  (HTTP)
Another way to introduce SOAP/WSDL-style Web Services is by looking at the interaction between the participating partners during the Web Service life cycle, as depicted in Figure 4.4. The life cycle begins with a service provider that publishes a service description (in WSDL format) to a UDDI service registry (1). The service consumer queries the registry to find a service that matches the desired goal criteria and receives the WSDL source description (2). The WSDL file contains all the information to transparently use the service, and the consumer can invoke the service at the service provider (3). Because WSDL interface descriptions capture a subset of the information that WDSDL can express, it is straightforward to define a high-level structural mapping from a WSDL description to the accompanying WDSDL description (cf. Table 4.1 and Example 24).

Figure 4.4 Web Service life cycle
[Diagram: the Service Provider publishes a service description to the Service Registry (1); the Service Consumer queries the registry for a service description (2) and then invokes the service at the provider (3).]
WSDL             WDSDL
operation        view
input message    input signature
output message   output signature
message part     tag

Table 4.1.: Mapping from WSDL to WDSDL

Example 24 The xurrency Web Service³ offers a set of APIs devoted to currency exchange, e.g. to convert amounts from one currency to another. This API can be characterized as the view (tAmount) ← convertCurrency(bAmount, bCode, tCode), where bAmount refers to the base amount, bCode to the currency code of the base currency, tCode to the currency code of the target currency, and tAmount to the target amount, respectively. The WSDL description of this API is shown here as an excerpt:
...

³ http://xurrency.com/
The view is called getValues in the WSDL file and is an operation in WSDL lingo. The associated input and output messages specify the input and output message parts (or tags in WDSDL lingo) and their associated types. All the APIs of the Web Service are bundled by a portType. The concrete binding information is not included here, and the service port binding information is only included to show all the salient parts of a WSDL file. Now, the WDSDL file can be constructed by following the transformation rules in Table 4.1:
@prefix xsd:   <…> .
@prefix wdsdl: <…> .

<…> a wdsdl:WebServiceSource;         # the xurrency Web Service
   wdsdl:baseURL <http://xurrency.com/> ;
   wdsdl:providesView <…> ;           # convertCurrency
   wdsdl:hasTag _:bCode, _:tCode, _:bAmount, _:tAmount.

_:bCode a wdsdl:Tag; wdsdl:name "baseCode" ; wdsdl:datatype xsd:string.
_:tCode a wdsdl:Tag; wdsdl:name "targetCode" ; wdsdl:datatype xsd:string.
_:bAmount a wdsdl:Tag; wdsdl:name "baseAmount" ; wdsdl:datatype xsd:float.
_:tAmount a wdsdl:Tag; wdsdl:name "targetAmount" ; wdsdl:datatype xsd:float.

<…> a wdsdl:WebServiceView;           # convertCurrency (getValues in WSDL)
   wdsdl:hasInputVariable _:bAmount, _:bCode, _:tCode;
   wdsdl:hasOutputVariable _:tAmount.
Note that the tag types directly correspond to the ones given in the WSDL file.
4.5.2. Representational State Transfer (REST)

Representational State Transfer (REST) is a general architectural style for designing distributed hypermedia systems, such as the World Wide Web (WWW) [Fie00]. To make a design RESTful, it has to adhere to a set of constraints or guiding principles, such as providing a uniform interface and a distinction between the resources and their representations (cf. [SS09]). Like in RDF, URIs are the preferred way to identify these resources. The REST principles can also be applied to create Web Services by leveraging the inherent power of the HTTP protocol, which already offers a vocabulary in terms of the GET, PUT, POST, and DELETE methods [RR07]. In the course of this thesis, only the GET method is relevant, because the other methods are used to modify the data on the server.

Figure 4.5 highlights the interaction between the client and the server in a REST-style Web Service architecture. The client encodes its request as an HTTP GET request (instead of sending an XML request as payload of a POST request, as is the case in SOAP/XML-RPC) (1). The input variables are encoded in the search part of the URL⁴, as shown in Example 25 and discussed in detail in Chapter 6. The server processes the request (e.g. by querying an internal database) and returns the result in an "arbitrary" format, e.g. XML or the JavaScript Object Notation (JSON)⁵ (2).

Figure 4.5 Overview of REST-style (read-only) Web Service architecture
[Diagram: (1) the client sends an HTTP GET request to the server; (2) the server returns the result in XML, JSON, . . . format.]

⁴ The part of the URL after the "?".
⁵ http://www.json.org/
The input signature of a RESTful Web Service corresponds to the key=value parameters used in the URL encoding of the HTTP GET request, and the transformation to WDSDL is essentially a one-to-one mapping from the keys to tags in the WDSDL description. However, it is not possible to specify a generic mapping for the output signature, as was the case for WSDL-style Web Services. The reason is that RESTful Web Services can return their results in different formats adhering to "arbitrary" schemas. Consequently, the output signature has to be determined manually, and the result also has to be parsed differently for each service.

Example 25 Travenjoy is a Spain-based meta search engine for flight connections that offers a RESTful Web Service interface⁶. A typical request requires information about the IATA⁷ code of the originating airport, the IATA code of the destination airport, and the date of the journey. The full signature of the view is

(airlineCode, departure, arrival, departureTime, arrivalTime, stop, taxes, baseFare, totalFare) ← flightSearch(departure, arrival, goingDate)

where the output signature will be motivated soon.
Now, if a user wants to query for flights on December 2nd, 2010 from Frankfurt (IATA code "FRA") to New York (IATA code "JFK"), this HTTP GET request would be automatically generated and sent to the Travenjoy server:

http://.../getFlights?departure=FRA&arrival=JFK&goingDate=20101202...
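Assembling such a request from a tuple of input variable bindings is mechanical; the following added sketch illustrates the encoding of the search part (all class and method names are hypothetical, and the base URL is elided just as above):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: encode a tuple of input variable bindings as the
// search part (key=value pairs) of an HTTP GET request URL.
public final class RestRequestBuilder {
    public static String buildGetUrl(String baseUrl, Map<String, String> bindings)
            throws UnsupportedEncodingException {
        StringBuilder url = new StringBuilder(baseUrl);
        char sep = '?';
        for (Map.Entry<String, String> e : bindings.entrySet()) {
            url.append(sep)
               .append(URLEncoder.encode(e.getKey(), "UTF-8"))
               .append('=')
               .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            sep = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> in = new LinkedHashMap<>();
        in.put("departure", "FRA");
        in.put("arrival", "JFK");
        in.put("goingDate", "20101202");   // format "yyyyMMdd", cf. the WDSDL tag below
        // prints http://.../getFlights?departure=FRA&arrival=JFK&goingDate=20101202
        System.out.println(buildGetUrl("http://.../getFlights", in));
    }
}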
The query returns 28 results in an XML document:
<airline code="US">US Airways</airline>
...
<airport code="LGW">Londres - Gatwick (LGW)</airport>
...
<flight>
   <currency>EUR</currency>
   <baseFare>633.9</baseFare>
   <fareName>Supersaver</fareName>
   <taxes>171.45</taxes>
   <totalFare>805.35</totalFare>
   ...
   <airlineCode>BA</airlineCode>
   <arrival>JFK</arrival>
   <arrivalTime>2009-10-02T22:30:00</arrivalTime>
   <departure>FRA</departure>
   <departureTime>2009-10-02T14:10:00</departureTime>
   <stop>1</stop>
   <stopAirport>LHR</stopAirport>
   ...
</flight>

⁶ Actually, at the time of writing, the service has been discontinued and the search engine is now available as a Sourceforge project at http://www.travenjoy.com.
⁷ International Air Transport Association
Here, first all airlines and airports are listed with their keys and context information. Afterwards, the found flights are enumerated, where for each flight the airline code, departure and arrival IATA code, departure and arrival time, the (potential) intermediate stops, and detailed pricing information (divided into base fare, taxes, and total fare) is returned, which facilitates the abovementioned signature. The corresponding WDSDL description captures the input and output signature, as well as the associated data types, formats, dimensions, and units:
@prefix dim:   <…> .
@prefix curr:  <…> .
@prefix xs:    <…> .
@prefix wdsdl: <…> .

<…> a wdsdl:WebServiceSource;         # the Travenjoy flight search
   wdsdl:baseURL <…> ;
   wdsdl:providesView <…> ;           # flightSearch
   wdsdl:hasTag _:departure, _:arrival, _:goingDate, _:airlineCode,
                _:departureTime, _:arrivalTime, _:stop, _:taxes,
                _:baseFare, _:totalFare.

_:departure a wdsdl:Tag; wdsdl:name "departure" ; wdsdl:datatype xs:string.
_:arrival a wdsdl:Tag; wdsdl:name "arrival" ; wdsdl:datatype xs:string.
_:goingDate a wdsdl:Tag; wdsdl:name "goingDate" ;
    wdsdl:datatype xs:date; wdsdl:format "yyyyMMdd" .
_:airlineCode a wdsdl:Tag; wdsdl:name "airlineCode" ; wdsdl:datatype xs:string.
_:departureTime a wdsdl:Tag; wdsdl:name "departureTime" ;
    wdsdl:datatype xs:dateTime; wdsdl:format "yyyy-MM-dd'T'hh:mm:ss" .
_:arrivalTime a wdsdl:Tag; wdsdl:name "arrivalTime" ;
    wdsdl:datatype xs:dateTime; wdsdl:format "yyyy-MM-dd'T'hh:mm:ss" .
_:stop a wdsdl:Tag; wdsdl:name "stop" ; wdsdl:datatype xs:string.
_:taxes a wdsdl:Tag; wdsdl:name "taxes" ;
    wdsdl:datatype xs:decimal; wdsdl:dimension dim:Price; wdsdl:unit curr:EUR.
_:baseFare a wdsdl:Tag; wdsdl:name "baseFare" ;
    wdsdl:datatype xs:decimal; wdsdl:dimension dim:Price; wdsdl:unit curr:EUR.
_:totalFare a wdsdl:Tag; wdsdl:name "totalFare" ;
    wdsdl:datatype xs:decimal; wdsdl:dimension dim:Price; wdsdl:unit curr:EUR.

<…> a wdsdl:WebServiceView;           # flightSearch
   wdsdl:hasInputVariable _:departure, _:arrival, _:goingDate;
   wdsdl:hasOutputVariable _:airlineCode, _:departure, _:arrival,
                           _:departureTime, _:arrivalTime, _:stop,
                           _:taxes, _:baseFare, _:totalFare.
Note how the annotated data types mimic the data types found in the XML result. The next section discusses how the annotation of single Web data sources can be leveraged for the annotation of variables in query workflows.
4.6. Annotation of Query Workflows

The dataflow in MARS rules and query workflows is organized via tuples of variable bindings (cf. Chapter 3). The variables of query workflows are optionally also annotated with a datatype and a dimension. This can be done automatically for most variables by analyzing the query workflow and its variable usage, if the used Web data sources are annotated accordingly. For the annotation with units, either every single value can be annotated (which would require storing the unit as an additional column in the underlying database), or the variable is annotated once (usually based on the annotation of the source where the values originate from), and every value is transformed to that unit. Because all sources are annotated and the overhead is significantly smaller, the latter alternative is used.

Example 26 In the railway example, the price of a connection is annotated with the respective currency. The complete DWQL source description in RDF (N3) format is given in Example 20 on page 67. It lists the tags used by the source, the views provided by the source, and for each view, which tags are used in it as input or output. The unit of a specific tag (here price) can be queried with this SPARQL query:
prefix wdsdl: <…>

select ?U
where { ?S wdsdl:providesView <…> .        # the germanRailwaysByDept view
        ?S wdsdl:hasTag [ wdsdl:name "price"; wdsdl:unit ?U ] }
It yields the URI curr:EUR. Such queries are used when the domains/units of variables of query workflows are derived. Note that query workflows over homogeneous sources, e.g. which all use kilometers and €, work well even without explicit annotation. The annotation becomes important when the sources use different units.

Communication with Sources

As the annotations of the sources include the units that are required for the input variables, the values sent to the sources are converted accordingly (wrt. units, and also wrt. the syntactic representation, e.g. in the case of time and date) by the MARS framework. The returned values are also converted if required.

Value Tolerance

In most cases, values like longitude/latitude are not intended as exact values, or the result of the conversion between miles and kilometers does not usually coincide with the values provided in the query workflow. For that, variables are also annotated with an absolute or relative tolerance indicating when values should be regarded as "the same", e.g. in joins. This is formalized in this ontology snippet:
@prefix mars: <…> .
@prefix owl:  <…> .
@prefix rdfs: <…> .
@prefix xs:   <…> .

# Classes
mars:Variable a owl:Class.

# Properties
mars:relTolerance rdfs:domain mars:Variable;
                  rdfs:range xs:anySimpleType.
mars:absTolerance rdfs:domain mars:Variable;
                  rdfs:range xs:anySimpleType.
Example 27 The following informally stated query workflow illustrates the benefits of annotating workflow variables with a tolerance threshold. Consider searching for used cars satisfying certain properties, where the first query is against a technical database, e.g. resulting in a tuple

t1 = (make/"VW", type/"Passat", displ/"1984", power/"85").

The variables are annotated as follows:
@prefix dim:   <…> .
@prefix mars:  <…> .
@prefix unit:  <…> .
@prefix wdsdl: <…> .
@prefix xs:    <…> .

_:type a mars:Variable; wdsdl:datatype xs:string.
_:displ a mars:Variable;
    wdsdl:dimension dim:CubicVolume;
    wdsdl:unit unit:cubicCentimeter;
    mars:relTolerance "5"^^xs:int.
_:power a mars:Variable;
    wdsdl:dimension dim:Power;
    wdsdl:unit unit:kilowatt;
    mars:absTolerance "1"^^xs:int.
Sources for used cars use a different nomenclature: the same car might there be listed as a "VW Passat, 2.0l, 115HP" (HP = horsepower). Assume the source is queried to "return all data about available VW Passats". One answer tuple is e.g.

t2 = (make/"VW", type/"Passat", displ/"2.0", power/"115"),
where the tag displ is annotated as cubic volume in liters and the tag power is annotated as power in horsepower, as shown in this excerpt of the source description:
@prefix dim:   <...>.
@prefix unit:  <...>.
@prefix wdsdl: <...>.
@prefix xs:    <http://www.w3.org/2001/XMLSchema#>.

_:m a wdsdl:Tag; wdsdl:name "make"; wdsdl:datatype xs:string.
_:t a wdsdl:Tag; wdsdl:name "type"; wdsdl:datatype xs:string.
_:d a wdsdl:Tag; wdsdl:name "displ"; wdsdl:datatype xs:decimal;
    wdsdl:dimension dim:CubicVolume; wdsdl:unit unit:liter.
_:p a wdsdl:Tag; wdsdl:name "power"; wdsdl:datatype xs:int;
    wdsdl:dimension dim:Power; wdsdl:unit unit:hp.
The task of the matching is now not only to convert 115HP into kilowatt, which is 85.76 kW, and to match within a given absolute tolerance (here, due to rounding, ±1 kW), but also to convert 2.0l into cubic centimeters, which is 2000 cm³, and then to match within the relative tolerance against the lower and upper bound. This condition is also met, since 2000 is within a ±5% range of 1984, and thus the two tuples t1 and t2 denote the same car wrt. the user-defined tolerance. Note that the relative tolerance bound is always computed based on the smaller of the two values.
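The tolerance check can be summarized in a small Python sketch; the conversion factors are the usual ones, and the function names are illustrative, not part of MARS:

HP_TO_KW = 0.7457       # mechanical horsepower -> kilowatt
LITER_TO_CCM = 1000.0   # liter -> cubic centimeter

def matches_abs(a, b, abs_tol):
    # absolute tolerance: values are "the same" if they differ by at most abs_tol
    return abs(a - b) <= abs_tol

def matches_rel(a, b, rel_tol_percent):
    # relative tolerance, always computed on the smaller of the two values
    return abs(a - b) <= min(a, b) * rel_tol_percent / 100.0

# power: 115 HP -> 85.76 kW, matched against 85 kW with absTolerance 1 kW
assert matches_abs(115 * HP_TO_KW, 85.0, 1.0)
# displacement: 2.0 l -> 2000 ccm, matched against 1984 ccm with relTolerance 5%
assert matches_rel(2.0 * LITER_TO_CCM, 1984.0, 5.0)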
4.7. Outlook: Query Workflows over Web Data Sources

The basic embedding pattern for queries has been shown in Chapter 3. For DWQL queries, all relevant information for the communication with the DWQL engine needs to be provided. Usually, the variable names used in the query workflow do not coincide with the tag names of the DWQL views (in the same way as, in programming, the variables in a method call do not coincide with the formal parameters of a method definition). As DWQL views are not positional (which would mean that the arguments are ordered), but slotted, the pattern has to indicate how the workflow variables are mapped to the view's variables/tags (and vice versa for the result variables), as illustrated next.

Use Case
The use of the schedule of the German railways as a Deep Web source has been introduced in Example 20 on page 67. For international connections, for instance, from
Freiburg to Poznan, the prices are not always returned. A suitable strategy here is to look up connections to the stations near the border in the railway source of the origin country, and from these stations to the destination in the railway source of the destination country (and analogously for connections that run through three or more countries). Note that the necessity for the conversion of prices naturally emerges in this situation. The approach is illustrated using the connection from Freiburg to Poznan, using http://www.bahn.de for the German railways and http://www.pkp.pl for the Polish railways as Deep Web sources (the details of accessing Web data sources are described in Chapter 6). The full query workflow can be specified in RelCCS [HML09] (cf. Chapter 7) like this:
[RelCCS listing (47 lines): binding of the initial variables, lookup of the border stations (Line 8), DWQL query against the German railways view (Line 14), alternative branches per destination country with the DWQL query against the respective railways view (Line 29), and <!-- specifications for other destination countries -->]
The workflow is simplified such that it applies only to travels to direct neighbor countries. Also, only the relevant parts are shown, and a view ((borderStation) ← hasBorderStation(startC, destC)) is assumed that can be queried for the border stations for each neighboring country. First, the variables start (the start city), startC (the start country), dest (the destination city), destC (the destination country), date, and time are bound to their initial values, resulting in the single tuple

(start/"Freiburg", startC/"D", dest/"Poznan", destC/"PL", date/"27.04.2009", time/"09:00").

Then, the first query (actually evaluated against a travel database) binds the additional variable borderStation (Line 8), depending on the value of destC. Since there are three border stations known for traveling to Poland, three tuples are generated, namely:

from      fromC   to      toC   borderStation     date           time
Freiburg  D       Poznan  PL    Szczecin          "27.04.2009"   "09:00"
Freiburg  D       Poznan  PL    Frankfurt(Oder)   "27.04.2009"   "09:00"
Freiburg  D       Poznan  PL    Görlitz           "27.04.2009"   "09:00"
With these tuples, the first DWQL query is evaluated (Line 14). The tuples are projected and renamed according to the eca:has-{input|output}-variable specifications (borderStation is used as dest, and the additional input variable start can be determined by analyzing the query fragment and is not listed explicitly, cf. Chapter 3), and the view germanRailwaysByDept is retrieved for the input tuples:

{ (start/"Freiburg", dest/"Szczecin", date/"…", desiredDeptTime/"09:00"),
  (start/"Freiburg", dest/"Frankfurt(Oder)", date/"…", desiredDeptTime/"09:00"),
  (start/"Freiburg", dest/"Görlitz", date/"…", desiredDeptTime/"09:00") }
returning the following answer tuples:

{ (start/"Freiburg", dest/"Szczecin", date/"…", desiredDeptTime/"09:00",
   deptTime/"09:49", arrTime/"18:48", duration/"8:59", price/"131.20"),
  ...
  (start/"Freiburg", dest/"Frankfurt(Oder)", date/"…", desiredDeptTime/"09:00",
   deptTime/"09:49", arrTime/"17:26", duration/"7:37", price/"127.00"),
  (start/"Freiburg", dest/"Frankfurt(Oder)", date/"…", desiredDeptTime/"09:00",
   deptTime/"09:49", arrTime/"17:30", duration/"7:41", price/"131.00"),
  ...
  (start/"Freiburg", dest/"Görlitz", date/"…", desiredDeptTime/"09:00",
   deptTime/"10:57", arrTime/"19:27", duration/"8:30", price/"127.00"),
  ... }.

Note that the result is just a set of tuples, not a set of groups of tuples, and although the underlying Deep Web source does not support a set-oriented query interface, DWQL provides a set-oriented interface and iterates internally. The tuples are then unrenamed (dest → borderStation (for joining), arrTime → arrBorderTime, and price → P1). Then, the workflow enters the appropriate alternative branch for querying the railway company of the destination country, where the Polish railways page is annotated with the same signature. For the input to the next query, the renaming is borderStation → start and arrBorderTime → desiredDeptTime. The query returns for each tuple the connecting trains from the respective border station to Poznan (Line 29). The resulting tuples, among them

(start/"Frankfurt(Oder)", dest/"Poznan", date/"…", desiredDeptTime/"17:26", deptTime/"17:33", arrTime/"19:27", duration/"1:54", price/"22.00"),

are then unrenamed (start → borderStation, desiredDeptTime → arrBorderTime, and price → P2) and afterwards joined with the earlier tuples (where the values of borderStation and arrBorderTime form the actual join condition). Finally, Price is obtained as P1 + P2, considering the different currencies as described below.

Reasoning about Workflow Variables
Variables in a RelCCS query workflow can be typed, including information about measurements and units. In the above example, the prices are typed, and the required date formats are annotated and can thus be managed by the MARS framework transparently for the user. The MARS knowledge about the query workflow is depicted in this RDF fragment:
@prefix curr:   <...>.
@prefix dim:    <...>.
@prefix mars:   <...>.
@prefix travel: <...>.
@prefix wdsdl:  <...>.
@prefix xs:     <http://www.w3.org/2001/XMLSchema#>.

[ a mars:Process;
  mars:useVariables _:start, _:startC, _:dest, _:destC, _:date, _:time,
                    _:border, _:arrBT, _:p1, _:p2, _:pr ].

_:start a mars:Variable; mars:name "start"; wdsdl:datatype xs:string.
_:startC a mars:Variable; mars:name "startC". # derived from the first query
_:dest a mars:Variable; mars:name "dest"; wdsdl:datatype xs:string.
_:destC a mars:Variable; mars:name "destC". # derived from the first query
_:date a mars:Variable; mars:name "date"; wdsdl:datatype xs:date;
    wdsdl:format "dd.MM.yyyy".
_:border a mars:Variable; mars:name "borderStation".
_:time a mars:Variable; mars:name "time"; wdsdl:datatype xs:time;
    wdsdl:format "HH:mm".
_:arrBT a mars:Variable; mars:name "arrBorderTime"; wdsdl:datatype xs:time;
    wdsdl:format "HH:mm".
_:arrT a mars:Variable; mars:name "arrTime"; wdsdl:datatype xs:time;
    wdsdl:format "HH:mm".
_:p1 a mars:Variable; mars:name "P1"; wdsdl:datatype xs:decimal;
    wdsdl:dimension dim:price; wdsdl:unit curr:EUR.
_:p2 a mars:Variable; mars:name "P2"; wdsdl:datatype xs:decimal;
    wdsdl:dimension dim:price; wdsdl:unit curr:PLN.
_:pr a mars:Variable; mars:name "price"; wdsdl:datatype xs:decimal;
    wdsdl:dimension dim:price; wdsdl:unit curr:EUR.
It is derived completely from the workflow structure and the DWQL source descriptions. The derivation of the variables' properties is similar to static typing in programming languages. While most of the properties are straightforward, the prices deserve attention: P1, which is the answer from the German railways portal, is known to have the unit curr:EUR, while P2, which is the answer from the Polish railways portal, has the unit curr:PLN. Price, which is derived as P1 + P2, was annotated by the user with the desired goal unit curr:EUR. When computing Price := P1 + P2 for the above sample connection, P2, e.g. 22 PLN, will be converted into 4.87 € before being added. The source annotations also enable some (automatic) verification of the workflow's correctness, e.g. the correct use of dimension-compatible answers.
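A minimal Python sketch of this unit-aware addition; the PLN/EUR rate is an assumed constant chosen to match the 22 PLN ≈ 4.87 € figure above, whereas MARS would obtain the current rate from a conversion service:

# Conversion factors into the reference unit EUR (the PLN rate is an assumption).
TO_EUR = {"curr:EUR": 1.0, "curr:PLN": 4.87 / 22.0}

def convert(value, from_unit, to_unit):
    return value * TO_EUR[from_unit] / TO_EUR[to_unit]

def add_with_goal_unit(p1, u1, p2, u2, goal_unit):
    # Convert both annotated operands into the goal unit before adding.
    return convert(p1, u1, goal_unit) + convert(p2, u2, goal_unit)

# Price := P1 + P2 for the sample connection: 127.00 EUR + 22.00 PLN
price = add_with_goal_unit(127.00, "curr:EUR", 22.00, "curr:PLN", "curr:EUR")
# 22 PLN are converted into 4.87 EUR, so price == 131.87 (EUR)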
4.8. Related Work

The modeling of heterogeneous data sources is an essential prerequisite for the later integration or mediation of these sources. Seminal work in this area has been done in the TSIMMIS project [LYV+98]. There, source capabilities are modeled as templates in the rule-based language MSL [PAGM96]. These templates capture the input and output signature of the underlying sources. Higher-level queries that are not supported directly by the query source, e.g. queries with multiple predicates, are transparently handled by wrappers that divide the query task appropriately into the part that can be handled natively by the query source and the part that is executed by the wrapper [HGMN+97]. This is similar to the definition of derived views as introduced above, which also operate transparently on base views that are comparable to the template description of the source capabilities. The templates are associated with (simple or complex) data types, and results are represented in the Object Exchange Model [PGMW95]. In contrast, the approach presented in this thesis uses a combination of the XML Schema datatypes with an additional grounding in the Semantic Web for associating tags and variables with (physical and non-physical) dimensions.

Work in the area of accessing Deep Web sources also has the modeling of such sources as one aspect. The Information Manifold system [LRO96] uses a Local-as-View (LAV) approach [Len02] to data integration and, more specifically, to the description of data sources. A world view is introduced that ranges over virtual relations and classes specified in a Description Logic [BCM+03]. Source descriptions are then modeled as queries over this world view with additional consideration of source capabilities, such as the input and output signature of the source or constraints (e.g. a source may only return cars that are more expensive than 50000 €). The way sources are modeled is closely related to the approach in this thesis: here, also a LAV modeling of the data sources is used, and the input and output signature is given with respect to tags. Information Manifold uses the constraints for pruning the search space of feasible query plans. Since this thesis is focused on manually created query workflows, constraints on sources do not play such a dominant role, but could easily be included. Information Manifold does not support the concept of derived views directly, which in comparison allows for a more natural and generic modeling of data sources in this thesis.

The Web-at-a-Glance (WAG) system [CCPS00] uses a semi-automatic approach to classify Deep Web sources based on a conceptual model of the domain of interest given in the Graph Model [CSC97]. The Web form representing the entry point (and query capabilities) of the Deep Web sources is expressed as a conjunctive view (or at most a union of conjunctive views) over this domain schema. The representation of the data sources resembles the Information Manifold approach (exchanging the Graph Model for an appropriate Description Logic), while it lacks exhaustive support for constraints. More fine-grained modeling as offered by derived views is also not supported on the modeling layer.
More recent work in this area includes complementary research on the (semi-)automatic matching of different (Deep Web) query interfaces, with the goal to offer a unified interface for query interfaces belonging to a specific domain of interest in the form of a vertical search engine [WYDM04]. In [ZHC04], the focus is on the automatic extraction of form semantics (including determining the range of a form field, i.e. input variable), while in [SMHY07] an algorithm for the automatic detection of valid queries, i.e. queries that might yield results, is presented. The emphasis there is not so much on the associated data types of the Web form but on the valid combination of different input attributes of the form. Finally, [KDYL09] deals with the automatic extraction and mapping of (Deep Web) query interfaces into a hierarchical representation that is amenable to further tasks, such as schema matching. The abovementioned results can be put to use for the automatic derivation of base view descriptions and are also relevant for the technical realization of the access to Deep Web sources as presented in Chapter 6.

A classic research question with respect to data sources with limited access patterns is to automatically find a feasible ordering of a set of such sources for answering a conjunctive query. Typically, the proposed solutions follow a Global-as-View (GAV) [Len02] approach, i.e. the global view is defined over the local view descriptions, such as [NL04a]. There, query interfaces are represented solely by their signature qout ← w(qin) (cf. Section 4.3) without consideration of data types or any further annotations. Consequently, the feasibility is also only considered on the basis of the signature. This conception is extended in [CM08b], where the variables qin ∪ qout additionally range over abstract domain classes (e.g. movies). The same characterization is followed in [BCDM08], where the optimization of multi-domain queries over data sources with limited access patterns based on different cost metrics is described. Besides annotations on the signature level, they also distinguish between search services, which return ranked results, and exact services, which resemble normal Web Services, and incorporate this knowledge during query planning among other context properties, such as the expected result size per invocation or the average answer time of a service. In this thesis, optimization is mainly achieved dynamically during runtime by applying top-k operations on the instance level (cf. Chapter 7) and by pruning the search space with appropriate search strategies (cf. Chapter 8). To some extent, context properties are also included in WDSDL, such as the update frequency of a Web data source – which will be introduced in Chapter 5 – that serves as a threshold for deciding whether a result is retrieved from a cache or whether the source is queried directly.

Apart from Deep Web sources, there are many already machine-accessible data services on the Web. The interface of these services is commonly modeled with the Web Service Description Language (WSDL) [CCMW01, CMRW07, CHL+07], which has already been discussed in Section 4.5.1. To overcome the limitations of WSDL, more elaborate annotation mechanisms have been developed, giving rise to so-called Semantic Web Services. An early proposal is DAML-S [BHL+02]: it enhances the WSDL descriptions of input and output messages to cover abstract types for each message part and also comprises a process model ontology, which can be used to support automated Web Service
invocation, composition, and interoperation. The W3C standard "Semantic Annotations for WSDL and XML Schema" (SAWSDL) [MPW07] leverages the successor of DAML-S, OWL-S [MBM+07], to define a fixed set of WSDL extension attributes. They allow using elements of an external Semantic Web framework, e.g. for mapping message parts to abstract types. In this respect, they could serve as a basis to extend the WDSDL ontology to associate tags with concepts of domain ontologies. Agarwal proposed the Semantic Web Process Description Language (SWPDL) [Aga04], which is based on an OWL ontology and allows describing (composite) Web Services, as well as (collections of) Web pages. Composition is handled in the herein proposed framework either by derived views or by specifying an appropriate query workflow; the framework does not enforce when to use a derived view or a query workflow, since, depending on the application scenario, different solutions might be preferable.

In [BCH05] an approach for annotating Web Services is presented that allows specifying propositional and temporal constraints in addition to the regular input and output signature of a Web Service. The constraints considered in their work are mandated by side-effects of the invocation of Web Services, which is also the case for the Semantic Web Service description proposals. Since the annotations described in this thesis are solely used for collecting information, these issues are not relevant (although the annotations of [BCH05] could serve as one basis for the automatic generation of query workflows). In [dBLPF06, RdBM+06, VKVF08] a layered language architecture (similar to the standard Semantic Web layer cake – cf. Chapter 2) is introduced to enable automatic Web Service discovery, composition, and execution. The framework considers different aspects of Web Services, such as the heterogeneity in data representation and the pre- and postconditions of Web Service invocations. WDSDL is also based on several layers, each of which has a clear-cut purpose and builds neatly on top of the level below. All semantic annotations are given in standard RDF and OWL constructs and are thus Semantic Web documents. Additionally, a more generic and abstract notion of Web data source is supported, which allows representing and using heterogeneous data sources, ranging from Deep Web sources to Web Services. This is possible because WDSDL does not rely on a specific low-level service specification, such as WSDL.

The Unified Modeling Language (UML) [Gog09b] is the default language for modeling software architectures and system interactions in the field of model-driven architecture. In [TG07, KL07], the strength of UML as a graphical modeling language with a clean semantics is leveraged to facilitate the graphical modeling of Semantic Web Services. The idea is to model the Web Services with UML and accompanying languages, such as the Object Constraint Language (OCL) [Gog09a], to lower the entry barrier for users that are already familiar with the model-driven approach. In both proposals [TG07, KL07], the UML representation is first transformed to an XML format and afterwards translated to the XML syntax of OWL-S using XSLT [Ama09]. Currently, (graphical) modeling support is not available for the WDSDL language, but similar mappings from UML to WDSDL could be adopted.
Another common issue is to actually find Web data sources efficiently. A solution to this problem has been proposed in [ADG+09], where both Web Services and Deep Web sources are considered. The idea is to use the descriptions of a set of known Web data sources for finding and annotating new ones. There, Web data sources are modeled in a LAV style as conjunctive queries over predicates from a domain ontology. After the acquisition of a new service, it is automatically provided with a wrapper so that it accepts and returns RDF fragments. This approach could serve as a basis for future work on the automatic generation of WDSDL descriptions. Following the RDF paradigm, the work reported in [VCNS08] models data-providing services as RDF views [CWW+06] for later use in service matchmaking. This kind of "modeling" culminates in [PPH+09], where RDF graphs are directly used as a metaphor for data sources and serve as inputs and outputs of Semantic Web pipes.

One of the main incentives for the semantic annotation of Web Services is the ability to do automatic selection or matchmaking of these services. Current approaches to this problem usually employ some kind of degree-of-match metric to find relevant services [PKPS02, BHRT03, PTA06, VCNS08, KKZ09]. For the composition of the thus found services, sometimes additional user interaction is required [SPH04, AS06]. In the herein presented approach, the Web data sources are identified explicitly and uniquely.

Interestingly enough, although a lot of research is geared towards Semantic Web Services, a study concerning the status of Web Service composition performed in 2007 revealed a large gap between the state of research and the reality of the Web [LYRS07]. Firstly, real-life Web Services are usually not accompanied by semantic descriptions or capability specifications. Secondly, automatic invocation of these services is often not an option; often, even the documentation of the service is missing, or the Web Service description changes frequently, which is cumbersome for manual invocation and maintenance of services as well. Finally (among others), the study reports on huge differences with respect to the used schemas of the Web Services. Using a service description mechanism that is not bound to a specific technology, such as WSDL-style Web Services, is a big plus in such a setting, since the low-level details of the Web data sources are lifted to a higher level. Like any LAV-based approach, WDSDL is also not constrained to the native schema of the data source. What remains, though, is that rapidly changing interfaces need to be maintained; this is alleviated to some extent in that query workflows based on the accompanying WDSDL description need not be changed if the overall signature remains stable.
Chapter 5. The Query Layer

Think of prototypes as a funny markup language – the interpretation is left up to the rendering engine.
– Larry Wall (born September 27, 1954)
Contents

5.1. Introduction
5.2. From Semantic Annotation to Query Language
5.3. Query Engine Overview
     5.3.1. Micro View
     5.3.2. Macro View
5.1. Introduction

In the previous chapter, WDSDL has been presented as an exhaustive formalism for annotating Web data sources. Yet, from the MARS point of view, the ontology just describes the sources on a semantic level; for leveraging these kinds of sources in the MARS framework, an accompanying query language with an appropriate XML serialization is required. In fact, since the requirements for sources that are already machine-accessible and those that are not, i.e. Deep Web sources, vary greatly (cf. Chapter 6), two slightly different languages are introduced, together with an overview of the architectural design decisions that both languages share. Recalling the connection between namespace URIs and language processors in the MARS framework (cf. Section 3.2.1), the query engines implementing the different requirements can be conveniently distinguished using this paradigm, allowing each engine to focus on the relevant execution aspects.

The Web Service Query Language (WSQL) is geared towards Web Service-like data sources, or – more technically put – data sources that are accessible via a programmable API (Application Programming Interface). Typically, in the Web Service setting, the API may offer create, read, update, or delete (often referred to as CRUD) operations on the underlying data. In the context of this thesis, only the read-related operations are necessary, and the invocation of Web Service operations has no side-effects. Although not normally considered Web Services, language- or data-specific query mechanisms, such as SPARQL endpoints, which offer access to RDF repositories via the SPARQL protocol [CFT08], often provide a wealth of high-quality data. Therefore, the term programmable API is used here in the widest possible context, the main criteria being that the services are publicly accessible (via some protocol) and that the returned data is uniquely relatable to the corresponding output variables (e.g. identified by a specific XML tag or attribute when the result is returned as an XML document) by a clear mapping provided by the API of the service provider.

In contrast, the Deep Web Query Language (DWQL) operates solely on top of Deep Web data sources and assumes a common interaction pattern required to access the buried data. Accessing this kind of data source is non-trivial and a research area in its own right (cf. [MAAH09]). Consequently, Chapter 6 presents the technical details implemented in the DWQL engine that automate the human-geared interaction with this kind of source and provides some insights into the labeling and extraction of the data records hidden there. As these sources are designed for human consumption, a service provider typically expects only a small number of requests made by the same user (in contrast to Web Service-style APIs that often allow a larger number of invocations per day, e.g. > 5000). To cater for these special requirements, the DWQL language supports "filter pushing" as introduced in Section 5.2. However, this is the most notable difference from the WSQL language in terms of language constructs, and the two languages are introduced side-by-side throughout this chapter whenever applicable.

Chapter Structure
Section 5.2 establishes the link between the characterization and formalization of Web data sources introduced in Chapter 4 and the MARS world by introducing the salient concepts of WSQL and DWQL. Section 5.3 then describes the architectural patterns that both query engines share, with a distinction between the micro view (the internal processing of queries inside the query engine) and the macro view (the interaction between the query engine and the MARS framework).
5.2. From Semantic Annotation to Query Language

Queries have been introduced in the MARS context in Section 3.2.2 with the signature (v_{i_{j+1}}, …, v_{i_k}) ← q(v_{i_1}, …, v_{i_j}), where q was the unique id of the query, qin := (v_{i_1}, …, v_{i_j}) the input variables (or input signature), and qout := (v_{i_{j+1}}, …, v_{i_k}) the output variables (or output signature). The analysis of Web data sources in Section 4.3 resulted in a similar characterization: qin again denotes the inputs and qout the outputs, while q is the URI (serving as the unique id) of the corresponding view. Thus, a preliminary XML syntax for DWQL and WSQL in DTD-like syntax is given by:

[DTD listing: declaration of the {dwql|wsql} query element, identified by its namespace]
The syntax "{t1|t2}" is used to indicate that either t1 or t2 can be chosen at this point. Intuitively, in the above DTD, once dwql is chosen as namespace, the remaining selections have to be chosen accordingly. Using this DTD, it is possible to uniquely address a Web data source (view), and MARS can identify which processor is responsible for the XML fragment (by "analyzing" the accompanying namespace URI, cf. Section 3.4). What is missing is a mapping from the variables used in the surrounding ECA rule or CCS process to the input and output variables of the DWQL/WSQL view. For the mapping of input and output variables, the WDSDL ontology of Section 4.4 provides the notion of a tag, which has, among other properties, a name (identified by wdsdl:name). This name is unique for each view, i.e. two different views may use the same name, but for one view the names uniquely identify the corresponding tag. With this mapping from tags to variable names, the input and output characteristics can be described using the DTD for the input and output signature of generic query languages, which is repeated here:
[DTD listing: eca:has-input-variable / eca:has-output-variable declarations for the input and output signature]
Since the WDSDL source description already defines all input and output variables, only the variables that need to be renamed have to be stated explicitly in the corresponding XML fragment, as illustrated in Example 28 and later on in Section 5.3.2.

Example 28 The German railways portal http://www.bahn.de has been used before as an example for a typical Deep Web source. It provides two views, one for accessing train connections by departure time (germanRailwaysByDept) and one for accessing them by arrival time (germanRailwaysByArr). If the variables are named accordingly and the process/rule expects the result variables in the same naming scheme the source has been annotated with, this XML fragment is sufficient to access the view germanRailwaysByDept:

[XML fragment: DWQL query addressing the view germanRailwaysByDept]
An intermediate form, where some of the input and output variables of the same view have been renamed to match the annotation of the source, has been anticipatorily highlighted in the outlook on query workflows in Section 4.7. The same calling scheme is used for WSQL queries.

DWQL Extension
Deep Web sources are built and designed for human consumption and do not offer a set-oriented interface, i.e. only one query (or tuple of the set of tuples of variable bindings) can be processed at a time. By taking advantage of the multi-threaded architecture of the Web server that implements the Deep Web source, this can be alleviated to some extent, as will be discussed in Section 5.3.1. Regardless, it is beneficial to keep the load – the number of queries/tuples to be processed – on one Deep Web source as low as possible. This is where a concept known from the area of database optimization called filter pushing can be employed. The intuitive idea behind this approach is to apply filters as early as possible during the evaluation of queries to minimize the number of tuples for subsequent operations. Adapted to DWQL, a Deep Web source is queried until a desired goal criterion is reached, thus pushing the "filter" criterion from the caller to the DWQL engine. For this, the result of each query is checked as to whether it satisfies the criterion; if so, the remaining input tuples are ignored and the overall result so far is sent back. The XML syntax extension of DWQL is depicted here:

[DTD listing: the dwql:until element for specifying the goal criterion]
The conditions for specifying the goal criterion are the same as those used in the atomic test constituents of ECA rules and CCS processes (cf. Section 3.3.1). Again, if they are in the ops namespace, they can be evaluated locally. If an opaque test (indicated by eca:Opaque) is preferred, the condition can be evaluated in any language for which a processor is available (e.g. XQuery).

Example 29 Assume that a user is interested in whether a specific actor has ever won an Oscar in the category "Best Actor", and that there is a source bestActorByYear with the signature (actor) ← bestActorByYear(year). Then – assuming a surrounding query workflow with appropriately named variables that determines the years in which the actor was active – the following DWQL query would stop as soon as it finds a witness (this is a common pattern for the applicability of the until directive: scenarios where the first witness, e.g. the first flight cheaper than 50 € between A and B, is sufficient to satisfy the information need):

[DWQL query fragment against bestActorByYear, with an until condition testing whether actor = "Jack Nicholson"]

According to the Internet Movie Database (http://www.imdb.com), Jack Nicholson made his first appearance on the big screen in 1958 with the movie "The Cry Baby Killer", and the last movie he starred in was "Everything You've Got" in 2010. In total, he shot at least one movie that was shown in movie theaters in 37 distinct years, and consequently the input bindings look as follows: {(year/"1958"), …, (year/"2010")}. Although being nominated 12 times, he won the Oscar "only" three times: first in 1976 for "One Flew Over the Cuckoo's Nest", then in 1984 for "Terms of Endearment", and in 1998 for "As Good as It Gets". Since the input tuples are organized as a set, i.e. are not ordered by default, it takes on average 9.5 queries to hit a year in which Jack Nicholson won an Oscar (the expected position of the first of k = 3 hits among n = 37 randomly ordered tuples is (n+1)/(k+1) = 9.5), which means that only about 9.5/37 ≈ 26% of the original input needs to be queried (on average) for this actor.
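The until semantics can be summarized by the following Python sketch (sequential for clarity; query_view and goal are illustrative stand-ins for the DWQL machinery, not its actual interface):

def evaluate_until(input_tuples, query_view, goal):
    # Query the Deep Web view tuple by tuple and stop at the first witness.
    results = []
    for t in input_tuples:            # unordered set of input tuples
        answers = query_view(t)      # one Deep Web interaction per tuple
        results.extend(answers)
        if any(goal(a) for a in answers):
            break                     # remaining input tuples are ignored
    return results

# e.g. goal = lambda a: a["actor"] == "Jack Nicholson"
# and input_tuples = [{"year": y} for y in years_active]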
5.3. Query Engine Overview

This section explains the common aspects of the DWQL and WSQL engines, with a special focus on the information flow inside each engine (micro view) in Section 5.3.1, and on the information flow between the engines and the MARS framework (macro view) in Section 5.3.2.
5.3.1. Micro View

Most Web data sources offer only interfaces that accept one tuple (or request) at a time. In the case of Deep Web sources the reason is obvious, because this closely relates to the rather serial problem-solving of humans, i.e. a human would simulate parallelism by repeated interactions with the source with different input arguments. To serve multiple parties at the same time, Web servers are usually multi-threaded and can handle multiple requests simultaneously. For Web Service-style sources, the interfaces are likewise tuple-oriented and not set-oriented most of the time, and as such the same issues arise. Both DWQL and WSQL offer a set-oriented interface to the outside, which internally translates to iterating over the set of input bindings and processing them sequentially. To speed up processing, k threads ({T1, …, Tk}) operate in parallel, and once a thread is finished it directly processes the next input tuple of the input bindings to achieve the maximum (allowed) amount of parallelism, as illustrated in Figure 5.1.

Figure 5.1 Parallel processing of input bindings (the input bindings are distributed over the threads T1, …, Tk, whose answers are collected into the result bindings)
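A minimal Python sketch of this evaluation scheme, using a pool of k worker threads (the function names are illustrative, not the engine's API):

from concurrent.futures import ThreadPoolExecutor

def evaluate_set_oriented(input_tuples, query_one, k):
    # Set-oriented facade over a tuple-oriented source: at most k requests
    # run in parallel; a finished worker immediately takes the next tuple.
    results = []
    with ThreadPoolExecutor(max_workers=k) as pool:
        for answers in pool.map(query_one, input_tuples):
            results.extend(answers)
    return results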
Technically, to the service provider it seems as if k different parties are posing different requests at the same time. To avoid an (unintended) denial-of-service attack on the provider, k ≪ ∞ is chosen, where the chosen value for k might vary depending on the service provider. Intuitively, a service provider like Amazon with a vast infrastructure can be expected to deal with more parallel requests than a small one; note that k does not depend on the number of input tuples. Generally speaking, the use of these services should be conducted in an ethical way [TS06]. Fortunately, this has been studied extensively in the context of Web crawlers [RGM01, CGM02, BF07, Web09, ON10]. The main difference is that there the revisit policy has the advantage that different sources are crawled in parallel, whereas here only one source is accessed during a DWQL/WSQL invocation. This being a general issue that is outside the scope of DWQL or WSQL but resides on the process level itself, Chapter 7 will introduce the notion of a top-k operator that culls the active sets of tuples of variable bindings based on a map function over a subset of the currently active variables. To cater for the configuration of parallelism on the annotation level, the WDSDL ontology is extended with non-behavioral requirements [SG05]:
@prefix owl:   <http://www.w3.org/2002/07/owl#>.
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#>.
@prefix wdsdl: <...>.
@prefix xs:    <http://www.w3.org/2001/XMLSchema#>.

wdsdl:nonBehavioralRequirement rdfs:domain wdsdl:WebDataSource.
wdsdl:maxParallelRequests rdfs:subPropertyOf wdsdl:nonBehavioralRequirement;
    rdfs:range xs:int.
Note that wdsdl:nonBehavioralRequirement is an abstract property (i.e. only subproperties of this property are actually employed), and that the rdfs:range of this property has deliberately been left unspecified to allow for different sophistication levels, ranging from full-blown Service Level Agreements (SLAs) [LSQC05] to simple properties that require only a value.

Another aspect of any distributed system is caching, which in this scenario is used to reduce the load on the side of the service provider as well as to speed up querying in general. If the respective Web data source has been asked in the past with the same input parameters and the data is within a user-specified time interval, the offline data is returned as the answer. Otherwise, if the data is not in the cache or it is outdated, the original source is accessed and the cache is updated appropriately, as depicted in Figure 5.2.

Figure 5.2 Caching in DWQL and WSQL ((1) the ECA/CCS engine sends bestActorByYear(year) with year/"1993" to the DWQL/WSQL engine; (2) cache HIT/MISS; (3) on a miss, new data is fetched from the Web data source; (4) the cache is updated; (5) the results are returned)
Example 30 Revisiting Example 29, caching can speed up processing immensely. Assume that in the (not so distant) past, another user started the Oscar query workflow asking for "Al Pacino"; then, among others, the year 1993 would have been queried, since he won the Oscar only in this year, for the movie "Scent of a Woman". Now, if (in this case) the DWQL engine is queried with the same year, the result is obtained directly from the cache and added to the output bindings (cf. Figure 5.2).

A naïve approach to caching would always use the results in the cache if any are available. In reality, the validity of a cache entry depends on the source. For bestActorByYear the validity is incidentally infinite, assuming that Oscars are not revoked once they have been awarded. Other sources, such as stock exchange rates, are extremely volatile and may only be valid for some minutes. To solve this in a dynamic, source-dependent manner, WDSDL is extended with the non-behavioral property wdsdl:updateInterval that assigns a validity interval to each Web data source:
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#>.
@prefix wdsdl: <...>.
@prefix xs:    <http://www.w3.org/2001/XMLSchema#>.

wdsdl:updateInterval rdfs:subPropertyOf wdsdl:nonBehavioralRequirement;
    rdfs:range xs:duration.
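A hedged sketch of the resulting cache lookup in Python; the interval is given in seconds here, and the cache keys and the source interface are assumptions for illustration:

import time

def cached_query(source, input_tuple, cache, update_interval):
    # cache maps (source URI, input tuple) -> (timestamp, results); the
    # input tuple is assumed hashable. update_interval is the validity
    # window derived from wdsdl:updateInterval, e.g. float("inf") for
    # bestActorByYear, or a few minutes for stock exchange rates.
    key = (source.uri, input_tuple)
    if key in cache:
        ts, results = cache[key]
        if time.time() - ts <= update_interval:
            return results                   # (2) cache HIT
    results = source.query(input_tuple)      # (3) cache MISS or outdated entry
    cache[key] = (time.time(), results)      # (4) update cache
    return results                           # (5) return results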
Side Note
The Oscars are awarded in retrospect (i.e. in the year after the movie was in theaters), which is handled transparently in the query workflow. Such behavior cannot be simulated by automatic approaches to query plan generation [NL04a, CM08a, CM08b, BCDM08], since there the sources are usually combined as is, and no adaptation of the instance data is supported. Apart from that, the source bestActorByYear always returns two names per year, since a male person can be awarded the Oscar in the category "Best Actor in a Leading Role" as well as "Best Actor in a Supporting Role". The corresponding Oscar awards for women would be returned by the source bestActressByYear (Oscars are given each year to the best actor and the best actress in these two categories).
5.3.2. Macro View

Each MARS service has an entry in the Language and Service Registry (LSR) (cf. Section 3.4) that governs the communication details. To the outside, the DWQL and WSQL engines share a similar interface, and thus their LSR entries are analogous as well. For this reason, it suffices to discuss the LSR entry of the WSQL engine:
@prefix lsr:  <...>.
@prefix mars: <...>.

<...> a mars:QueryLanguage;
    mars:name "Web Service Query Language";
    mars:shortname "WSQL";
    mars:is-implemented-by <...>.

<...> a mars:QueryService;
    lsr:has-task-description [
        a lsr:TaskDescription;
        lsr:describes-task <...>;
        lsr:provided-at <...>;
        lsr:Reply-To "body";
        lsr:Subject "body";
        lsr:input "element request";
        lsr:variables "*";
        lsr:mode "asynchronous"
    ];
    lsr:has-task-description [
        lsr:describes-task <...>;
        lsr:provided-at <...>;
        lsr:Reply-To "n.a.";
        lsr:Subject "n.a.";
        lsr:input "item";
        lsr:variables "n.a."
    ].
WSQL (and DWQL) offers two tasks: to analyze the variable signature automatically and to evaluate queries. The analysis of the variable signature is realized in three steps, as depicted in Figure 5.3. The Generic Request Handler (GRH) of the caller (e.g. the ECA or CCS engine) sends the XML query fragment (e.g. the one presented in Example 31) to the URL specified in the LSR for the DWQL/WSQL engine for the task "...#analyze-variables" (1); the engine queries the WDSDL source description graph with SPARQL for the associated input and output variables of the view (2) and returns them to the GRH (3), which can now automatically perform the projection on the input variables and send the fragment with the input variable bindings again. Note that renamed variables must always be explicitly specified in the signature of the language fragment (cf. Section 3.2.1).

Figure 5.3 Generic analysis of the view signature ((1) the ECA/CCS engine's GRH sends the XML query fragment to the DWQL/WSQL engine; (2) the engine poses a SPARQL query against the WDSDL source description graph; (3) the variable usage characteristics are returned)
Example 31 The view bestActorByYear has featured prominently in Examples 29 and 30. The signature of the view is (actor) ← bestActorByYear(year), and the accompanying WDSDL source description is:

@prefix xs:     <http://www.w3.org/2001/XMLSchema#>.
@prefix wdsdl:  <...>.
@prefix movies: <...>.

<...> a wdsdl:DeepWebSource;
    wdsdl:baseURL <...>;
    wdsdl:providesView <...>;
    wdsdl:hasTag _:year, _:actor.

_:year a wdsdl:Tag;
    wdsdl:name "year";
    wdsdl:datatype xs:int.
_:actor a wdsdl:Tag;
    wdsdl:name "actor";
    wdsdl:datatype xs:string.

<...> a wdsdl:DeepWebView;
    wdsdl:hasInputVariable _:year;
    wdsdl:hasOutputVariable _:actor.
The XML fragment representing a query to this view has already been shown in Example 29 and is not repeated here. If the DWQL engine receives this XML fragment, it extracts the URI of the view and first determines whether it is a base or a derived view. Afterwards, the variable usage characteristics are collected by using the view's URI in SPARQL templates, resulting in queries like the one shown below:

prefix wdsdl: <...>
prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select ?tagName ?tagDataType ?tagDimension ?tagUnit ?tagFormat ?tagDenotes
where { <...> wdsdl:hasInputVariable ?tagURI.
        ?tagURI a wdsdl:Tag; wdsdl:name ?tagName.
        optional { ?tagURI wdsdl:datatype  ?tagDataType }
        optional { ?tagURI wdsdl:dimension ?tagDimension }
        optional { ?tagURI wdsdl:unit      ?tagUnit }
        optional { ?tagURI wdsdl:denotes   ?tagDenotes }
        optional { ?tagURI wdsdl:format    ?tagFormat } }
This query extracts all data pertaining to input variables; to ensure that as much information as possible is returned, the facultative attributes are put in optional blocks.

For query evaluation, sets of tuples of input variable bindings are accepted (indicated by lsr:variables "*"), and the service iterates internally as elaborated above. The communication details during query evaluation are shown in Figure 5.4. Since the evaluation is asynchronous (indicated by lsr:mode), the caller (e.g. the ECA or CCS engine) sends the input variable bindings and the respective query language fragment to the DWQL/WSQL engine (1) via its GRH (Generic Request Handler, cf. Section 3.4) over an initial HTTP connection. The DWQL/WSQL engine spawns a new thread for the query and processes it by making the required calls to the external Web data source (2). Afterwards, the results are sent back asynchronously over a new HTTP connection via the GRH of the DWQL/WSQL engine to the caller (3) (cf. Example 32).

Figure 5.4 Generic communication during query evaluation ((1) initial HTTP connection from the ECA/CCS engine's GRH to the DWQL/WSQL engine; (2) calls to the Web data source; (3) results returned over a new HTTP connection)
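The asynchronous pattern can be sketched in Python as follows; evaluate, serialize, and the reply endpoint are illustrative placeholders, since the MARS message formats are not modeled here:

import threading
import urllib.request

def handle_query_request(fragment, bindings, reply_to, evaluate, serialize):
    # (1) the initial HTTP request only hands over the query; it returns
    # immediately, while the evaluation runs in a freshly spawned thread.
    def worker():
        results = evaluate(fragment, bindings)     # (2) calls the Web data source
        req = urllib.request.Request(reply_to, data=serialize(results),
                                     method="POST")
        urllib.request.urlopen(req)                # (3) new HTTP connection
    threading.Thread(target=worker).start()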
Example 32 Reconsider the view bestActorByYear from Example 31. After the signature has been determined, the query language fragment is sent to the DWQL engine with the input variable bindings {(year/"1993")} over an initial HTTP connection. The DWQL engine spawns a new thread for processing the query, determines the winners of the Oscars in the categories "Best Actor" and "Best Supporting Actor", and sends the following result variable bindings asynchronously back to the caller via a new HTTP connection:

{(year/"1993", actor/"Al Pacino"), (year/"1993", actor/"Gene Hackman")}

This completes the presentation of the query layer. The next chapter deals with the actual access to Web data sources, where the main emphasis is on making Deep Web sources machine-accessible.
Chapter 6. The Source Layer

An interface is humane if it is responsive to human needs and considerate of human frailties.
– Jef Raskin (March 9, 1943 – February 26, 2005)
Contents

6.1. Introduction
6.2. Static Web Data Sources
6.3. Dynamic Deep Web Sources
     6.3.1. Form Analysis
     6.3.2. Deep Web Navigation
     6.3.3. Web Data Extraction and Data Record Labeling
     6.3.4. DWQL Engine Architecture
6.4. Related Work
6.1. Introduction

Web data sources can be classified into two categories with respect to access patterns: Web Service sources and Deep Web sources, as introduced in Chapter 4. The former were characterized as already being machine-accessible, while the latter offer no machine-accessible API and need special consideration. In terms of the validity of the data, the sources can additionally be divided into static sources, where the data remains stable for a considerable amount of time, and dynamic sources, where the data is in continuous flux. The annotation of (dynamic) Web Service sources has already been described in Section 4.5. Since the actual access to this kind of source is trivial (they already offer an API with a clear semantics), they are not discussed further in this chapter. In contrast, Deep Web sources are not designed to be consumed by machines in an automatic fashion. Although they seem to share some common aspects with RESTful Web Services
(cf. Section 4.5.2), they pose completely different challenges in making them machine-accessible. Deep Web sources may use the GET as well as the POST HTTP method for requesting information from the server, but the returned result is always in HTML format, with a focus on the presentation of the results in a browser and not on easy parsing by machines. Therefore, in contrast to the intuitive idea to treat them as a kind of RESTful Web Service whose key = value combinations that make up the input signature are undocumented and need to be reverse-engineered, and to see the HTML result file as yet another data exchange format, a more elaborate and special-purpose approach is required.

Figure 6.1 Characterization of Web data sources

           Web Service    Deep Web
Static     Section 6.2    Section 6.2
Dynamic    Section 4.5    Section 6.3
Chapter Structure
Figure 6.1 depicts the relationship between the type of Web data source and the respective section that covers the related details. The handling of static Web data sources (regardless of whether they are Deep Web sources or Web Service sources) is described in Section 6.2, while Section 6.3 presents the issues pertaining to dynamic Deep Web sources and proposes a comprehensive approach to making this kind of source machine-accessible. Finally, Section 6.4 concludes the chapter with a presentation of related work for the topics covered in Section 6.3.
6.2. Static Web Data Sources

Not all Web data sources require online access; e.g. fact-oriented Web pages only change in intervals and can be mined with natural language processing techniques (or simple regular expressions) to uncover the hidden meaning. One area where this is used is to bootstrap ontologies from Web pages [DVN04], or to convert human-generated knowledge into machine-readable data as in the DBpedia project [AL07]. There, normal Wikipedia (http://www.wikipedia.org) entries are crawled, analyzed, converted into an RDF graph, and exposed via a SPARQL endpoint. This can be generalized to any kind of static Web data source, or even any kind of data source whose content does not change too frequently, such as an SQL database that contains geographic information.

Mapping Static Web Data Sources to WSQL
Figure 6.2 highlights how static (Web) data sources can be made accessible via the Web Service Query Language (WSQL) that was introduced in the previous chapter.
Figure 6.2 Acquisition and query pattern for static Web data sources ((1) preprocessing: Web Services, HTML pages, and SQL databases are converted into RDF graphs g1, …, gk, which are kept in an RDF store at runtime; (2) the ECA/CCS engine's GRH sends a WSQL query request to the WSQL engine; (3) the WSQL engine queries the RDF graph; (4) the result tuples are sent back)
In a preprocessing step that can be arbitrarily involved, the content of the relevant data source is converted to an RDF graph adhering to a domain-specific vocabulary (1), e.g. by crawling the relevant HTML pages and extracting the buried data with regular expressions, as shown in Example 33. This preprocessing step can be run in intervals to keep the RDF graph synchronized with the data source (theoretically, transparent mediator-like approaches that expose data sources as RDF graphs, like [Biz03], could also be used, operating directly on top of the underlying data). Note that this way, Deep Web data sources that would normally have been accessed via DWQL can now be accessed via WSQL. When this newly converted WSQL source is queried (e.g. in a CCS process) at runtime (cf. Figure 6.2), the Generic Request Handler (GRH) of the caller (e.g. the CCS engine) sends a WSQL query request to the WSQL engine (2), which fills a SPARQL view (to be introduced next) and queries the RDF graph (3). Finally, the result tuples are sent back to the caller (4).
SPARQL Views
Since static Web data sources are now essentially RDF graphs, they can be queried with SPARQL. To incorporate SPARQL queries as first-class citizens in the WDSDL (Web Data Source Description Language, cf. Chapter 4) world, the WDSDL ontology is extended with a new type of view that is executed by the WSQL engine:
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#>.
@prefix wdsdl: <...>.
@prefix xs:    <http://www.w3.org/2001/XMLSchema#>.

# Classes
wdsdl:RDFDataSource rdfs:subClassOf wdsdl:WebDataSource.
wdsdl:SPARQLView rdfs:subClassOf wdsdl:BaseView.

# Properties
wdsdl:usesGraph rdfs:domain wdsdl:SPARQLView; rdfs:range xs:anyURI.
wdsdl:hasTemplate rdfs:domain wdsdl:SPARQLView; rdfs:range xs:string.
The respective graph is identified by the wdsdl:usesGraph property (akin to the from clause in SPARQL). The SPARQL query that extracts the desired variable bindings from the RDF graph is provided by the wdsdl:hasTemplate property. The idea here is to use a fixed query structure in which placeholder variables are replaced with input variable bindings before the SPARQL query is evaluated. These placeholder variables are identified by a leading "§" (generally speaking, any ASCII character could be used that is not part of URIs and the SPARQL core language, such as "$" or "?") and are replaced by the bound values of the input variables with the same name (i.e. the wdsdl:names of the tags of the relevant input variables coincide with the placeholder variables, as further demonstrated in Example 33). This way, the signature of a SPARQL view can be represented as (qout) ← wSPARQL(qin), where wSPARQL is the unique ID of the query. Note that, due to the way SPARQL views are defined, they can automatically be used in derived views, their variable usage characteristics can be analyzed by the GRH, etc.
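The placeholder replacement itself is mechanical, as the following Python sketch shows (escaping of quotes and datatype-aware literal formatting, which a real engine would need, are omitted):

import re

def instantiate_template(template, bindings):
    # Replace each §name placeholder with the bound value of the input
    # variable of the same name, embedded as a plain RDF literal.
    return re.sub(r"§(\w+)",
                  lambda m: '"%s"' % bindings[m.group(1)],
                  template)

# instantiate_template("... ?airport travel:iataCode §iataCode ...",
#                      {"iataCode": "HOQ"})
# yields: ... ?airport travel:iataCode "HOQ" ...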
Example 33 TheAirDB (http://theairdb.com) is a database containing information about airports and airlines that is accessible via a Web frontend. It stores comprehensive data about connections between airports, as well as nearby railway stations. Due to the fact that the Web presentation is based on a template whose slots are filled dynamically on the fly with data from a database, the HTML structure is amenable to automatic processing using regular expressions. The information contained therein is relatively stable and does not change often, and therefore it is beneficial to harvest the information once and convert it into an RDF graph. Fortunately, the URL naming scheme is also based on a template:
http://theairdb.com/airports/???.html
where ??? is the three-letter IATA code of the airport. By sampling the search space and trying out all 26³ = 17576 possible URL combinations, the relevant HTML pages can easily be downloaded (the success rate is 10350/17576 = 58.9%). Afterwards, the HTML pages are processed offline by extracting the relevant information with regular expressions, and the data is transformed to an RDF graph.
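A rough harvesting sketch in Python; the lowercase spelling of the code in the URL and the ICAO extraction pattern are assumptions for illustration, since the actual page template is not reproduced here:

import itertools
import re
import string
import urllib.request

def crawl_theairdb():
    # Enumerate all 26**3 = 17576 candidate IATA codes; about 59% of them
    # resolve to an existing airport page.
    pages = {}
    for letters in itertools.product(string.ascii_lowercase, repeat=3):
        code = "".join(letters)
        url = "http://theairdb.com/airports/%s.html" % code
        try:
            with urllib.request.urlopen(url) as resp:
                pages[code.upper()] = resp.read().decode("utf-8", "replace")
        except Exception:
            pass  # combination does not denote an airport
    return pages

# Offline extraction with a regular expression (the pattern is hypothetical):
ICAO_RE = re.compile(r"ICAO[^A-Z]*([A-Z]{4})")

def extract_icao(html):
    m = ICAO_RE.search(html)
    return m.group(1) if m else None

The resulting RDF graph adheres to this RDFS vocabulary: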
@prefix owl:    <http://www.w3.org/2002/07/owl#>.
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#>.
@prefix travel: <...>.
@prefix xs:     <http://www.w3.org/2001/XMLSchema#>.

# Classes
travel:Airport a owl:Class.
travel:RailwayInfo a owl:Class.

# Properties
travel:iataCode rdfs:domain travel:Airport; rdfs:range xs:string.
travel:icaoCode rdfs:domain travel:Airport; rdfs:range xs:string.
travel:airportName rdfs:domain travel:Airport; rdfs:range xs:string.
travel:runwayLength rdfs:domain travel:Airport; rdfs:range xs:int.
travel:runwayElevation rdfs:domain travel:Airport; rdfs:range xs:int.
travel:inCity rdfs:domain travel:Airport; rdfs:range xs:string.
travel:inCountry rdfs:domain travel:Airport; rdfs:range xs:string.
travel:worldAreaCode rdfs:domain travel:Airport; rdfs:range xs:int.
travel:gmtOffset rdfs:domain travel:Airport; rdfs:range xs:string.
travel:servedDestinations rdfs:domain travel:Airport; rdfs:range xs:int.
travel:flightTo rdfs:domain travel:Airport; rdfs:range xs:string.
travel:nearestRailwayStation rdfs:domain travel:Airport;
    rdfs:range travel:RailwayInfo.
travel:name rdfs:domain travel:RailwayInfo; rdfs:range xs:string.
travel:railwayCode rdfs:domain travel:RailwayInfo; rdfs:range xs:string.
travel:distance rdfs:domain travel:RailwayInfo; rdfs:range xs:int.
Two things are noteworthy regarding the chosen RDFS vocabulary: first, it is “shallow” from an ontological point of view, i.e. cities and countries are represented as strings instead of classes. The reason for this is that the data source is only used for accessing information and not for applying reasoning tasks on top of the data. Second, the property travel:flightTo acts as a foreign key to the travel:iataCode property of another airport. As a next step, the associated SPARQL view is described in a WDSDL file:
@prefix xs:    <http://www.w3.org/2001/XMLSchema#>.
@prefix wdsdl: <...>.

<...> a wdsdl:RDFDataSource;
    wdsdl:baseURL <...>;
    wdsdl:providesView <...>;
    wdsdl:hasTag _:iataCode, _:icaoCode, _:airportName, _:runwayLength,
        _:runwayElevation, _:city, _:country, _:worldAreaCode, _:gmtOffset,
        _:servedDestinations, _:connectionTo, _:railwayStation,
        _:distToRailwayStation.

_:iataCode a wdsdl:Tag; wdsdl:name "iataCode"; wdsdl:datatype xs:string.
_:icaoCode a wdsdl:Tag; wdsdl:name "icaoCode"; wdsdl:datatype xs:string.
_:airportName a wdsdl:Tag; wdsdl:name "airportName"; wdsdl:datatype xs:string.
_:runwayLength a wdsdl:Tag; wdsdl:name "runwayLength"; wdsdl:datatype xs:int.
_:runwayElevation a wdsdl:Tag; wdsdl:name "runwayElevation";
    wdsdl:datatype xs:int.
_:city a wdsdl:Tag; wdsdl:name "city"; wdsdl:datatype xs:string.
_:country a wdsdl:Tag; wdsdl:name "country"; wdsdl:datatype xs:string.
_:worldAreaCode a wdsdl:Tag; wdsdl:name "worldAreaCode"; wdsdl:datatype xs:int.
_:gmtOffset a wdsdl:Tag; wdsdl:name "gmtOffset"; wdsdl:datatype xs:string.
_:servedDestinations a wdsdl:Tag; wdsdl:name "servedDestinations";
    wdsdl:datatype xs:int.
_:connectionTo a wdsdl:Tag; wdsdl:name "connectionTo"; wdsdl:datatype xs:string.
_:railwayStation a wdsdl:Tag; wdsdl:name "railwayStation";
    wdsdl:datatype xs:string.
_:distToRailwayStation a wdsdl:Tag; wdsdl:name "distanceToRailwayStation";
    wdsdl:datatype xs:int.

<...> a wdsdl:SPARQLView;
    wdsdl:hasInputVariable _:iataCode;
    wdsdl:hasOutputVariable _:icaoCode, _:airportName, _:runwayLength,
        _:runwayElevation, _:city, _:country, _:worldAreaCode, _:gmtOffset,
        _:servedDestinations, _:connectionTo, _:railwayStation,
        _:distToRailwayStation;
    wdsdl:usesGraph <...>;
    wdsdl:hasTemplate
        "prefix travel: <...>
         prefix rdf: <...>

         select ?icaoCode ?airportName ?runwayLength ?runwayElevation ?city
                ?country ?worldAreaCode ?gmtOffset ?servedDestinations
                ?connectionTo ?railwayStation ?distanceToRailwayStation
         where { ?airport travel:iataCode §iataCode
                 optional { ?airport travel:icaoCode ?icaoCode }
                 optional { ?airport travel:airportName ?airportName }
                 optional { ?airport travel:runwayLength ?runwayLength }
                 optional { ?airport travel:runwayElevation ?runwayElevation }
                 optional { ?airport travel:inCity ?city }
                 optional { ?airport travel:inCountry ?country }
                 optional { ?airport travel:worldAreaCode ?worldAreaCode }
                 optional { ?airport travel:gmtOffset ?gmtOffset }
                 optional { ?airport travel:servedDestinations ?servedDestinations }
                 optional { ?airport travel:flightTo ?connectionTo }
                 optional { ?airport travel:nearestRailwayStation ?rs.
                            ?rs travel:name ?railwayStation.
                            ?rs travel:distance ?distanceToRailwayStation } }".
There is a close connection between the above WDSDL description and the associated RDFS vocabulary of the RDF graph: the datatypes of the tags coincide with the datatypes of the correlated properties in the RDFS vocabulary. The WDSDL description defines the view (icaoCode, airportName, runwayLength, runwayElevation, city, country, worldAreaCode, gmtOffset, servedDestinations, connectionTo, railwayStation, distanceToRailwayStation) ← getAirportInfoByIATACode(iataCode). Now, when this view is accessed with the input tuple (iataCode/"HOQ"), the SPARQL template (identified by wdsdl:hasTemplate) is instantiated like this:
prefix travel: <...>
prefix rdf: <...>

select ?icaoCode ?airportName ?runwayLength ?runwayElevation ?city ?country
       ?worldAreaCode ?gmtOffset ?servedDestinations ?connectionTo
       ?railwayStation ?distanceToRailwayStation
from <...>
where { ?airport travel:iataCode "HOQ"
  optional { ?airport travel:icaoCode ?icaoCode }
  optional { ?airport travel:airportName ?airportName }
  optional { ?airport travel:runwayLength ?runwayLength }
  optional { ?airport travel:runwayElevation ?runwayElevation }
  optional { ?airport travel:inCity ?city }
  optional { ?airport travel:inCountry ?country }
  optional { ?airport travel:worldAreaCode ?worldAreaCode }
  optional { ?airport travel:gmtOffset ?gmtOffset }
  optional { ?airport travel:servedDestinations ?servedDestinations }
  optional { ?airport travel:flightTo ?connectionTo }
  optional { ?airport travel:nearestRailwayStation ?rs.
             ?rs travel:name ?railwayStation.
             ?rs travel:distance ?distanceToRailwayStation } }
and executed on top of the graph that is addressed by wdsdl:usesGraph, yielding:

  ?icaoCode | ?airportName | ... | ?city | ?country  | ?connectionTo
  ----------+--------------+-----+-------+-----------+--------------
  "EDQM"    | "Hof"        | ... | "Hof" | "Germany" | "MLA"
which is then transformed into the output variable bindings that are returned to the caller. The match between the SPARQL result variables and the output variables of the SPARQL view is achieved by using the wdsdl:name values of the tags as identifiers of the SPARQL result variables. The downside of converting data sources into SPARQL views is that it is only applicable if the data remains valid for a long time. For more short-lived data, such as stock market rates, the mapping proposed above is not applicable; here, online access to the Web data sources is required, as presented in the next section.
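Conceptually, answering such a view invocation amounts to a textual substitution of the §-parameters followed by a renaming of the result variables to the tag names. A minimal sketch of this procedure is given below; all function and attribute names are hypothetical, and execute_sparql stands for an arbitrary SPARQL processor:

    # Sketch of a SPARQL view invocation (names hypothetical).
    def invoke_sparql_view(view, bindings):
        query = view.template
        for tag, value in bindings.items():        # e.g. iataCode -> "HOQ"
            query = query.replace("§" + tag, '"' + value + '"')
        rows = execute_sparql(query, graph=view.uses_graph)
        # Result variables and output tags share the same wdsdl:name,
        # so the variable bindings can be copied over directly.
        return [{tag: row.get(tag) for tag in view.output_tags} for row in rows]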
6.3. Dynamic Deep Web Sources
Dynamic Deep Web data sources consist of the dynamically generated result pages of numerous databases, which can be queried via Web forms. Since they are buried behind human-oriented interfaces that require active interaction by the user to reach the desired result pages, these pages cannot be reached by following links from other pages, and it is therefore challenging to access their content. Figure 6.3 shows the general interaction pattern between a user and a Deep Web site. The user fills out the form with the desired information (1), and the Web form is sent to the server, where it is transformed into a database query. In this phase it is possible that the system needs further user input due to ambiguity in the underlying data – e.g. there might be too many results for a query – and the user has to provide further information on intermediate pages (2). Finally, the Web server has gathered all the necessary information to generate the result page, which is delivered to the user (3).
Figure 6.3 Accessing a Deep Web site [diagram: the user sends the filled-out Web form to the Web server (1), possibly provides further input on intermediate pages p1, ..., pn (2), and the server transforms the form into a database query and generates the HTML result page (3)]
A study performed in [HPZC07] discovered an exponential growth and great subject diversity of these Deep Web sites (= Deep Web data sources). Among other findings, they arrived at the following conclusions:
• There are approximately 43,000 – 96,000 Web-accessible databases,
• The Deep Web is 400 – 500 times larger than the Surface Web4,
• 95% of the available data and information on the Deep Web is freely available.
Taking into account this vast amount of high-quality data, which is geared towards human visitors, it is not surprising that many different research questions are actively pursued in this area, e.g. vertical search engines [HMYW04, NWM07, JYL+09].

Section Structure
The structure of this section closely follows the interaction pattern in Figure 6.3: Section 6.3.1 describes the analysis of Web forms, which is required to automate the filling out and submission of the frontend form (cf. (1) in Figure 6.3). Section 6.3.2 presents a framework for recording and replaying the actions of the user on the intermediate pages that are required to reach a result page (cf. (2) in Figure 6.3). Section 6.3.3 then deals with the labeling and extraction of the data records in the result page (cf. (3) in Figure 6.3). Finally, Section 6.3.4 explains how the three tasks are implemented in the DWQL engine.
6.3.1. Form Analysis
Web forms are omnipresent: whether the user searches for information on Google5, participates in an online vote, or comments on an entry in a blog, she always provides information by filling out and submitting a form. On a more technical level, each input element of a Web form (in the context of this thesis, all elements in a form that can be provided with a value, e.g. checkboxes, are referred to as input elements) is associated with a unique ID, and on submission of the form the value assignments are encoded as either GET or POST HTTP requests [FGM+99].

4 Here, Surface Web refers to static and publicly available Web pages, which contain links to other pages and can be represented as a directed graph. This Web graph can be traversed by crawlers (also known as spiders), and the found pages are then traditionally indexed by search engines.
5 http://www.google.com

Figure 6.4 Web form with HTML representation [figure: a simple search form with the text fields First Name ("Thomas") and Last Name ("Hornung") and a Go! button, together with its HTML source code]
Figure 6.4 shows an example of a simple Web form. The unique ID for the input element labeled First Name is s1 and the one for Last Name is s2. Thus, the associated HTTP GET request looks as follows:
GET /search.cgi?s1=Thomas&s2=Hornung HTTP/1.1
Host: www.example.org
User-Agent: Mozilla/5.0
Accept: image/gif, image/jpeg, */*
Connection: close
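Assembling such a request from the stored input element IDs and the provided values is a plain key-value encoding, as the following sketch illustrates (using the URL and the IDs s1 and s2 of the example above):

    from urllib.parse import urlencode

    # Encode the value assignments for the input element IDs s1 and s2
    # (cf. Figure 6.4) as an HTTP GET request URL.
    values = {"s1": "Thomas", "s2": "Hornung"}
    url = "http://www.example.org/search.cgi?" + urlencode(values)
    # -> http://www.example.org/search.cgi?s1=Thomas&s2=Hornung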
The remainder of this section describes how the hidden input signature of Web forms is made explicit in this thesis. The idea is that users label the relevant input elements as described below, thus providing a mapping from tags onto input elements. It is also discussed how dependencies between these input elements are discovered, and how the initial HTTP GET or POST request is generated from the collected data.

Labeling of Input Elements
Initially, for each new Web page, all occurring forms with all their input elements, IDs, and ranges of legal values – e.g. for dropdown menus, this is the set of legal options – are stored in a database for later analysis. Afterwards, the user can load the desired form and label the input elements; e.g. in Figure 6.5, the maximum price the visitor is willing to pay for a used car has been labeled price-to. The labeling of Web forms is inspired by the idea of social bookmarking [HKGM08]: each user has a personal, evolving vocabulary of tags. These tags correspond to the ones that are also used in the WDSDL description of the Deep Web data source as input tags, where the label of the tag corresponds to the wdsdl:name property value (cf. Chapter 4).

Figure 6.5 Labeling of input elements [screenshot: the search form of http://www.autoscout24.de; the tags car-brand, car-model, price-from, price-to, radius, and zip-code have been dragged onto the corresponding input elements, and the user's tag vocabulary is shown in the upper corner]
Example 34 Figure 6.5 shows the search Web form of the car meta search engine http://www.autoscout24.de. The upper corner shows the user vocabulary – i.e. the tags (or tag labels) she has used in the past – where the size of a tag is determined by how frequently it has been used before. The first task is to associate the tags with their corresponding input elements, which is done by dragging each tag onto the associated element. Afterwards, another window is opened, where the user is prompted for further context information, such as the format, datatype, dimension, and so on. Here again, the options that have been selected most often in the past are preselected by default. The result of this is the first part of the WDSDL description file:
@prefix dim:   <...> .
@prefix curr:  <...> .
@prefix xs:    <http://www.w3.org/2001/XMLSchema#> .
@prefix wdsdl: <...> .

<...> a wdsdl:DeepWebSource;
    wdsdl:baseURL <...> ;
    wdsdl:providesView <...> ;
    wdsdl:hasTag _:tag1, _:tag2, _:tag3, _:tag4.

_:tag1 a wdsdl:Tag; wdsdl:name "car-brand"; wdsdl:datatype xs:string.
_:tag2 a wdsdl:Tag; wdsdl:name "car-model"; wdsdl:datatype xs:string.
_:tag3 a wdsdl:Tag; wdsdl:name "price-from"; wdsdl:datatype xs:int;
    wdsdl:dimension dim:price; wdsdl:unit curr:EUR.
_:tag4 a wdsdl:Tag; wdsdl:name "price-to"; wdsdl:datatype xs:int;
    wdsdl:dimension dim:price; wdsdl:unit curr:EUR.

<...> a wdsdl:DeepWebView;
    wdsdl:hasInputVariable _:tag1, _:tag2, _:tag3, _:tag4.
Note how the wdsdl:name property values match the tag labels used during the labeling, and that the blank node identifiers are generated by appending a counter variable to the base identifier "tag". So far, only the input signature of the view has been defined; the view URI has been declared by the user at the beginning of the acquisition process. Additionally, the mapping from tags to input elements is stored in a database. After the user has labeled all input elements required for her view, the system checks whether there are any dependencies between the input elements, i.e. whether they have to be specified in a specific order. In the car meta search engine in Example 34, each car model depends on its car make. The other input elements are static, i.e. they do not change if one of the other input elements changes. The labeled input elements mark the input signature of the source, and consequently only these elements can later be used for querying.

Dependency Check of Input Elements
The dynamic and static combinations are determined automatically after the user has finished labeling the desired input elements, based on the following idea: the first dropdown menu6 is modified, and afterwards it is checked for all other labeled dropdown menus whether their available options have changed. If this is the case, the dependent dropdown menu is marked as dynamic and is itself modified to uncover layered dependencies. After all dropdown menus have been checked, all menus that have not been identified as dynamic are marked as static. To avoid loops, only those dropdown menus that have not yet participated in a dependency in the current cycle are checked; e.g. in Example 34, the car brand would not be considered during the check for the dependent car model input element. The resulting static and dynamic dependencies for the running example are shown in Figures 6.6 and 6.7, respectively. After the dependency check, the form is submitted, and either a POST or a GET HTTP request is generated, which encodes the value assignments for the input elements. On submission of the form, the request URL is stored as a (basepart, searchpart) tuple7.

6 Only dropdown menus are currently considered as candidates for dynamic elements; all other input element types are assumed to be static by default.
7 According to [BLMM94], URLs are of the form http://<host>:<port>/<path>?<searchpart>, where the part left of the "?" is referred to as basepart in the remainder. For POST requests, the searchpart is encoded in the body of the HTTP request.
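The following simplified sketch illustrates the dependency check; the method and attribute names are hypothetical, and the recursive uncovering of layered dependencies is omitted:

    # Simplified sketch of the dependency check (names hypothetical; the
    # form object is assumed to give read/write access to each labeled
    # dropdown menu in the rendered page).
    def classify_menus(form, labeled_menus):
        dynamic = set()
        for menu in labeled_menus:
            # Only menus that have not yet participated in a dependency
            # in the current cycle are candidates (avoids loops).
            candidates = [m for m in labeled_menus
                          if m is not menu and m not in dynamic]
            before = {m: form.options(m) for m in candidates}
            for option in form.options(menu):
                form.select(menu, option)              # modify this menu ...
                for m in candidates:
                    if form.options(m) != before[m]:   # ... options changed?
                        dynamic.add(m)                 # mark as dynamic
        static = [m for m in labeled_menus if m not in dynamic]
        return static, dynamic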
Figure 6.6 Relation tree for static input elements [tree: the static root node branches into the attribute names (e.g. radius and price-to), each with its fixed set of options (e.g. 5km, 10km, ... and 500 €, 1000 €, ...)]

Figure 6.7 Relation tree for dynamic input elements [tree: the dynamic root node branches into the car-brand options (Audi, BMW, ..., VW), each of which branches into its dependent car-model options (A2, A3, ...; 320, 325, ...; Golf, Polo, ...)]
Additionally, the action attribute of the form and the specific value assignments, which are later required for building new requests offline, are also stored in a database.

Simulation of Web Form Behavior
Suppose the user provides the (MARS) input variable bindings {(car-brand/"BMW", car-model/"850")} for the car meta search engine. Since Deep Web sources often do not have a strict user interface, all input tags are considered optional by the DWQL engine, and all queries operate on a best-effort basis. In this case, no inputs have been provided for price-from and price-to, and as a result they are ignored for this query. Assume the following stored request URL:
(http://www.autoscout24.de/List.aspx, vis=1&make=9&model=16581&...)
i.e. the searchpart is:
vis=1&make=9&model=16581&...
The unique ID of the car-brand in the Web form is make, and the ID of the car-model is model. Internally, the associated values are mapped to integer IDs, and thus the variable assignments first need to be matched to their associated integer IDs, i.e. BMW corresponds to 13, and 850 to 1664. This can be done offline, since the internal integer IDs have been stored alongside their human-readable string labels in the database during the initial acquisition of the Web form. After the mapping, the two key-value pairs can be inserted into the original searchpart, resulting in the new searchpart:
vis=1&make=13&model=1664&...
Note that the vis=1 part at the front has been left unchanged. This is necessary because the searchpart usually contains additional parameters that are required to process the request. If a value cannot be matched exactly, all available entries are compared with a string distance metric [CRF03], and the highest-ranking value is used. Depending on whether a POST or GET request is required, the (Web form) variable bindings are either encoded in the body of the message or directly in the URL. A more elaborate presentation of the associated issues can be found in [WH08a, WH08b]. After the HTTP request is sent to the server, either the result page is returned directly and the data records are extracted, labeled, and returned (cf. Section 6.3.3), or an intermediate page is returned. In the latter case, the Page-Keyword-Action (PKA) paradigm is used to navigate to the result page, as described in the next section.
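Under these assumptions, the offline construction of a new searchpart can be sketched as follows (the form_db object and its methods are hypothetical stand-ins for the database filled during acquisition):

    from urllib.parse import parse_qsl, urlencode

    # Sketch of offline request construction (names hypothetical).
    def build_searchpart(stored_searchpart, bindings, form_db):
        params = dict(parse_qsl(stored_searchpart))
        for tag, value in bindings.items():           # e.g. car-brand/"BMW"
            element_id = form_db.element_id(tag)      # car-brand -> make
            params[element_id] = form_db.value_id(element_id, value)  # BMW -> 13
        return urlencode(params)   # e.g. vis=1&make=13&model=1664&...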
6.3.2. Deep Web Navigation
One of the crucial notions for automating the navigation to a result page is the navigation model. Based on this model, the system determines whether it has already reached the result page or is still on an intermediate page. Additionally, the model reflects the actions that should be performed for a specific intermediate page, e.g. clicking on a link or filling out a new form. The key idea of the Page-Keyword-Action (PKA) paradigm is that the system first determines its location (intermediate vs. result page) by looking for a keyword phrase in the page and then, if required, invokes a series of actions.

Figure 6.8 Deep Web navigation process [flow chart: the Web form is filled out and submitted, yielding an HTML page; if a registered keyword phrase is found, the page is an intermediate page and the associated actions are executed, yielding a new page; otherwise it is a result page and its HTML source code is returned]

Overview
The overall navigation process is illustrated in Figure 6.8: the input variable bindings are mapped to the according internal input element IDs. If the form contains dynamic input elements for which input bindings are provided, it is first checked whether they are valid (i.e. can be matched to a path in the associated dependency tree, cf. Figures 6.6 and 6.7). If so, the form is subsequently filled out and submitted with these combinations, which yields a new Web page. Alternatively, the information obtained during form analysis
is used for directly generating the POST/GET request URL offline. For this new Web page, it is checked whether a registered keyword phrase can be found. If so, the associated actions are performed, a new Web page is returned, and the cycle starts again. This cycle continues as long as a new Web page can be identified as an intermediate page. To avoid an infinite loop, only distinct intermediate pages are allowed to occur in one invocation; the rationale is that once the same intermediate page is found again, a cycle in the intermediate pages is detected and an empty result is returned. If no keyword can be found on the current Web page, the system has detected a result page and returns the HTML source code of the page, which is then processed by the Web data extraction component (cf. Section 6.3.3).

Intermediate Page Keyword
Deep Web pages are typically created dynamically, i.e. they are realized by filling a predefined presentation template with data from a database on-the-fly. This can be leveraged to identify intermediate pages by looking for "constant" elements, which are part of the template and are identical between different manifestations. After the form analysis is finished, the user can iteratively submit the form with different options. If a specific input value combination leads to an intermediate page, she can identify the relevant keyword as follows; if she has already reached a result page for a value combination, no further user interactions are required. Note that as long as she is in the context of the currently active form, she can also access a series of intermediate pages and for each page specify a series of actions, which will be described shortly. The identification of an intermediate page is done via a static text field. The reason is that such a phrase can occur inside many different HTML elements, and given the template assumption it serves as a sufficient discriminatory factor. Of course, other more advanced techniques based on visual markers on the page or more IR-related techniques, such as text classification approaches [NMTM98], could be used in this context as well. However, the overall approach remains the same, and thus far the proposed technique has proven sufficient to identify intermediate pages appropriately.

For some Deep Web sites it does not help to identify intermediate pages: the car meta search engine described in Example 34, for instance, usually either directly yields a result page or is missing salient inputs; although an intermediate error page is returned in the latter case, no further inputs are available and an empty result is returned. This could be alleviated by requesting the desired inputs from the user and then continuing the navigation phase, but since this thesis is focused on automatic processing of Deep Web sources, this solution was not considered.
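The navigation cycle described above can be summarized by the following sketch (attribute and method names are hypothetical):

    # Sketch of the PKA navigation cycle (names hypothetical).
    def pka_navigate(page, navigation_models):
        seen_keywords = set()
        while True:
            model = next((m for m in navigation_models
                          if m.keyword_phrase in page.text), None)
            if model is None:
                return page.html        # no keyword found: result page
            if model.keyword_phrase in seen_keywords:
                return None             # same intermediate page again: cycle
            seen_keywords.add(model.keyword_phrase)
            page = model.execute_actions(page)  # cf. "Intermediate Page Actions"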
Figure 6.9 Intermediate page for the German railways portal [screenshot: the portal prompts for the traveler's age; the keyword phrase "Bitte geben Sie hier ...(es gelten ...Altersgrenzen)" is indicated by a dashed box]

Example 35 Figure 6.9 shows an intermediate page for the view (deptTime, arrTime, duration, price) ← germanRailwaysByDept(start, dest, date, desiredDeptTime), where the input form has already been labeled with the tags start, dest, date, and desiredDeptTime, respectively. Now, if the user searches for a connection from Freiburg (Germany) to Stockholm (Sweden), she is prompted for her age to determine the appropriate price. This intermediate page can be detected by the keyword phrase "Bitte geben Sie hier ...(es gelten ...Altersgrenzen)" (indicated by a dashed box in Figure 6.9). The user marks this phrase, which will serve as a context anchor for recording actions, and the system monitors the actions the user takes to move to another page, as explained next.

Intermediate Page Actions
By using keyword phrases, as discussed above, intermediate pages can be identified. To further mimic the user's behavior, the actions on the intermediate page – such as clicking on a link or filling out and submitting a new (intermediate) form – have to be performed to access the next, preferably result, page. In order to uniquely identify the appropriate HTML elements on which the stored actions should be executed, the path addressing language KAPath, which is a semantic subset of
XPath, is introduced. To access the appropriate action element, the system first finds the common ancestor of the keyword element (i.e. the HTML tag surrounding the keyword phrase) and the action element, and then descends downwards in the branch of the action element. The assumption behind this is that the local path from the keyword element to the action element is more stable than traversing the HTML tree of the intermediate page from the root node. Besides, the keyword element is guaranteed to be present, since otherwise the intermediate page would not have been detected. Afterwards, the registered actions are executed on the found action element. Overall, KAPath supports the following path expressions:
• /node[@aname1=avalue1]...[@anamen=avaluen]: the element of type node in the DOM tree that matches the specified attribute name-value combinations,
• /P: the immediate parent node of the current node,
• /P::P: all (transitive) parent nodes of the current node,
• /P::P/node[@aname1=avalue1]...[@anamen=avaluen]: the first parent node of type node found in the DOM tree, starting from the current node, that matches the specified attribute name-value combinations,
• /Child: the immediate child nodes of the current node,
• /Child::Child: all (transitive) child nodes of the current node,
• /Child::Child/node[@aname1=avalue1]...[@anamen=avaluen]: the first child node of type node found in the DOM tree, starting from the current node, that matches the specified attribute name-value combinations.

Figure 6.10 shows an example of how the associated action element in the page can be referenced with respect to the page keyword by a KAPath expression. Here, the tbody node is the first common parent node for both (keyword and action) elements. Therefore, the system automatically generates a KAPath expression that allows optional intermediate elements between the keyword and the first common parent node. For finding the correct action element, it is crucial to consider its attributes as well. If the desired action element has no attributes (e.g. links) or only dynamic ones (e.g. visibility), the absolute path from the keyword to the action element and the tree structure starting from the common parent are additionally stored. Another situation where this is beneficial is when the HTML page structure has changed and the common parent node is still on the same level in the DOM tree but in another branch. The tree structure is helpful if there are changes on the way downwards from the common parent node. For a more in-depth discussion of the algorithms that are used to find the appropriate action elements starting from the keyword element, the interested reader is referred to [WH08b, Wan08].

Figure 6.10 KAPath expression allowing optional HTML elements [tree: the keyword phrase (a #text node inside an h2) and the action element (an input) have tbody as their first common parent node; the three addressing schemes are:
  Absolute path:  /ParentNode/ParentNode/ParentNode/Child[1]/Child[1]/Child[0]
  KAPath:         /P::P/tbody[@a1=v1][@a2=v2]/Child::Child/input[@a3=v3]
  Tree structure: tbody(tr,tr)/tr(td,td)/td(input)]

Recording User Actions
Based on the user's browsing behavior, the system can generate the complete navigation model. First, she identifies the keyword for an intermediate page by clicking on the relevant text in the Web page. Then, the system determines the closest surrounding HTML element and stores the relevant context information. Afterwards, the system monitors the user behavior and stores each action she performs until she reaches
a new page. Based on this action log, the system can automatically determine the paths and tree structures for each action. To ease the recording of the user actions, an HTML action language called WScript has been implemented, which is similar to Chickenfoot [BWR+05]. This intermediate script language is convenient because WScript supports the navigation structures defined earlier for finding the HTML elements on which the actions have to be invoked. Each action in the script language consists of a navigation part and (if applicable) an input part, forming the following supported actions:
• Clicking on links: link(absolute path, KAPath, tree structure),
• Entering text in input fields: enter(absolute path, KAPath, tree structure, element name, element ID, input value),
• Selecting a checkbox or radio button: click(absolute path, KAPath, tree structure, element name, element ID),
• Selecting an option from a dropdown menu: dropdown(absolute path, KAPath, tree structure, option text, element name, element ID), and
• Submitting forms: click(absolute path, KAPath, tree structure, element name, element ID).
The element name and ID that are present for some actions are identical to the name and ID attributes of the underlying HTML element and are used to find the relevant HTML element first. If the lookup by ID and name fails, the search for the action element continues with the KAPath as usual. For example, the following action expression would enter the string "Hallo World" into the text input field of the HTML tree in Figure 6.10:
enter(/ParentNode/ParentNode/ParentNode/Child[1]/Child[1]/Child[0],
      /P::P/tbody[@a1=v1][@a2=v2]/Child::Child/input[@a3=v3],
      tbody(tr,tr)/tr(td,td)/td(input),
      ,            % element name not available
      ,            % element ID not available
      Hallo World)
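The lookup of the HTML element that such an action refers to can be sketched as follows; this is a simplification of the algorithms described in [WH08b, Wan08], and all helper names are hypothetical:

    # Sketch of locating the action element for a recorded action
    # (simplified; all helper names are hypothetical).
    def find_action_element(page, keyword_element, action):
        # 1. Try the lookup by element name and ID, if available.
        if action.element_id or action.element_name:
            element = page.find_by_id_or_name(action.element_id,
                                              action.element_name)
            if element is not None:
                return element
        # 2. Evaluate the KAPath expression: ascend from the keyword element
        #    to the common parent, then descend in the action element branch.
        parent = ascend_to_match(keyword_element, action.kapath.parent_step)
        element = descend_to_match(parent, action.kapath.child_step)
        if element is not None:
            return element
        # 3. Fall back to the stored absolute path and tree structure if the
        #    page structure has changed.
        return (follow_path(keyword_element, action.absolute_path)
                or match_tree_structure(keyword_element, action.tree_structure))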
Together, the keyword phrase (with its associated element) and the associated actions form the navigation model for this intermediate page. For each Deep Web data source, multiple navigation models can be registered to enable the system to automatically navigate to a result page. The next section discusses how to extract (and label) the data records hidden in the result pages found this way.
6.3.3. Web Data Extraction and Data Record Labeling
To enable easy and accurate labeling of the input form, the relevant tags can be dragged onto the elements of the selected form, as shown in Example 34. Since the relevant input elements are already clearly discernible, no further preprocessing in terms of visual presentation of the input form has to be done. The opposite is true for result pages: here, the visual presentation of the data records varies greatly. Figure 6.11 highlights two different layout styles of HTML result pages: the first is a tabular layout of results that closely resembles the way a database table is depicted in the relational model. Building on this analogy, a data record is a row in this table, where the (so far datatype-less) schema is given as (label1, ..., labeln). The data records in the boxed layout correspond to the dashed boxes. Here, each value is explicitly associated (by means of visual closeness) with the according label, which is clearly understandable for humans but poses more challenges for machines. On the HTML level, the tabular representation on the left side is usually achieved with an HTML <table> tag, resulting (almost) in an XML-style structure that can be easily parsed and processed by machines. Yet, the boxed layout could be achieved on the HTML level with different means, calling for a more generic approach to Web data extraction.
Figure 6.11 HTML result page layouts [figure: on the left, a tabular layout with a header row (label1, ..., labeln) and value rows (valuei1, ..., valuein); on the right, a boxed layout where each data record is a dashed box containing label-value pairs (label1: valuei1, ..., labeln: valuein)]
For this, the (already available) data extraction system ViPER [SL05] is leveraged, which, based on visual information, suggests identified data regions – i.e. regions in the Web page that contain data records with a similar pattern – to the user in order of decreasing importance. In general, the foremost recommendation matches the content of interest and is thus suggested by default, but it is also possible to opt for a different region if so desired. Regardless of the selection, ViPER always tries to convert the structured data into a tabular layout, similar to the one shown on the left in Figure 6.11. This rearrangement makes it possible to automatically clean the data and serves as a comfortable representation for labeling, as illustrated in Example 36.

Figure 6.12 Labeling of the HTML result page [screenshot: the original result page with the active data region highlighted, and below it the tabular representation of the extracted data records; the column tags include car-brand, car-model, price, ps, kw, price-from, price-to, radius, zip-code, kilometrage, registrationDate, city, description, picture, and detailURL]
Example 36 Figure 6.12 depicts the result page of the search for a BMW 850 on the car meta search engine http://www.autoscout24.de, which was introduced in Example 34. At the top, the original result of the search is shown, where the ViPER system has highlighted the currently active data region in blue. At the bottom, the tabular representation of the data records contained therein is shown. Now, the columns can be labeled similarly to the way the input form was labeled in Example 34. Currently, the user is about to label the column containing the cities where the cars can be bought with the tag city. Once the labeling is finished, the WDSDL description of this Deep Web data source is complete:
@prefix dim:    <...> .
@prefix car:    <...> .
@prefix curr:   <...> .
@prefix travel: <...> .
@prefix xs:     <http://www.w3.org/2001/XMLSchema#> .
@prefix wdsdl:  <...> .

<...> a wdsdl:DeepWebSource;
    wdsdl:baseURL <...> ;
    wdsdl:providesView <...> ;
    wdsdl:hasTag _:tag1, _:tag2, _:tag3, _:tag4, _:tag5, _:tag6, _:tag7.

_:tag1 a wdsdl:Tag; wdsdl:name "car-brand"; wdsdl:datatype xs:string.
_:tag2 a wdsdl:Tag; wdsdl:name "car-model"; wdsdl:datatype xs:string.
_:tag3 a wdsdl:Tag; wdsdl:name "price-from"; wdsdl:datatype xs:int;
    wdsdl:dimension dim:price; wdsdl:unit curr:EUR.
_:tag4 a wdsdl:Tag; wdsdl:name "price-to"; wdsdl:datatype xs:int;
    wdsdl:dimension dim:price; wdsdl:unit curr:EUR.
_:tag5 a wdsdl:Tag; wdsdl:name "city"; wdsdl:datatype xs:string.
_:tag6 a wdsdl:Tag; wdsdl:name "price"; wdsdl:datatype xs:int;
    wdsdl:dimension dim:price; wdsdl:unit curr:EUR.
_:tag7 a wdsdl:Tag; wdsdl:name "ps"; wdsdl:datatype xs:int;
    wdsdl:dimension dim:enginePower; wdsdl:unit car:ps.

<...> a wdsdl:DeepWebView;
    wdsdl:hasInputVariable _:tag1, _:tag2, _:tag3, _:tag4;
    wdsdl:hasOutputVariable _:tag5, _:tag6, _:tag7.
The view findCarByPrice has the resulting signature (city, price, ps) ← findCarByPrice(car-brand, car-model, price-from, price-to). The minimalistic output signature (city, price, ps) is mainly chosen for brevity, but it also shows that although a Deep Web data source could return more information, it is only returned if the underlying source is labeled accordingly.
As a side-effect of the labeling of the HTML result page, labeling rules are created, as well as data cleaning rules that mimic the operations performed by the ViPER system and the fine-tuning by the user. The exact details of the rule generation and application, as well as the extraction of the Web data by a wrapper, are outside the scope of this thesis. More information pertaining to these issues can be found in [SL05, SHL06], and an in-depth discussion of the ViPER system in general can be found in [Sim08].
6.3.4. DWQL Engine Architecture
So far, the acquisition of Deep Web data sources, the simulation of user behavior, and the extraction of data records from HTML result pages have been presented in Sections 6.3.1 through 6.3.3. This section is devoted to the interaction of the navigation and data extraction components inside the DWQL engine during query time, as depicted in Figure 6.13.

Figure 6.13 DWQL engine architecture and cooperation with MARS [diagram: the ECA/CCS engine communicates with the DWQL controller via the GRH (1, 8, 9); the controller feeds the in-queue of the Deep Web navigation component (2, 3), whose out-queue serves as the in-queue of the Web data extraction component (4, 5); the extracted results flow through the out-queue of the extraction component back to the controller (6, 7); a cache (dashed) is attached to the controller]
Recall the discussion of the generic communication during query evaluation that was shown in Figure 5.4 on page 105 in Chapter 5. Figure 6.13 can be seen as an extended version of this figure where the internal communication inside the DWQL engine is also shown. Here again, the caller (e.g. the ECA or CCS engine) sends the input variable bindings and the respective query language fragment to the DWQL engine via its GRH (1). The DWQL controller uses this information and creates a navigation task that contains the ID of the requested view as well as the input variables in an internal format, and inserts the navigation task into the in-queue of the Deep Web navigation component (2) (the DWQL engine operates highly decoupled and its constituents communicate via message queues). The Deep Web navigation component checks its in-queue, and if tasks are available, they are processed (cf. Section 6.3.2) (3). Once the HTML code of the result page has been found, a new task is generated for the Web data extraction component and inserted in the out-queue of the navigation component, which also acts as in-queue of the data extraction component (4). The data extraction task directly contains the HTML source
code as payload; only inserting the URL of the result page is not sufficient, since some Web servers always display the same URL during navigation and thus the wrong page would be used during data extraction – this is e.g. the case for AJAX8-based applications [DFK+09]. The Web data extraction component also checks at intervals for new tasks; if new extraction tasks are available, the relevant data records are extracted and the results are put in the out-queue of the Web data extraction component (cf. Section 6.3.3) (5-6). The DWQL controller then takes the extracted data records, converts them from the internal format to the respective output variable bindings, and submits the results to its GRH (7-8), which returns them to the caller (9).

8 Asynchronous JavaScript and XML [Pau05]

As already mentioned above, this approach is highly decoupled and allows for maximum parallelism, since the controller as well as the navigation and data extraction components are multi-threaded and run in parallel. Also, due to the separation of the two components, they can easily be exchanged, as long as they support the necessary interfaces. It would even be possible to use several different navigation and data extraction components at the same time, which makes it possible to adequately address the special needs of Deep Web data sources. In Section 5.3, general features of the WSQL and DWQL engine architecture have been discussed; among others, some ethical aspects of parallel processing of Web data sources were briefly mentioned. These directly apply to the DWQL engine, e.g. the number of parallel extraction tasks for one data source is governed by the same factor k that was mentioned there. The use of a cache was already discussed in Chapter 5, and thus it is only shown in dashed lines in Figure 6.13 to indicate its position in the DWQL engine.
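The decoupled interplay of the components can be illustrated by the following single-machine sketch of the queue-based pipeline (the two worker functions are hypothetical stand-ins for the components described above):

    import queue, threading

    # Minimal sketch of the queue-based DWQL pipeline (names hypothetical).
    nav_in = queue.Queue()    # in-queue of the navigation component
    nav_out = queue.Queue()   # out-queue DWN = in-queue of extraction
    ext_out = queue.Queue()   # out-queue of the extraction component

    def navigation_worker():
        while True:
            task = nav_in.get()                    # (view ID, input bindings)
            html = navigate_to_result_page(task)   # cf. Section 6.3.2
            nav_out.put(html)                      # HTML source as payload

    def extraction_worker():
        while True:
            html = nav_out.get()
            ext_out.put(extract_data_records(html))  # cf. Section 6.3.3

    for worker in (navigation_worker, extraction_worker):
        threading.Thread(target=worker, daemon=True).start()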
6.4. Related Work
The related work can be divided into two areas: Web form analysis and Deep Web navigation (1), and Web data extraction and data record labeling (2). (1) is covered in depth below; for surveys on (2), the reader is referred to [LRNdST02, CKGS06].

Web Form Analysis & Deep Web Navigation
A closely related research area is that of Deep Web crawlers [RGM01, LdSGL02, LESY02, LdSGL04, dCFS04, DFK+09], which strive to (semi-)automatically gather all the hidden data by consecutively filling out Web forms with different value combinations. Seminal work in this area was done in [RGM01], where a task-specific, human-assisted approach to crawling the Deep Web was proposed. This includes a generic high-level architecture for Deep Web crawlers and a set of performance metrics to evaluate the efficiency of implementations. Given a set of input values for different categories (that are initially provided by the user but can also be consecutively acquired during crawling), the crawler first extracts the labels of the Web forms by a heuristic that leverages the physical layout of the Web form. Label candidates are chosen by vertical and horizontal closeness to the respective input element and are then ranked
based on position, font size, etc. (cf. [RGM00]). Using a string edit distance metric, a match is computed between the categories and the highest-ranked label, and the associated words of the best-matching category are chosen (for input elements requiring textual input). The proposed method does not support dependencies between input elements, but reports a high accuracy of correctly extracted labels, which could be used for suggesting tag labels to the user for the labeling of input elements (cf. Section 6.3.1). While [RGM01] requires a rendering of the input page to determine the distances, the SmartCrawl system [dCFS04] solely relies on a heuristic that can be evaluated on the HTML representation, and on a certain format to which labels have to adhere. Their experiments show slightly lower accuracy than [RGM01], which might still be sufficient for suggesting tag labels to the user. The work reported in [LdSGL02, LdSGL04] also approximates the position of labels by using the HTML tag structure. They differentiate two fixed navigation patterns for the interaction with Web forms: the first assumes that form submission directly yields a result page, while the second allows for one intermediate page (for refining the search). Intermediate pages are detected by checking for a result page. This is inverse to the way intermediate pages are identified in Section 6.3.2; there, intermediate pages are identified by keyword phrases. Additionally, in this thesis an arbitrary number of intermediate pages is allowed, where each intermediate page has a set of associated actions to reach the next (ideally result) page, which is identified by not containing any known keyword phrases. In [LESY02] a strategy is proposed that tries to restrict the search space (all possible combinations of input elements with bounded domains – ignoring text input fields) to the most promising combinations. By converting the fixed domains of the n considered input elements into the elements of an n-dimensional array, the search space is sampled by distributing the chosen input combinations evenly across this array using random sampling with maximal coverage. Afterwards, these input combinations are queried until a user-defined goal criterion, e.g. a percentage of data retrieved, has been reached. This is similar to the dependency check described in Section 6.3.1, although there the whole search space is considered, since the focus of the analysis was on reproducing the form behavior offline and not on actively gathering data with minimal effort. Finally, for crawling AJAX-based Web pages the issue arises, among others, that the URL does not change after an AJAX invocation. This is addressed in [DFK+09], where an AJAX crawler is introduced that models the Web as a hypergraph, where the edges are the normal HTML links and the nodes are automata that represent the AJAX interactions on a Web page. In this transition graph, a node represents an application state and an edge represents a transition between two states triggered by an event. The analysis of a single Web page is similar to the way the dependencies between input elements are analyzed: all events in a page are triggered and the automaton is built. However, forms that require user input are not considered. Another complementary line of research is geared towards the automatic construction and labeling of the interface schema of Web forms [HMYW05, DYM06]. The HTML
standard [RHJ99, PAA+02] provides a special <label>...</label> tag that allows a machine-understandable label to be assigned to a form input element, but it is seldom used in practice, and thus there is often no formal association of labels with their respective elements available. The WISE-iExtractor [HMYW05] addresses this issue (among others) by declaring a high-level interface schema model, and also proposes a layout-expression-based form extraction approach called LEX. The schema model captures the relationships between elements (e.g. range type, part type, etc.) as well as the logical relationships between attributes (i.e. how attributes are combined to form a query). The schema also caters for the datatype and the unit of an input element, and the approach as such can serve as an aid for deriving the input signature of a WDSDL specification (cf. Example 34 on page 117). LEX is based on an approximation of the visual layout derived from the HTML tag structure (i.e. no rendering of the Web page is required, as in [RGM01]) and five heuristic measures. The WISE-iExtractor does not analyze JavaScript interactions, and no functional dependencies between input elements are considered. In contrast, [DYM06] assumes a set of query interfaces and a schema tree of an already integrated query interface (plus some additional data structures that are specific to their method) as inputs, with the goal of labeling the integrated query interface. They also propose a strategy for discovering the semantic relationships between labels of different Web forms in the same domain. This strategy could be modified to identify meaningful tags in the user vocabulary or to propose new meaningful tags for the labeling of input elements. The Metamorph project [HKB08] relies on manual tagging of form elements. They use an RDF/OWL ontology that captures the notions of a Web form, domains, actions, and a schema for queries, which are connected by a set of rules. The aim of this project is to generate maintainable vertical Deep Web search engines that only require the tagging of Web form elements with ontology concepts, which can be done by a service designer, and initially (for each domain) a domain subontology, which requires the skill of a trained ontology engineer. The idea of manually tagging Web form elements (and the data records contained in result pages) for generating high-quality WDSDL descriptions (cf. Sections 4 and 6.3.1) is also favored in this thesis, but the goal is to provide (Deep) Web data source views that are generic in terms of the WDSDL description and the high-level Deep Web access model.

Most users like to store pointers to Web pages they recently visited as bookmarks in their browser. This is only possible for static URLs; once the desired page is only reachable by performing a sequence of steps and filling out several Web forms along the way, storing a pointer to the result URL is usually futile. This problem is addressed by so-called smart bookmarks [AFKL00, FKL01, HM07, MPR+09b, MPR+09a, LNL+10], which store the actions a user performs and replay them later for the automated retrieval of the desired Web page. The WebVCR [AFKL00] system9 offers two modes: record and play. If the user selects the record mode, all user actions are tracked by attaching event handlers to all clickable elements in the page using LiveConnect [Fla98] until the user signals the system that the desired page has been reached.

9 The name is inspired by the analogy to the way a regular video cassette recorder (VCR) operates.
The actions are stored in the order in which they were performed and can be replayed in the play mode. The internal representation strongly mimics the DOM tree structure of the HTML document at the time of recording, but heuristics are presented for dealing with changes in the DOM structure during the execution of a smart bookmark. The recording of user actions is similar to the one employed in this thesis, which also builds on top of LiveConnect (cf. [WH08b] for a more detailed discussion of the implementation details). Yet, while the WebVCR system relies on a fixed order of actions that need to be executed in sequence, the method chosen here does not depend on a fixed order but uses the Page-Keyword-Action (PKA) paradigm to identify intermediate pages and then perform the associated actions. Experiments reported in [Wan08] have shown this approach to be well-suited for most typical Deep Web pages. In a follow-up publication, the WebVCR system is used as a component to realize WebViews [FKL01]: customized views of Web content and services. This is achieved by executing the recorded smart bookmarks on a server that uses another component to extract a part of the DOM tree of the reached HTML page, which is then sent to a mobile device. The intuition is that mobile devices in 2001 were not able to deal with complex Web interfaces and that bandwidth was an issue. By parameterizing smart bookmarks, more dynamic behavior can be achieved: the user is prompted on a Web page to provide the inputs for all Web forms encountered during the execution of the smart bookmark (where fresh values should be provided). A problem that can occur here is that the Web sites need to be deterministic; if the choices result in an alternative navigation path, the WebVCR component is not able to navigate to the correct page. This is not a problem in the PKA paradigm, since explicit navigation dependencies are not important. The specification of the relevant subpart of the DOM tree is realized as XPath [CD99] expressions, and the concept of a WebView can be seen as the analogue of a WDSDL view for humans, since no explicit extraction and labeling of the data records in the result page is considered; only a part of the HTML page is extracted and returned without further post-processing. The smart bookmark systems discussed so far are all pro-active, i.e. they require the user to signal when the recording of bookmarks starts and ends. Another avenue was taken with the Smart Bookmarks [HM07] and ActionShot [LNL+10] systems, which both support the retro-active generation of smart bookmarks. The former continuously monitors and records the user actions during a single session, while the latter always records all user actions. Both systems support the concept of a natural-language-like scripting language, where Smart Bookmarks relies on the Chickenfoot scripting language [BWR+05] and ActionShot is built on top of CoScripter [LHML08]. ActionShot mentions support for conditionals, i.e. the execution of the recorded actions does not have to be exactly in the order in which they were recorded, while Smart Bookmarks still relies on a fixed order and just replays the sequence of actions. The history feature of the ActionShot system automatically discerns actions on different Web sites, which might be interesting to investigate for retro-actively mining the actions for intermediate pages.
The problem of AJAX has also been studied in the context of smart bookmarks in [MPR+09b, MPR+09a]. There, the authors propose to overcome the problem of event detection by changing the method by which such events are recorded. In their solution, the user can select the target of the event invocation by hovering the mouse over the object in the Web browser and then select the event that should be invoked from a list of available events. This way, the system knows which event should be triggered on which object at replay time. The detection of the relevant object is based on the coordinates of the mouse pointer, where the DOM node is addressed by an automatically computed XPath expression. The problem then arises of how to detect the end of the effects of an event. This is addressed by registering an additional event listener for the node that captures the event that is to be triggered. Also, the asynchronous functions are re-implemented such that a counter is kept for each invocation, which is incremented before the function is called and decremented once the result is available. The task of the newly inserted event listener is to poll this counter to make sure that all open asynchronous calls have been processed before the next recorded event is fired. This solution has one major drawback: only an expert can know what kind of event she wants to trigger; a casual user is not interested in the inner workings of her browser. As such, it would be interesting to combine the PKA paradigm, which is geared towards casual users, with their solution to allow for a hybrid acquisition procedure of the actions once an intermediate page has been identified by a keyword phrase, where expert users could opt for directly specifying the event that should be triggered. Note that the key idea of the PKA paradigm would still be intact; only the possibilities for recognizing and replaying events would need to be extended.

More complex interaction patterns for Deep Web navigation are explored by workflow-based systems [DFKR99, BCL05, GZC07, MPR+08b, MPR+08a]. In [DFKR99] a layered architecture for querying Deep Web data sources is presented that aims at answering conjunctive queries over different Deep Web data sources. The virtual physical schema layer, which is concerned with the navigation to result pages (and the extraction of the results contained therein), is most related to this chapter. Similar to smart bookmarks, users can (semi-)automatically record actions they perform on a Web site, which are then converted into so-called navigation maps. Intuitively, navigation maps are directed labeled graphs, where the nodes are the pages that are encountered during the navigation procedure and the actions are added as labeled edges of the navigation map. From these navigation maps, executable navigation expressions in Transaction F-Logic [Kif95], which has support for recursion and a notion of ordering of events, are created. Follow-up work discusses the implementation-specific aspects of this layer [DYKR00] using the FLORA system [YK00], and presents an application for building Personal Information Assistants [JKL+04]. Since the PKA paradigm checks all keyword patterns for each new page, the previous state of the system is not relevant; hence it does not depend on a rigid sequence of intermediate pages and consequently requires no complex navigation calculus. As a downside, only one outgoing edge – in analogy with navigation maps – is supported, i.e. for each intermediate page only one set of actions can be associated. So far, this has not proven to be an issue in navigating Deep Web sources, but it could also be solved
by introducing for each intermediate page a set of action records (each record consisting of possibly multiple actions), where each action record is associated with a condition – akin to a local branching version of what ActionShot [LNL+10] proposes for the complete navigation sequence.

The Deep Web navigation solution of the Lixto suite10, described in [BCL05], uses the Mozilla Web browser11 as an integrated component in a server architecture for replaying navigation steps and for the acquisition of new navigation models. Their process model includes document classes, linear actions, and wrap actions based on Elog programs [BFG01], combined with navigation actions and recursion. As future work, they envision investigating the benefit of using standard business process languages, e.g. WSBPEL [AAA+07], for Deep Web navigation. The PKA paradigm is also implemented using the Mozilla browser in a server-style mode, where multiple tabs are managed by the system to support multi-threading inside the browser and to parallelize the navigation of requests (cf. Section 6.3.4). The concept of document classes is akin to the notion of intermediate pages introduced in this thesis, although the detection of document classes is not presented in [BCL05]. A more recent overview of the Lixto suite, with a special focus on the scalability of its architecture in a cloud computing [QLDG09, Cre09] setting, can be found in [BGH09].

10 http://www.lixto.com
11 http://www.mozilla.org

In [GZC07], a workflow description language based on JavaScript is proposed. It supports the recording of user actions and claims to implement all workflow patterns identified by van der Aalst in [vdAtHKB03]. Tasks are the basic blocks of the workflow language and can themselves be scripts, allowing for very expressive modeling of user interactions, even on AJAX-enabled Web pages. The extraction of data records from result pages can also be modeled as a task. As such, the workflow language is not constrained to Deep Web navigation but could be used for a light-weight implementation of the DWQL engine inside a Web browser (cf. Figure 6.13).

Founded on an exhaustive study of real-life wrappers and also inspired by the van der Aalst workflow patterns [vdAtHKB03], a graphical workflow language for defining wrappers is described in [MPR+08b, MPR+08a]. The suggested patterns that are most relevant deal with conditional branching, exception management, parallelism, asynchronous events (e.g. AJAX), and sub-processes. Their workflow model supports all these patterns and is based on a typed data model. As a special feature, they support the reuse of already defined components both on the binary level (executable activities) and on the source level (templates for typical tasks, such as accessing subsequent result pages by following the "next" link). The expressive power of their approach is geared towards modeling parts of business processes that take place on a single Web site, which is beyond the scope of this thesis. Here, Deep Web navigation is only used as a means to an end: finding the result page for extracting data records, which is sufficiently covered by the PKA paradigm.

There also exists a vast amount of further research in related areas, such as user activity
tracking [AWS06], or adapting screen readers to the Web 2.0 paradigm [HRBA09]. The former area is concerned with uncovering the "implicit interaction" of users on Web sites, e.g. how long a user takes to fill out a Web form. The technique used to gather the relevant data in [AWS06] is similar to the one proposed for WebVCR [AFKL00]; the major difference is that they use an HTTP proxy in the middle that preprocesses the HTML pages from a server, inserts the relevant event listeners, and logs the user behavior. Besides, the focus is not on replaying events but on a detailed analysis of the user behavior on a Web site, including mouse movements, which are usually not considered for e.g. smart bookmarks. The latter area deals with the additional complexity introduced by Web 2.0 technologies, which result in dynamically changing Web pages. These pose a significant problem for screen readers, which transform the visual and textual information on a Web page to other modalities, e.g. for viewing and editing colored graphics on Braille devices [TE09]. The models that are researched there (cf. [HRBA09]) can provide meaningful insights for developing more accurate models for Deep Web navigation. Lastly, the problem of automatic maintenance of Web wrappers – with a special focus on the navigation component of the wrapper – has been studied in [RÁLP06, PRÁ+07]. There, a navigation sequence is expressed as a list of interface events in the Navigation Sequence Language (NSEQL) [PRÁ+02]. NSEQL shares similarities with the Chickenfoot scripting language in that the interface events are also translated to a declarative macro language, which is also capable of identifying the respective input elements by context information, such as the title of the element (instead of resorting to XPath expressions). However, Chickenfoot relies on a more robust algorithm for identifying the relevant input elements.
Part IV. Combining Web Data Sources
Chapter 7. Tailoring CCS to Query Workflows

There are many ways of trying to understand programs. People often rely too much on one way, which is called “debugging” and consists of running a partly-understood program to see if it does what you expected. Another way [. . . ] is to install some means of understanding in the very programs themselves.
– Robin Milner (January 13, 1934 – March 20, 2010)
Contents
7.1. Introduction . . . 139
7.2. Preliminaries: MARS & CCS . . . 141
    7.2.1. The Process Model: Processes and their Constituents . . . 142
    7.2.2. State, Communication, and Data Flow via Variable Bindings . . . 143
7.3. RelCCS: The Relational Data Flow Process Language . . . 144
    7.3.1. Syntax and Semantics of RelCCS . . . 144
    7.3.2. Recursive Processes in RelCCS . . . 149
    7.3.3. Data-Oriented RelCCS Operators . . . 151
    7.3.4. Technical Realization . . . 153
7.4. Use Cases . . . 153
    7.4.1. Bacon Numbers . . . 154
    7.4.2. Travel Planning . . . 159
7.5. Related Work . . . 160
7.1. Introduction

Most of the information that is needed for daily tasks is available on the Web. So far, the access to a single Web data source has been studied. Yet, many information needs can only be satisfied by combining several Web data sources efficiently and appropriately in an automatic way.
Efficiency here does not necessarily mean handling millions of data items; often it means organizing the process of combining, evaluating, making decisions, and interacting for a relatively small number of items scattered over multiple Web data sources. Consider for example travel planning: not only does the nearest airport to a destination have to be found, but, depending on the airlines, different airports must be considered, and the availability of flights has to be checked. Then, transportation from/to the airports, possibly provided by local railway companies, has to be arranged. Even employees of travel agencies usually process such enquiries manually, which requires a lot of time and is potentially incomplete and suboptimal. Although the manual process follows a small number of common patterns (e.g. searching for paths in a transitive relationship distributed over several sources, like flight schedules and train schedules, with heuristics for bridging long distances vs. shorter distances, making prereservations, doing backtracking), it is hard to automate, since the sources are not integrated and the underlying formalism has to cover both procedural tasks and data manipulation tasks. Often it is easier to design the process of how to solve such a problem than to state a single query. The source integration problem has been addressed in Part III of this thesis on the level of modeling and accessing Web data sources in a transparent way, yet the intelligent combination still remains to be solved. This technical environment, together with the intrinsic complexity of the tasks, calls for flexible query workflows using a generic data model and an extensible set of functional modules, including the ability to interact actively with remote services. Important basic functionality includes appropriate mechanisms to deal with information acquisition and target-driven information processing on a high level, like using design patterns for acting on graph-structured domains (cf. Chapter 8). Being a highly specific application, travel planning is arguably not the only domain that benefits from a declarative formalism for query workflows. Example 37 illustrates the hurdles during information acquisition in a dynamic setting with multiple Web data sources with a more mundane use case.

Example 37 In the mathematics community there is a well-known phenomenon called the “Erdős Numbers”, named after the immensely prolific mathematician Paul Erdős (March 26, 1913 – September 20, 1996), which captures the co-author distance (i.e. the minimum number of edges) between nodes in a tree where Erdős is at the root; e.g. the Erdős Number of the author of this thesis is at most 5 via the connection Thomas Hornung – Georg Lausen – Peter Widmayer – Nicola Santoro – Shmuel Zaks – Paul Erdős. A similar concept is used in the movie industry, with the actor Kevin Bacon at the root of the tree, where the edges are motivated by co-appearances in a movie (the Web site http://oracleofbacon.org precomputes the shortest distances between all actors in the Internet Movie Database offline, based on a dump file, at regular intervals). Now suppose a movie aficionado wants to determine the Kevin Bacon number of her favorite actor based on the Internet Movie Database (http://www.imdb.com), where she can enter the name of her favorite actor and obtains all the movies he/she has starred in (i.e., in WDSDL terms, the signature of the Web data source is (movie) ← actor2movie(actor)). Additionally, there is the possibility to search by movie to obtain the actors that played in that movie ((actor) ← movie2actor(movie)). Thus, she uses all the movies found in the first round and checks
for each of them in turn whether Kevin Bacon appeared in it (which would mean that the Kevin Bacon number of her favorite actor is 1). Unfortunately, this is not the case, and she starts the search again with all new actors she discovered in this round; for each of these she has to check in which movies they starred and afterwards, for each movie, whether Kevin Bacon was in it (which would give her favorite actor a Kevin Bacon number of 2), and so on. The underlying search pattern is a recursive invocation of the two Web data sources actor2movie and movie2actor until the desired result has been obtained, where partial results need to be propagated from one Web data source to the other. In the following, an approach is presented that satisfies the above requirements, i.e. it supports the search patterns needed for realizing Example 37, but also has the expressive power to tackle the complexity of more involved scenarios such as travel planning. The core aspects are the intertwined description of the control flow of the process (by a process algebra, here the Calculus of Communicating Systems (CCS) by Robin Milner [Mil83]) and of the handling of the data flow (based on the relational model), and the use of heterogeneous atomic constituents like queries and actions in the workflow: CCS is extended to relational data flow, called RelCCS [HML09], and realized as a language in the MARS framework (cf. Chapter 3). RelCCS is complementary to the original rule-based MARS paradigm and employs the functionality of the MARS framework as an infrastructure. The focus is not on performance, but on the qualitative ability to express and execute complex workflows and decision processes in a reasonable time – i.e. to replace hours of interactive human Web search by an unsupervised automated process that may also take hours but finally results in one or more proposals, including the optimal one. The process design/programming in the RelCCS language is not expected to be done by casual users, but by skilled process designers in cooperation with domain experts – analogously to application database design.

Chapter Structure
Section 7.2 recapitulates the salient concepts of the MARS framework and the Calculus of Communicating Systems (CCS) from Chapter 3 that form the basis of the RelCCS language. The discussion is continued with the syntax and semantics of the RelCCS language, which extends CCS with an explicit data flow, in Section 7.3. To further the understanding of the RelCCS language, a solution for the Kevin Bacon number use case (cf. Example 37), as well as an outlook on the running travel planning scenario (cf. Chapter 9), is provided in Section 7.4. Finally, Section 7.5 compares RelCCS to prior work.
7.2. Preliminaries: MARS & CCS

RelCCS [HML09] is a variant of the well-known Calculus of Communicating Systems (CCS) process algebra [Mil83], tailored to relational data flow. It has been designed as a part of the MARS (Modular Active Rules for the Semantic Web) framework [MAA05b], whose
central metaphor is a model and an architecture for active rules and processes that use heterogeneous event, query, and action languages. This distinctive feature of MARS allows embedding sublanguages for queries, and even supplementary generic data structures via APIs expressed as actions and queries (cf. Chapter 8), into the workflows to be specified. Here, only those concepts of MARS are reiterated that are necessary for understanding the realization of RelCCS; a more detailed overview of the MARS framework can be found in Chapter 3. The MARS meta-model distinguishes rules (not relevant in this thesis), events (which may also occur in CCS workflows as described in this chapter), queries, tests, and actions/processes (cf. Section 7.2.1); the data flow is based on sets of tuples of variable bindings (like in Datalog; cf. Section 7.2.2). The MARS meta-language concept relies on an XML markup for nested expressions of different languages.
7.2.1. The Process Model: Processes and their Constituents

The CCS Process Algebra
Processes can formally be described by process algebras; the most prominent ones are CCS – Calculus of Communicating Systems [Mil83] – and CSP – Communicating Sequential Processes [Hoa85]; CCS was chosen as the base to develop RelCCS. A CCS algebra with a carrier set A (its atomic constituents) is defined as follows (here, the asynchronous variant of CCS that allows for implicit delays is considered), using a set of process variables:
• every a ∈ A is a process expression,
• with X a process variable and P a process, X := P is a process definition and a process, and X is a process expression (equivalently, recursive processes can be described by a fixpoint operator; here, the process definition variant is chosen),
• with a ∈ A and P a process expression, a.P is a process expression (prefixing; sequential composition),
• with P and Q process expressions, Sequence(P,Q) is a process expression (sequential composition),
• with P and Q process expressions, P|Q is a process expression (concurrent composition),
• with I a set of indices and Pi (i ∈ I) process expressions, $\sum_{i \in I} P_i$ (binary notation: P1 + P2) is a process expression (alternative composition),
• 0 is a process that stops execution.
Process expressions in which every occurring process variable also has an accompanying process definition are processes. The (operational) semantics is defined in [Mil83] by transition rules that immediately induce an implementation strategy. By carrying out an action, a process changes into another process, as shown in Figure 7.1. Note that prefixing a.Q is actually a special case of Sequence(P,Q) where P is atomic.
Figure 7.1 CCS transition rules
\[
a.P \xrightarrow{a} P, \qquad
\frac{P \xrightarrow{a} P'}{(P,Q) \xrightarrow{a} (P',Q)}, \qquad
\frac{P_i \xrightarrow{a} P' \;(\text{for } i \in I)}{\sum_{i \in I} P_i \xrightarrow{a} P'},
\]
\[
\frac{P \xrightarrow{a} P'}{P \,|\, Q \xrightarrow{a} P' \,|\, Q}, \qquad
\frac{Q \xrightarrow{a} Q'}{P \,|\, Q \xrightarrow{a} P \,|\, Q'}, \qquad
\frac{X := P \qquad P \xrightarrow{a} P'}{X \xrightarrow{a} P'}.
\]
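For illustration (the example process is chosen here and does not stem from the original figure): for the process a.(b.0 + c.0), the prefixing rule licenses an a-transition, after which the alternative-composition rule licenses a b-transition via the left summand:
\[
a.(b.0 + c.0) \xrightarrow{a} b.0 + c.0 \xrightarrow{b} 0 .
\]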
While in CCS the state of a process is encoded in its behavior (via the possible actions), the definition is generalized to processes with an explicit state described by sets of tuples of variable bindings in Sections 7.2.2 and 7.3.1.

Atomic Constituents
While in the basic formalism of CCS all atomic constituents are considered to be actions, in the context of this thesis, atomic constituents are event specifications, queries, tests, and atomic actions:
• atomic actions: these are actually executed as actions, e.g. by Web Services,
• event specifications as atomic constituents: executing an event specification means to wait for an occurrence of the specified event, incorporate the results in the state of the process, and then continue,
• executing a query means to evaluate the query, incorporate the results in the state of the process, and continue the process,
• executing a test means to evaluate it, incorporate the results in the state of the process, and continue appropriately.
The approach is parametric in the languages used for expressing the constituents. Users write their processes in RelCCS, embedding atomic constituents in sublanguages of their choice. While the semantics of RelCCS provides the global semantics, the constituents are handled by specific services that implement the respective languages.
7.2.2. State, Communication, and Data Flow via Variable Bindings

The state of a process, and the data flow through the process and to/from the processors of the constituents, is provided by logical variables in the style of deductive rules, production rules, etc.: the state of the computation of a process is represented by a set of tuples of variable bindings, i.e. every tuple is of the form t = {v1/x1, . . . , vn/xn} with v1, . . . , vn variables and x1, . . . , xn elements of the underlying domain (which is the set of strings, numbers, and XML literals; additionally, variables can be bound to XML fragments or RDF graphs). Thus, for given active variables v1, . . . , vm, such a state can be seen as a relation whose attributes are the names of the variables. In the remainder, a process expression P to be executed in a current state R is denoted by P[R], where the associated semantics will be defined via the function [[P, R]]RelCCS.
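For instance (an illustrative state, not taken from the original text), the set of tuples {(country/“Germany”, capital/“Berlin”), (country/“France”, capital/“Paris”)} over the active variables country and capital corresponds to the relation
\[
\begin{array}{ll}
\textit{country} & \textit{capital} \\
\hline
\text{“Germany”} & \text{“Berlin”} \\
\text{“France”} & \text{“Paris”}
\end{array}
\]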
By that, the approach only minimally constrains the embedded languages. For instance, all paradigms of query languages can be used: those following a functional style (such as XPath/XQuery), a logic style (such as Datalog or SPARQL), or both (F-Logic [KLW95]). The semantics of events (which are actually “queries” against an event stream that are evaluated incrementally) is – from that aspect – very similar, and actions take a set of tuples of variable bindings as input.

Complete vs. Partial Answers
An intuitive and simple assumption is that the whole set of tuples proceeds synchronously through the process, like the view on relational algebra when it is taught in courses. This assumption is not yet broken when internal iterator-based algorithms apply. In the present case, however, tuples may also proceed asynchronously through the process. Consider first an asynchronous remote service where e.g. a query is submitted and not answered immediately via the same HTTP connection; instead, later on, a new connection is established to the requester and the answer is returned. Still, the whole set of tuples is handled together, although there is asynchronous communication. Moreover, there may be sources that do not return the complete answer as a whole, but send back some answers that can be computed quickly, and later send back further tuples. This is called a partial answer. As it is preferable to continue the process with the available tuples (even more so if it is not known whether later answers will arrive at all), MARS supports this kind of asynchronous processing of workflows. Note that this is obviously needed when event matching is applied: on every matching event, the processing of the appropriate tuples continues. Thus, tuples that started together can later be at different stages of the same workflow. In the following, this is called dispersed execution.
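Anticipating the operator notation defined in the next section, dispersed execution can be sketched as follows; the views flightSearch and priceFor are hypothetical and only serve to illustrate the idea:

Sequence(
    Query((flight) ← flightSearch(route)),
    Query((price) ← priceFor(flight)))

If flightSearch returns a partial answer, the flight tuples received so far already proceed to the evaluation of priceFor, while tuples arriving later follow independently; tuples that started together may thus be at different stages of the sequence at any given time.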
7.3. RelCCS: The Relational Data Flow Process Language

Now that the required notation has been introduced (or refreshed), the syntax and semantics of the RelCCS language are formally defined in Section 7.3.1. Recursion is discussed separately in Section 7.3.2, while Section 7.3.3 is devoted to the data-oriented operators of the RelCCS language, and Section 7.3.4 comments on implementation details in the context of the MARS framework. In the remainder of this section, some familiarity with the relational algebra is assumed. For an introduction, the interested reader is referred to [AHV95, Lau05]. As usual, π, σ, ⋈, ⋉, and ρ denote relational projection, selection, natural join, (left) semijoin, and renaming, respectively.
7.3.1. Syntax and Semantics of RelCCS

RelCCS combines the constructs of CCS with relational data flow. Syntactically, it uses mnemonic names (which are also used in its XML markup) instead of the CCS operator
symbols that are shown in Figure 7.1. Let P denote the set of process expressions, and let V denote the set of variable names. For a given finite set Var ⊆ V, Tuples(Var) denotes the set of possible tuples over Var. As usual, 2^Tuples(Var) denotes the set of sets of tuples over Var. A given set R of tuples of variable bindings is thus an element R ∈ 2^Tuples(Var).
The mapping [[·, ·]]RelCCS : P × 2^Tuples(Var) → 2^Tuples(Var) specifies the formal semantics by mapping a process expression P ∈ P and a set R of tuples of variable bindings to the set [[P, R]]RelCCS of tuples of variable bindings that result from the execution of process P in state R. The definition of this denotational semantics [[P, R]]RelCCS by structural induction over P is given below.

Example 38 Consider a simple query q whose answers are all pairs (c, b) such that c is a country and b is a city in c with more than one million inhabitants:
[[q, {(c/“Germany”), (c/“Austria”), (c/“Switzerland”), (c/“Joe”)}]]RelCCS =
{(c/“Germany”, b/“Berlin”), (c/“Germany”, b/“Hamburg”), (c/“Germany”, b/“Munich”), (c/“Austria”, b/“Vienna”)}
There is no resulting tuple for “Switzerland”, because there are no cities with more than one million inhabitants in Switzerland, and there is no resulting tuple for “Joe”, since it is not a country at all. Also, answer tuples to q like {c/“France”, b/“Paris”} do not belong to the result, because their value for c does not match any value of c in the initial tuples.

Note that [[·, ·]]RelCCS is just the declarative semantics, which neither depends on nor prescribes the operational details of the actual evaluation. The query in Example 38 may be answered by computing R ⋈ σ[population > 1000000](City) for a suitable database relation City; alternatively, a Deep Web view q′ can be evaluated iteratively for every country, yielding e.g. q′((c/“Germany”)) = {(b/“Berlin”), (b/“Hamburg”), (b/“Munich”)} and generating the result set incrementally from the answers. The situation is similar to the relationship between the formal semantics of the relational algebra and actual query optimization and evaluation. In the remainder, RelCCS will (mainly) be used to combine different Web data sources – i.e. the views they expose – in query workflows. Yet, the semantics of RelCCS only prescribes the data flow on the process level, and every query language that has an associated processor registered with the Language and Service Registry (LSR) (cf. Chapter 3) can be used. The presentation continues with the atomic constituents of RelCCS processes.

Atomic Constituents
For atomic constituents p, [[p, R]]RelCCS extends R (queries, events), restricts R (tests), or just uses R and leaves it unchanged (actions):
• Actions: executing Action(a)[R] means to execute a for each tuple in R without changing the state R, i.e. [[Action(a), R]]RelCCS := R, plus external side effects of a.
• Query(q)[R]: R provides the input parameters to the query q. In contrast to Section 4.3, where the complete Web data source was formalized as a characteristic predicate, here a query q – which may be evaluated via a view of a Web
data source – can similarly be seen as a predicate q0 (its characteristic predicate, which is constituted by the relation that contains all possible input/output combinations; for a Web data source, this would be the characteristic predicate of a single view) over variables v = {v1, . . . , vn}, from which some variables qin = {vin1, . . . , vink} ⊆ {v1, . . . , vn} act as input variables, and the others qout = {vout1, . . . , voutm} = v \ qin act as output variables (note that here qout = v \ qin, whereas in Section 4.3 qout ⊆ v \ qin, since there each view may access different “slices” of the characteristic predicate of the respective Web data source). Given a tuple t ∈ R, the input tuple for q is tq := π[qin](t) and [[Query(q), {tq}]]RelCCS := {t′ ∈ q0 : tq ⊆ t′}. With this, let [[Query(q), {t}]]RelCCS := {t} ⋈ [[Query(q), {tq}]]RelCCS, and analogously, [[Query(q), R]]RelCCS := ⋃_{t∈R} [[Query(q), {t}]]RelCCS = R ⋈ q0. An equivalent set-oriented characterization is obtained by Rq := π[qin](R) and [[Query(q), Rq]]RelCCS := ⋃_{tq∈Rq} [[Query(q), {tq}]]RelCCS; then, [[Query(q), R]]RelCCS = R ⋈ [[Query(q), Rq]]RelCCS holds. Note that if a source answers partially, the whole process becomes dispersed, as the tuples are propagated independently through the evaluation. The notion of the characteristic predicate is illustrated further in Example 39.
• Test(c)[R]: the tuples t ∈ R that satisfy the test survive: [[Test(c), R]]RelCCS = σ[c](R), like SQL’s SELECT * FROM R WHERE c. Optionally, the test can be parameterized with a quantifier (exists|notExists|all): then the whole set R of tuples is taken, and if one, none, or all tuples t ∈ R satisfy the test, the result is R, otherwise ∅. For example, [[Test[exists](x = 3), R]]RelCCS = R if for some tuple t in R the value of the variable x is 3, otherwise [[Test[exists](x = 3), R]]RelCCS = ∅, similar to SQL’s WHERE [NOT] EXISTS [. . . NOT . . .] (cf. the discussion of MARS tests in Section 3.3.1).
• Event(ev)[R] is analogous to queries: for each tuple of R, events matching the given event specification are caught and the variable bindings are appropriately extended. For the present application of query answering, events actually play a minor role; they can be used for designing complex workflows manually. Here, the semantics is just given for the sake of completeness: given an event occurrence ev0 that matches the event specification ev for a certain tuple t ∈ R, resulting in a set of tuples ev0(t), [[Event(ev), R]]RelCCS contains R ⋈ ev0(t) (the actual semantics of “matching” depends on the embedded event specification language, cf. Section 3.2.1). Thus, at a given timepoint τ, [[Event(ev), R]]RelCCS = R ⋈ {ev0(t) : ev0 occurred between “starting” ev[R] and τ}. Note that event detection inherently makes the evaluation process dispersed, since the evaluation is continued separately for each occurrence of a matching event.

Example 39 The characteristic predicate of the view (actor) ← movie2actor(movie), named movie2actor0, is defined over the variables v = (actor, movie) with qin = (movie) and qout = (actor). The characteristic predicate of the inverse view (movie) ← actor2movie(actor)
(actor2movie0) is also defined over the variables v, but with input and output variables interchanged. For both views, the extension of the characteristic predicates is given by the relation m2a depicted in Table 7.1.

    m2a :=   movie              actor
             “Pulp Fiction”     “Samuel L. Jackson”
             “Pulp Fiction”     “John Travolta”
             ...                ...
             “Mystic River”     “Tim Robbins”
             “Mystic River”     “Kevin Bacon”
             ...                ...

Table 7.1.: Extension of the characteristic predicates movie2actor0 and actor2movie0

Now, let R1 := {(movie/“Pulp Fiction”)} and R2 := {(actor/“Kevin Bacon”)}; then the query movie2actor((movie/“Pulp Fiction”)) can be answered by computing R1 ⋈ m2a, and the query actor2movie((actor/“Kevin Bacon”)) can be evaluated as R2 ⋈ m2a. Note that the relation m2a is not related in any way to how the data is stored at the Internet Movie Database; it is only implicitly given as the extension of the characteristic predicates of the two views.

Operators
For every evaluation P[R], the set R of initial tuples is transformed by executing the process P, resulting in a new relation [[P, R]]RelCCS as “outcome” that is returned to the superior process.
• Prefixing, Sequence: Sequence(P, Q)[R] is executed by executing P[R], yielding R′, and then executing Q[R′]. This “common” interpretation of sequence actually builds upon the inner join: [[Sequence(P, Q), R]]RelCCS := [[Q, [[P, R]]RelCCS]]RelCCS. More explicitly, [[Sequence(P, Q), R]]RelCCS = [[P, R]]RelCCS ⋈ [[Q, [[P, R]]RelCCS]]RelCCS (analogously, [[Sequence(P1, . . . , Pn), R]]RelCCS is defined inductively). As a more general idea, tailored to the (more accidentally sequential) evaluation of queries, instead of ⋈, also left/right/full outer joins make sense, and even a modified form of relational difference as negation. For that, Sequence is parameterized as Sequence[join] (the default), Sequence[(left|right|full)–outer–join], and Sequence[minus]. The semantics of the (other) join variants is straightforward and is thus omitted. For the definition of the semantics of Sequence[minus](P1, . . . , Pn), assume Ri := [[Sequence[minus](P1, . . . , Pi), R]]RelCCS after step i (for i = 0: R0 := R), and let S := [[Pi+1, Ri]]RelCCS be the result of step i+1. Then, [[Sequence[minus](P1, . . . , Pi+1), R]]RelCCS := Ri \ (Ri ⋉ S). An application of the minus “mode” is the evaluation of queries that require negation. A first intuitive example is q1(A, B, X) ∧ ¬∃C, Y : q2(B, C, Y), which can be evaluated as Sequence[minus](Query(q1(A, B, X)), Query(q2(B, C, Y))). A more “naturalistic” query workflow using the minus mode is presented in Example 40.
• Alternative(P1, . . . , Pn)[R] and Union(P1, . . . , Pn)[R]: each branch is started with R. For the (full) union, the result tuples are the union R1 ∪ . . . ∪ Rn of the results of its branches: [[Union(P1, . . . , Pn), R]]RelCCS = [[P1, R]]RelCCS ∪ . . . ∪ [[Pn, R]]RelCCS. For the alternative, the following operational restriction holds: all branches have to be guarded, i.e. before the first action is executed, a test must be executed (optionally preceded by queries to obtain additional information). For instance, in Alternative(Sequence(Test(c), P1), Sequence(Test(¬c), P2)), all tuples that satisfy c will actually run through the first branch, and the others run through the second branch. In case the guards of the branches are exclusive, it holds that [[Alternative(P1, . . . , Pn), R]]RelCCS = [[Union(P1, . . . , Pn), R]]RelCCS. If the guards are non-exclusive, the actual outcome is nondeterministic: for each tuple t, the “quicker” branch will preempt the others and exclusively contribute [[Pi, {t}]]RelCCS to the result.
• Concurrent(P1, . . . , Pn)[R]: each branch is started with R. The result is defined by [[Concurrent(P1, . . . , Pn), R]]RelCCS := [[P1, R]]RelCCS ⋈ . . . ⋈ [[Pn, R]]RelCCS, i.e. each tuple runs through all branches (possibly being extended with further variables), and the results are joined. Note that if a tuple is removed in some branch, it will not occur in the result at all. In the presence of a dispersed branch, for every incoming answer from a branch it is checked whether all branches have already answered for the tuple. If yes, the joined result is forwarded to the subsequent processing; the remaining process will then also be dispersed. Like for sequences, the operator is also parameterized: in addition to ⋈, left/right/full outer join and relational difference are also allowed – in these cases, the branches must not be dispersed.
• Deadlock/Zero: the 0 process does nothing, and it does not continue: [[0, R]]RelCCS = ∅. It is e.g. used for terminating iterative processes (cf. Section 7.4).

Example 40 The minus mode is a natural choice when a Web data source can be used to filter the tuples accumulated so far in a query workflow. Assume for instance the following query workflow, which first assembles the movies with a rating higher than 8.5 on the Internet Movie Database that have been released after the year 1990 (tuples (i)), and afterwards removes those that have grossed more than 200 million dollars (tuples (ii) – note that the view (movie, boxOffice) ← topBoxOfficeMovies() is based on the “All-Time Worldwide Box office” list of the Internet Movie Database, which only lists movies that have grossed more than 200 million USD; consequently, an additional test on the set of tuples (ii) can be saved):

Concurrent[minus](
    Sequence(Query((rating, year, movie) ← topRatedMovies()),
             Test(rating > 8.5), Test(year > 1990)),
    Query((movie, boxOffice) ← topBoxOfficeMovies()))
    movie                        year    rating
    “The Shawshank Redemption”   “1994”  “9.1”
    “Pulp Fiction”               “1994”  “8.9”
    ...                          ...     ...
    “Léon”                       “1994”  “8.6”
    “Forrest Gump”               “1994”  “8.6”

Table 7.2.: #(Tuples) before evaluation of the Concurrent[minus] operator = 16

    movie                        year    rating
    “The Shawshank Redemption”   “1994”  “9.1”
    “Fight Club”                 “1999”  “8.7”
    “Cidade de Deus”             “2002”  “8.7”
    “The Usual Suspects”         “1995”  “8.7”
    “Memento”                    “2000”  “8.6”
    “Léon”                       “1994”  “8.6”

Table 7.3.: #(Tuples) after evaluation of the Concurrent[minus] operator = 6

The intuition behind this query workflow is to determine whether there is a connection between the user rating and the box office result of a movie. The query workflow determines the sets of tuples (i) and (ii) in parallel and then computes the set difference of (i) and (ii) based on the common variable “movie”. The set of tuples (i) is illustrated in Table 7.2 and consists of 16 tuples; the result of the set difference is shown in Table 7.3 and consists of 6 tuples, i.e. 10 of the 16 highly rated movies were also a commercial success.
7.3.2. Recursive Processes in RelCCS

Recursive processes extend the expressiveness of RelCCS from that of relational algebra (trees) to that of recursive Datalog, which e.g. allows computing the transitive closure. Recursive processes are defined by (i) giving and naming a process definition, and (ii) then using this definition somewhere in the process/tree. Since logical variables can be bound only once, variables that are bound to different values in each iteration must be considered local to the current iteration. They can be bound either when starting the iteration or in some step inside the iteration. Only the final result is then bound to the actual logical variable. For a process expression P ∈ P, pname[local : lv1, . . . , lvn] := P defines pname to be P where the variables lv1, . . . , lvn are local. Syntactically, the use of process definitions is of the form (e.g. in a sequence):
Sequence(. . . , CallProcess(pname[lvk1 ← vℓ1, . . . , lvkm ← vℓm]), . . .)
with the following semantics: let Var denote the set of active variables used in the surrounding context. The definition of pname is invoked based on the current tuples, where each tuple is extended or modified by initializing the local variables lvk1, . . . , lvkm (ki ∈ {1, . . . , n}) with the values of the variables vℓ1, . . . , vℓm ∈ Var. More formally:

[[CallProcess(pname[lvk1 ← vℓ1, . . . , lvkm ← vℓm]), R]]RelCCS :=
    [[P, {t ∈ Tuples(Var ∪ {lvk1, . . . , lvkm}) | there exists t′ ∈ R such that
        ρ[vℓ1 ← lvk1, . . . , vℓm ← lvkm](π[{lvk1, . . . , lvkm}](t)) = π[vℓ1, . . . , vℓm](t′) and
        π[Var \ {lv1, . . . , lvn}](t) = π[Var \ {lv1, . . . , lvn}](t′)}]]RelCCS

This is a subset of Tuples(Var ∪ {lv1, . . . , lvn}); recursive processes call themselves inside their definition (cf. Example 41 and Section 7.4.1) – there, {lvk1, . . . , lvkm} ⊆ Var.

Example 41 Consider a simple countdown process that, started with an integer n, simply calls itself again with n–1, n–2, . . . , 0, and then finishes (this pattern could e.g. be used to trigger an action n times):
1  countdown[local : n, m] :=
2    Alternative(Sequence(Test(n = 0), 0),
3                Sequence(Test(n > 0), Query(m := n – 1),
4                         CallProcess(countdown[n ← m])))
Assume the process is started with n = 6: initially m is undefined, i.e. the current variable bindings are {(n/“6”)}. The test in the first branch (Line 2) is not satisfied, but the test in the second branch (Line 3) is, and the variable bindings after the evaluation of the query are {(n/“6”, m/“5”)}. Now, at the start of the next iteration, the variable bindings are {(n/“5”)} (i.e. m is again undefined). This continues until the last round, where the iteration is started with the variable bindings {(n/“0”)}; then the test in the first branch is satisfied and the process terminates. In XML, this process is represented as a nested document over the CCS composers, with the recursive call invoked on Line 28 via the corresponding CallProcess element:
[The 33-line XML serialization of the countdown process could not be recovered from the source; only the embedded test fragment “0 ]]>” survives.]
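Following the markup conventions described in Section 7.3.4 – a tree over the CCS composers in the ccs namespace, with embedded expressions in foreign-language namespaces – the serialization may be sketched as follows. This is a reconstruction, not the original listing: all element and attribute names beyond the composer names, the namespace URIs, and the embedded XPath-style expressions are assumptions.

<ccs:Sequence xmlns:ccs="http://example.org/ccs"
              xmlns:xp="http://example.org/xpath">
  <!-- process definition: countdown with local variables n, m -->
  <ccs:ProcessDefinition name="countdown" local="n m">
    <ccs:Alternative>
      <ccs:Sequence>  <!-- first branch: terminate when n = 0 -->
        <ccs:Test><xp:Expression><![CDATA[ $n = 0 ]]></xp:Expression></ccs:Test>
        <ccs:Zero/>
      </ccs:Sequence>
      <ccs:Sequence>  <!-- second branch: bind m := n - 1 and recurse -->
        <ccs:Test><xp:Expression><![CDATA[ $n > 0 ]]></xp:Expression></ccs:Test>
        <ccs:Query bind="m"><xp:Expression><![CDATA[ $n - 1 ]]></xp:Expression></ccs:Query>
        <ccs:CallProcess name="countdown">
          <ccs:InitLocal variable="n" from="m"/>  <!-- n <- m for the next round -->
        </ccs:CallProcess>
      </ccs:Sequence>
    </ccs:Alternative>
  </ccs:ProcessDefinition>
  <ccs:CallProcess name="countdown"/>  <!-- start the recursion -->
</ccs:Sequence>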
7.3.3. Data-Oriented RelCCS Operators

While the above operators extend the classical CCS operators, which focus on the control flow, with a relational state, additional operators integrate unary relational operators: projection, duplicate elimination, top-k, and grouping.

Projection and Duplicate Elimination
In relational algebra, projection is a very useful operator for reducing intermediate results when some variables are no longer needed. For RelCCS, Projection(v1, . . . , vn), with a specification of which variables to keep, does the analogue during the execution of a process. Distinct removes duplicate tuples and is usually applied after a projection.

The RelCCS Top-K Operator
The top-k operator is known as a useful extension for query answering, especially in the context of joins [IAE04]. It allows to “take the best k answers” and continue. For instance, when a set of potentially relevant airports is known, only the nearest 10 to the starting place should be considered for continuing the process. Here, the top-k functionality is adapted to the dispersed processing of RelCCS workflows: applied
to a set of tuples over variables v1, . . . , vn, TopK(k, m, t, mapfct, datatype, order, cont?) acts as follows:
• wait until either m tuples are present or t time units have passed,
• then, for each tuple, compute mapfct(v1, . . . , vn) (expressed as an embedded query), which yields a value of datatype; order the tuples according to order (which can be either asc or desc), take the top k, and return them,
• if cont is true, then for every tuple coming in later, check whether it is amongst the best k seen so far. If yes, return it; otherwise discard it. This switch is optional (indicated by “?”), and by default false is assumed.
The top-k operator will feature prominently in the implementation of the travel planning application scenario in Chapter 9; a small stand-alone sketch is given at the end of this section.

Grouping
Another commonly used operation in relational databases is to group a result set by a list of variables. Then, for each group thus built, a function that returns a scalar value is evaluated, and the result is bound to a fresh variable. In RelCCS, this is supported by GroupBy((v1, . . . , vn), (grpfct1, . . . , grpfctk)), where the grouping function grpfcti = [fcti(vjl1, . . . , vjli), vn+i] (i ∈ {1, . . . , k}) binds the result of evaluating the function fcti(vjl1, . . . , vjli) to the variable vn+i. The operational semantics of the operator is to first group all tuples by (v1, . . . , vn), then compute all grouping functions grpfcti (i ∈ {1, . . . , k}) for each group separately, and finally project on the variables (v1, . . . , vn, vn+1, . . . , vn+k). Example 42 illustrates the use of the GroupBy operator in a query workflow from the movie domain.

Example 42 Suppose a user is interested in the reception of the movies by her favorite actors “Al Pacino” and “Jack Nicholson” in which they starred in the years 1991 – 1999. The list comprises 11 movies, starting with “Frankie and Johnny” (1991) and ending with “Any Given Sunday” (1999), for “Al Pacino”, and 9 movies, starting with “Man Trouble” (1992) and ending with “As Good As It Gets” (1997), for “Jack Nicholson”. She devises a query workflow that – started with the initial tuples {(actor/“Al Pacino”), (actor/“Jack Nicholson”)} – first assembles the relevant movies (bound variables: (actor, movie)), then enriches each movie with the associated rating and the year it was in theaters (bound variables: (actor, movie, rating, year)), and afterwards constrains the set of tuples to the relevant years. Finally, she groups by the actor (resulting in two groups) and computes the average over the ratings of all movies (i.e. in this case there is only one grouping function):
Sequence(
    Query((movie) ← actor2movie(actor)),
    Query((rating, year) ← movieInfo(movie)),
    Test(year > 1990), Test(year < 2000),
    GroupBy((actor), ([avg(rating), rating–avg])))
The average rating for the eleven movies of “Al Pacino” is
(6.4+7.9+7.7+7.9+6.1+8.3+6.1+7.7+7.3+8.0+6.6)/11 = 7.3,
and the average rating for the nine movies of “Jack Nicholson” is
(4.6+7.6+6.5+6.0+6.3+6.1+5.5+6.3+7.8)/9 = 6.3.
Thus, the final variable bindings of the process (after the projection) are: {(actor/“Al Pacino”, rating-avg/“7.3”), (actor/“Jack Nicholson”, rating-avg/“6.3”)}. If she is now interested in the highest or lowest rating for a movie in this period, she can easily change the grouping function to max or min, respectively. Alternatively, she can just add the respective grouping functions; e.g. the resulting tuples including the maximum and minimum rating would be: {(actor/“Al Pacino”, rating-avg/“7.3”, max/“8.3”, min/“6.1”), (actor/“Jack Nicholson”, rating-avg/“6.3”, max/“7.8”, min/“4.6”)}.
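The top-k operator announced above can be sketched analogously; the view (airport, distance) ← nearbyAirports(city) and the concrete rendering of the datatype are illustrative assumptions:

Sequence(
    Query((airport, distance) ← nearbyAirports(city)),
    TopK(10, 50, 60, distance, decimal, asc))

The operator waits until either 50 airport tuples are present or 60 time units have passed, orders the tuples by the value of distance in ascending order, and continues with the 10 nearest airports; since the optional cont switch defaults to false, tuples arriving later are not taken into account.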
7.3.4. Technical Realization

RelCCS has been implemented as a language service within the MARS framework. RelCCS processes are given as XML documents (or as RDF graphs), borrowing the main principles from the MARS ECA-ML markup language (cf. [MAA05b] and Chapter 3). The language markup has the usual form of a tree structure over the CCS composers in the ccs namespace. Every expression (i.e. the CCS process, its CCS subprocesses, the event specifications, tests, queries, and the atomic actions) is an XML (sub)tree whose namespace (i.e. the URI associated with the prefix) indicates the language. The services are implemented in Java, using a common set of basic classes that handle e.g. the variable bindings. The actual data exchange is done in an XML format for results and variable bindings, where for larger numbers of tuples an SQL database is used as backend [May09]. Determining an appropriate service and organizing the communication is performed by the Languages and Services Registry (LSR) and the Generic Request Handler (GRH) (cf. [FMS08] and Chapter 3).
7.4. Use Cases

So far, the examples in this chapter were mostly focused on illustrating the use of certain operators in small, isolated query workflows. This section presents two more involved use cases that require the interplay of many different operators. The first, presented in Section 7.4.1, is centered around the Bacon numbers (cf. Example 37 on page 140), a measure of relatedness between actors. The second use case, in Section 7.4.2, gives a first impression of the intricacies encountered when trying to automate travel planning.
7.4.1. Bacon Numbers

In the movie domain there is a phenomenon called the Bacon numbers (see http://oracleofbacon.org and Example 37 on page 140), named after the actor Kevin Bacon, similar to the Erdős number in mathematics (and computer science) publishing: the underlying graph contains an edge between two actors if both acted in the same film. In this context, the distance between two actors is defined as the length of the shortest path between them in this graph. The basic idea behind this query workflow is to start with the actor Kevin Bacon and then simulate a breadth-first-like behavior by recursively traversing the implicitly given co-actor graph. Operationally, this can be achieved by first querying the movies Kevin Bacon himself has played in, using the Deep Web view (movie) ← actor2movie(actor). Afterwards, this set of movies is used to obtain the set of actors with Bacon distance 1 by invoking the Deep Web view (actor) ← movie2actor(movie) with the movies as input. This concatenation of the two views is at the core of the process definition that is recursively called until the desired goal actor is found or the search space is exhausted (i.e. all actors that can be transitively reached starting from “Kevin Bacon” have been found). The XML serialization of the RelCCS language is introduced in the remainder alongside the discussion of the respective query workflows. So far, a pseudo-syntax has been used for RelCCS sample processes (except in Example 41), but for more complicated query workflows the proper nesting is more accessible in the XML rendition. For example, the generalized Bacon numbers – the shortest distance between two arbitrary actors – are computed by the following query workflow:
[The 86-line XML listing of this query workflow could not be recovered from the source; only the initial values “Kevin Bacon”, “Samuel L. Jackson”, and “0” survive. The line numbers cited in the walkthrough below refer to this listing.]
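In lieu of the lost markup, the structure of the workflow can be sketched in the pseudo-syntax used earlier. The sketch is reconstructed from the walkthrough below and is not the original listing; in particular, the rendering of the initializations and of the raised event is paraphrased:

Sequence(
    Query(from := "Kevin Bacon"), Query(to := "Samuel L. Jackson"), Query(steps := 0),
    foo:nextStep[local : inActorl, stepsl, stepsNewl, moviel, outActorl] :=
        Alternative(
            Sequence(Test[exists](inActorl = to),
                     Action(raise event carrying the value of stepsl)),
            Sequence(Test[notExists](inActorl = to),
                     Query(stepsNewl := stepsl + 1),
                     Query((moviel) ← actor2movie(inActorl)),
                     Query((outActorl) ← movie2actor(moviel)),
                     CallProcess(foo:nextStep[inActorl ← outActorl, stepsl ← stepsNewl]))),
    CallProcess(foo:nextStep[inActorl ← from, stepsl ← steps]))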
The whole process is a sequence that initially binds the three input variables: from, which designates the start of the graph search (in this case set to “Kevin Bacon”); to, which references the target actor (here “Samuel L. Jackson”); and lastly steps, which keeps track of the shortest distance between from and to in the graph, initialized to 0. Note that the process can be used to compute generalized Bacon numbers by changing from to the desired actor. The initialization itself is done via Query constituents, leveraging the MARS capability to embed opaque, external languages (here XPath), as presented in Chapter 3. The next step defines the recursive process definition foo:nextStep on Line 22 (delimited by the corresponding process definition element), and then starts the actual execution by calling foo:nextStep (Line 82), with the local variable inActorl bound to the value of the variable from (here “Kevin Bacon”) and with the value of the variable steps used for a variable that is local to foo:nextStep, also called steps. The other local variables of this recursion round will be bound later. The execution of the process continues in Line 28 with an Alternative and the initial tuple t1 := (inActorl/“Kevin Bacon”, stepsl/0, from/“Kevin Bacon”, to/“Samuel L. Jackson”).
The variables local to the recursive process definition carry the subscript “l” to distinguish them from the “globally” bound variables. The first alternative branch (Line 29) checks whether in at least one tuple – indicated by the quantifier exists – the variable inActorl has the same value as the variable to, which is clearly not the case (“Kevin Bacon” ≠ “Samuel L. Jackson”). But the test of the second branch (starting on Line 43), which requires that no tuple satisfies the condition of the first branch, is satisfied, and the process continues with the incrementation of the step counter in Line 51, binding the value stepsl+1 to the variable stepsNewl. Note that it is not possible to bind the result directly to the old variable, because local variables can only be initialized with a new value when invoking a process, or bound once during the execution (cf. the semantics of recursive processes in Section 7.3.2). Next, the view (movie) ← actor2movie(actor) is evaluated via the Deep Web Query Language (DWQL, cf. Chapter 5) on Line 57, yielding all movies in which inActorl (here “Kevin Bacon”) played, and resulting in tuples over the variables (inActorl, stepsl, stepsNewl, moviel, from, to):
{(inActorl/“Kevin Bacon”, . . . , stepsNewl/1, moviel/“The Air I Breathe”, . . . , to/“Samuel L. Jackson”),
 . . . ,
 (inActorl/“Kevin Bacon”, . . . , stepsNewl/1, moviel/“Loverboy”, . . . , to/“Samuel L. Jackson”)}
For the respective movies, the Deep Web query (actor) ← movie2actor(movie) (Line 63) yields all actors who were cast in moviel, resulting in tuples now over (inActorl, stepsl, stepsNewl, moviel, outActorl, from, to) (the output variable actor of the view movie2actor has been renamed on the fly by the MARS framework, cf. Line 65 of the process and Section 3.2.1):
{(inActorl/“Kevin Bacon”, . . . , stepsNewl/1, moviel/“The Air I Breathe”, outActorl/“Sarah M. Gellar”, . . . , to/“Samuel L. Jackson”),
 (inActorl/“Kevin Bacon”, . . . , stepsNewl/1, moviel/“The Air I Breathe”, outActorl/“Forest Whitaker”, . . . , to/“Samuel L. Jackson”),
 . . . ,
 (inActorl/“Kevin Bacon”, . . . , stepsNewl/1, moviel/“Loverboy”, outActorl/“Kyra Sedgwick”, . . . , to/“Samuel L. Jackson”),
 (inActorl/“Kevin Bacon”, . . . , stepsNewl/1, moviel/“Loverboy”, outActorl/“Marisa Tomei”, . . . , to/“Samuel L. Jackson”)}
After that, the recursive definition is invoked (cf. Line 75), with the local variable inActorl initialized with the current value of outActorl (while the local variables moviel, outActorl, and stepsNewl remain unbound at the beginning of this round – resulting in the same bound variables as for t1 before). Additionally, the local variable stepsl is initialized with the current value of stepsNewl (= 1 in the first round):
{(inActorl/“Sarah M. Gellar”, stepsl/1, from/“Kevin Bacon”, to/“Samuel L. Jackson”),
 (inActorl/“Forest Whitaker”, stepsl/1, from/“Kevin Bacon”, to/“Samuel L. Jackson”),
 . . . ,
 (inActorl/“Kyra Sedgwick”, stepsl/1, from/“Kevin Bacon”, to/“Samuel L. Jackson”),
 (inActorl/“Marisa Tomei”, stepsl/1, from/“Kevin Bacon”, to/“Samuel L. Jackson”)}
In this round, the test whether there is a tuple with inActorl = to fails as well, and again the value stepsl+1 (1+1 = 2) is bound to the variable stepsNewl, followed by the same extension step as above: for each tuple, the movies are retrieved based on inActorl, and then the co-acting colleagues are retrieved as well. Now, consider the movie “Quantum Quest: A Cassini Space Odyssey”, in which both “Sarah M. Gellar” and “Samuel L. Jackson” were cast. The resulting tuple for these entries is (inActorl/“Sarah M. Gellar”, . . . , stepsNewl/2, moviel/“Quantum Quest: A Cassini Space Odyssey”, outActorl/“Samuel L. Jackson”, . . . , to/“Samuel L. Jackson”). As soon as this tuple has been retrieved during the evaluation of the DWQL query, the DWQL engine stops querying and returns the variable bindings obtained so far, since it has found a witness (indicated by a corresponding directive in Line 67, cf. Section 5.2). Here, although this is only the second round of the recursion, there are already more than 200 movies that need to be queried for new actors, and the “filter pushing” (introduced as a DWQL extension in Section 5.2) saves on average more than 100 queries for this query. The next iteration is invoked, now with – amongst others – the tuple (inActorl/“Samuel L. Jackson”, stepsl/2, from/“Kevin Bacon”, to/“Samuel L. Jackson”). This time, the test of the second branch (Line 44) fails, while the test of the first branch (Line 30) succeeds, since there exists some (i.e. the above) tuple whose inActorl equals the intended goal to. Then, an event is raised that contains the number of steps (Line 37) – alternatively, another action, such as sending an e-mail, could have been invoked – and execution stops. Although the query workflow above serves the intended purpose, i.e. computing the Bacon number, it does not justify its results, e.g. by returning the concrete connection as a list of actors that are connected via movies. This could also be encoded via variables in the process itself, but it would result in a harder-to-understand process, since the collection of paths is not the original intention of the query workflow and is better dealt with externally. This gives rise to the need for a graph data type that manages the results found so far and that can be queried at run time to influence the behavior of the query workflow. The next section follows this line of thought and develops a list of (further) desiderata for such a graph data type.
7.4.2. Travel Planning

This section gives an outlook on the realization of the running travel planning scenario, with a special focus on the requirements of data management during the query workflow. Assume a user wants to find either the cheapest or the shortest – in terms of total time spent travelling – route to a given location, e.g. for a conference trip, or a combination of both. Human, manual search usually employs some kind of intuitive search strategy. The basic idea is to find a combination of flights, trains, buses, and possibly ferries and taxis that leaves enough time to change from one means of transport to the other, and that satisfies the desired goal criteria best. Often, the first subtask is to consider the nearby airports, then to try to cover as much distance as possible by plane (assuming the distance is above a certain threshold), and then to bridge the remaining distance by train or bus; if this fails, backtracking is done. With the means of the presented approach, such tasks can be formulated as query workflows. The backtracking is here replaced by a kind of breadth-first search, where the search space can be pruned by applying the top-k operator to intermediate results. The search space is again explored stepwise. Instead of programming the breadth-first search explicitly in the query workflow itself, it would be beneficial if the actual search algorithm, which is to be applied on the results found so far, could be handled externally to the query workflow, allowing for a cleaner design and separation of concerns. With this respect, the ingredients for realizing the travel planning scenario are:
• heterogeneous Web data sources, such as XML databases, SPARQL endpoints (and views), Web Services, and wrapped Deep Web sources that provide information about trains, flights, required geographical information, etc. RelCCS processes allow convenient access to all these kinds of data sources by embedding queries in heterogeneous languages.
• a graph that represents the search space, consisting of places (airports, etc.) and connections.
• edges, i.e. atomic connections like a flight “FRA → LIS”, which have properties such as deptTimeLocal, arrTimeLocal (departure/arrival time wrt. the local timezone), price, and duration.
• paths, which may also have properties, e.g. deptTimeLocal, arrTimeLocal, price, and duration (now wrt. the whole path), based on inductive definitions (let p be a path ending in vertex x that can be extended by an edge (x, y), the extension denoted by p ◦ (x, y)):
  – deptTimeLocal: (p ◦ (x, y), deptTimeLocal = t), where (p, deptTimeLocal = t) holds,
  – arrTimeLocal: (p ◦ (x, y), arrTimeLocal = t), where ((x, y), arrTimeLocal = t) holds,
  – price: (p ◦ (x, y), price = p1+p2), where (p, price = p1) and ((x, y), price = p2) hold,
  – duration: (p ◦ (x, y), duration = d1+t2–t1+d2), where (p, duration = d1), (p, arrTimeLocal = t1), ((x, y), deptTimeLocal = t2), and ((x, y), duration = d2) hold; i.e. the duration of the extended path adds the waiting time t2–t1 at x and the duration d2 of the new edge to the duration d1 of p (for example, a path of duration 5h arriving at 14:00, extended by an edge departing at 16:00 with duration 2h, yields a total duration of 5h + 2h + 2h = 9h).
• conditions specifying when an edge may be used to extend a path. Such conditions are formulated in terms of the properties of the edge and of the potential paths. In this case, e.g. if (p, arrTimeLocal = t1) and ((x, y), deptTimeLocal = t2), then t1 < t2 – ∆ must hold, where in turn ∆ may be obtained by an embedded query that depends on the location and on the means of transport.
• branches that can be used to implement different alternatives, e.g. doing a case split depending on the remaining distance. Recall that after executing the branches, all results are collected again.
• a goal criterion (e.g. a weighted sum of price, duration, and the remaining distance) that can be used for pruning the search space.
The next chapter introduces the configurable graph data type (CGDT), which satisfies all the abovementioned requirements (and more) and allows for an adequate modeling of problems that have a graph-shaped search space. It facilitates a clear separation between the application-specific needs in the query workflow and the data-specific requirements in the CGDT (i.e. modeling the search space adequately).
7.5. Related Work

Two already traditional areas that are related to RelCCS are (i) query plans for relational algebra expressions, which work on the operator level and also on the choice of actual algorithms, e.g. for joins [AGM08], and (ii) conjunctive queries over homogeneous or heterogeneous sources – including HTML and XML Web sources (cf. [CM10]) – which address querying issues according to the by now classical wrapper-mediator architecture [Wie92, HGMN+97] and provide integrated views on data, but without considering process-oriented aspects of data-oriented query workflows. Since the control flow does not play a central role in these areas, they are not discussed further here. The presented related work is instead divided into two areas: Data Flow and Data Exchange, with a special focus on the concept of Tuple Spaces pioneered in the Linda system [Gel85], and Workflow and Data Flow Languages.

Data Flow and Data Exchange: Comparison to Tuple Spaces
The data flow in MARS and RelCCS shares some similarities with Tuple Spaces (in the following abbreviated as “TS”) and their variants [Gel85, WMLF98]. TS are a middleware approach for cooperation and coordination between distributed processors, in the TS context usually called agents. A TS is an unstructured collection of tuples without a fixed schema that allows for associative access: insert, read, and read with delete; updates are accomplished by removing and inserting. IBM TSpaces [WMLF98] supports four further types of queries: MatchQuery, IndexQuery,
AndQuery, and OrQuery, which all result in sets of tuples. Similarities between RelCCS and TS thus lie in the support for data exchange between autonomous, distributed processors. Also, in both approaches the data is decoupled from the programs. In TS, data can explicitly exist without being assigned to a certain agent. Communication is anonymous from the point of view of the processors – they get and put tuples from/to the TS. TS are in many aspects similar to relational databases, but they are used differently. With respect to MARS and RelCCS, the following main aspects can be discerned:
• TS: use as communication bus, not as permanent storage. This characteristic is shared with MARS/RelCCS,
• TS: unstructured set of tuples. MARS: sets of tuples that belong together,
• TS: associative access operations. Not needed by MARS and RelCCS,
• TS: generally, no predefined schema. In MARS and RelCCS, for each state, all tuples have the same schema, which changes during the processing.
So, MARS and RelCCS do not need some of the features of TS. On the other hand, a core requirement of MARS is not covered by TS: tuples in MARS are grouped into sets of tuples (the above relations over the active variable names), usually assigned to a (single) current processor, and exchanged between processors in a directed and controlled way. Moreover, the RelCCS operators require applying relational operations on the sets of tuples. Functionally, the definition of sets of tuples belonging together could be emulated in TS by an additional column c0 of the tuples. Nevertheless, TS do not efficiently support operations on such sets of tuples, like e.g. joining the result relation R of a query with the previous tuples, joins of branches of concurrent subprocesses, projection, duplicate elimination, and top-k. For MARS and RelCCS, using a relational database as “communication bus” is preferable, since the operations can easily be mapped to relational operations on database tables [May09]. Note that there is also a realization without any middleware (except plain internet communication) that stores and operates on variable bindings and exchanges data via XML (i.e. sets of tuples serialized as XML).

Workflow and Data Flow Languages
Several proposals have been presented that combine data flow with control flow. Most notably, the focus of approaches based on Petri Nets [Pet62, RR98a, RR98b] is to express workflows completely in a uniform graphical formalism with a concise formal semantics, which facilitates the application of formal analysis and verification techniques. Although the original idea dates back to the beginning of the 1960s, Petri Nets are still an actively researched area, and there exists an abundance of extensions and variations of the original formalism [GL81, OS96, vdA98, LO03, YK04, ZCCK04, TCHJ04, HKS+05, HKS+08]. Extensions of Petri Nets with nested relational structures, called Nested-Relation/Transition Nets (NR/T-nets), are investigated in [OS96]. There, a proper extension of Predicate/Transition Nets (Pr/T-nets) [GL81] – for first normal form relations, every NR/T-net can be transformed to an equivalent Pr/T-net – with a formal semantics is
For first normal form relations, every NR/T-net can be transformed to an equivalent Pr/T-net.
It allows for expressive manipulations of the nested relations, such as complex selection conditions, and insert and delete operations on the top level as well as on subinstances of the nested relations. In contrast, RelCCS is based on the (non-nested) relational data model, i.e. all variables can only range over atomic values. Nested structures can either be represented by multiplying out the relations (as is typically done in relational databases), or by binding a variable to an XML or RDF fragment. Another difference is that RelCCS has built-in support for the typical relational operators, such as grouping and negation, which is not the case in NR/T-nets. Generally, the built-in support for higher-level concepts, such as recursion, in RelCCS facilitates easier modeling of processes, whereas in NR/T-nets, control flow patterns such as concurrent execution and recursion also have to be encoded within the Petri Net formalism.

XML nets [LO03] are also based on Pr/T-nets; instead of nested relations, XML documents are used as "tokens" in the net. For easier graphical modeling, GXSL, a graphical XML schema definition language, and XManiLa, an XML document manipulation language, are proposed. In RelCCS, variables can also be bound to XML fragments (and to complete XML documents as well), but the focus is not on the graphical modeling of business processes; rather, it is on a declarative formalism with a clear formal semantics for specifying the data flow of query workflows.

Dataflow nets – workflow nets [vdA98] that contain no cycles – are proposed in [HKS+05, HKS+08]. They serve as the basis for a "dataflow" language which, like [OS96], adopts the nested relational calculus (NRC) to define the tokens in the dataflow nets. The type system uses the typical set of primitive data types (e.g. integers, strings, etc.) but also includes XML documents. One notable restriction of their language is that sets are the only supported collection type. The transitions in the dataflow net can either be the NRC-inherent operators, extended with two special nest and unnest transitions that allow iterating over all entries of a set, or extension transitions that can be implemented as user-defined functions (in their setting, typically external Web Services from the bioinformatics domain that process the input tokens and produce new output tokens). Since only controlled iteration (using the unnest transition) is possible and no cycles are allowed, termination can be guaranteed for all data flows. Additionally, the soundness of data flows – initiated with a single token in the single input node, the data flow will eventually terminate with a single token in the output node – is guaranteed if they are constructed in a structured, hierarchical way by adhering to their proposed refinement rules. RelCCS also borrows its operators from the relational world; more complex operators, such as top-k, could be implemented as user-defined functions in Dataflow nets. The comparison with the NRC data model has already been discussed for [OS96]. Overall, the major difference is that RelCCS is more expressive, since it supports recursion and can thus, e.g., compute the transitive closure, which is not possible in dataflow nets that only support controlled iteration and must not have cycles.

Petri Nets have also been studied in the area of Web Service composition and orchestration [YK04, ZCCK04, TCHJ04]: [YK04] proposes to use Colored Petri Nets (CP-nets)
[Jen96] for the specification and verification of the composition of WSDL-based Web Services, which can later be automatically converted into WS-BPEL [AAA+07]; [ZCCK04] proposes WS-nets (which are also based on CP-nets) that describe each Web Service component in three layers – the focus here is on verification and monitoring; lastly, [TCHJ04] extends Petri Nets with additional aspects, e.g. time, resources, and taxonomies, and also allows for folding subnets into transitions for describing more complex workflows. Although not the focus of this thesis, RelCCS (and the MARS framework) could also be used for the composition of Web Services; it already has the concept of actions, and rich support for different control flow patterns, such as concurrent execution or alternative execution. Generally, Petri nets (and Petri net-based formalisms) are, like RelCCS, process-oriented. Yet, while RelCCS is based on a set of operators, in Petri Nets, control flow patterns such as concurrent execution and recursion also have to be encoded within the Petri Net formalism.

The evaluation and comparison of different workflow products should ideally be based on a common, agreed-upon set of desiderata. The survey in [vdAtHKB03], inspired by the work of Gamma et al. on software design patterns [GHJV95], defines for this task 20 control flow patterns based on the experience of the authors. They claim that the control flow perspective is the essential aspect of a workflow specification, and the data flow "rests" on it. This view is corroborated by a follow-up publication [RtHEvdA05], where 40 data flow patterns are presented, which can be divided into four distinct groups: data visibility, data interaction, data transfer, and data-based routing, i.e. data flow is mainly considered in the context of the control flow. Data flow is a first-class citizen in RelCCS, and even more so in its use for specifying query workflows, where the objective of the process is to collect and combine data from different Web data sources; consequently, a comparison to these patterns is of limited interest and is omitted.

Nonetheless, the workflow patterns of [vdAtHKB03] have led to the specification of Yet Another Workflow Language (YAWL) [vdAADtH04, vdAtH05]. A workflow in YAWL is a set of so-called extended workflow nets (EWF-nets) that are organized in a tree-like structure. The basic constituents of YAWL workflows are tasks, which can be either atomic or composite. Composite tasks are realized as references to unique EWF-nets at a lower level in the tree, and one EWF-net is at the root of the workflow tree. In RelCCS terminology, a composite task would be a process definition, with the constraint that it cannot be called from within its own definition (thus there is no support for recursive processes in YAWL). The semantics of YAWL is given as a transition system, although the definition of the language started from high-level Petri Nets (Petri Nets extended with colour, time, and hierarchy [vdAvH02]). The data model of YAWL can handle "arbitrary complex data" [vdAADtH04] and "completely relies on XML-based standards like XPath and XQuery" [vdAADtH04]. There is support for variables that can be passed to composite tasks, as well as for receiving result values from composite tasks. In [vdAtH05], the notion of suitability is also introduced as a measure of how easily a real-world scenario can be
modeled, in contrast to the more classical notion of expressiveness. In this respect, it can be said that RelCCS is more suitable for modeling data-oriented scenarios, with its built-in support for expressive relational operators (including top-k) and its direct support for recursion, which needs to be simulated in YAWL (which is (also) Turing-complete).

The Small Workflow Language (SMAWL) [Ste05] has also been designed with the van der Aalst workflow patterns [vdAtHKB03] in mind, and CCS [Mil83] is used as the formal basis of this language as well. In fact, SMAWL is translatable into CCS using only source code transformations. However, in contrast to RelCCS, SMAWL only considers the control flow of workflows and delegates the data flow to an "external" language.

Logic-based workflow languages [BK94, vdAP06, RK07] are another paradigm, which can draw on a large body of existing theoretical research results. In Transaction Logic (TR) [BK94] and Concurrent Transaction Logic (CTR) [RK07], the description of a workflow consists of rules that make use of temporal connectives instead of just the Datalog conjunction. The semantics of TR is inherently set-valued. Such rules can be formulated over embedded literals/atoms (called elementary transitions) that are not part of Transaction Logic, but are contributed externally. This is similar to the embedding of the CGDT in RelCCS. The Declarative Service Flow Language (DecSerFlow) [vdAP06] is based on Linear Temporal Logic (LTL) [GH01, HR01, HR02]. DecSerFlow has been developed on the assumption that modeling some workflow patterns with existing procedural workflow languages may result in an over-specification of processes. As a solution, DecSerFlow relies on LTL for its declarative style, and provides a graphical language to specify the control flow (which is then mapped to LTL) to ease modeling for non-expert users; support for data is envisioned for routing purposes. In comparison, RelCCS is a declarative data flow language which, given its concise definition as a set of atomic constituents and combiners (operators), can also be directly mapped into a graphical representation.

Where classical workflow systems have mainly been (and still are) designed for supporting elaborate workflow patterns, the focus of Scientific Workflows [ABJ+04, HKS+05, SKDN05, LAB+06, OGA+06, CGH+06, BvH07, HKS+08, HBW09, ZBML09, TCS+09] is on the data flow and on the management of large and intricate data. Kepler [ABJ+04, LAB+06] is a scientific workflow system built upon the Ptolemy II system [BHLM02]. It reuses the director and actor paradigm (the director controls the execution model; actors are the basic workflow steps, e.g. data sources) and extends it with features that facilitate, among other things, the easy prototyping of workflows and the use of distributed computing resources, e.g. in a Grid [FK04] environment, in a distributed scientific workflow. By separating the execution model into the director component, different execution semantics can be employed. The semantics of RelCCS is static in the sense that it cannot be replaced on demand; yet, since the RelCCS semantics has been tailored specifically to its intended use, this does not impose any restrictions on its use in query workflows. The analogue of a Kepler actor would be a Web data source view in this thesis. The major difference is that Kepler manages the flow of data between different actors, while RelCCS internally
operates on a relational data model, which allows for powerful utilization of the data on the language level itself, e.g. for computing aggregates over groups.

Triana [CGH+06] is a graphical problem solving environment (PSE) that distinguishes three different types of workflows: serial scientific workflows, job submission workflows, and monitoring workflows. Its workflow model can be represented as a directed cyclic graph, i.e. cyclic connections are allowed, where nodes represent components (basic units of work, defined with an XML syntax that has close ties to WSDL [CCMW01, CMRW07, CHL+07]) and edges represent the connections between them. A notable design decision is that Triana has no explicit support for control constructs, i.e. loops and branched execution are handled by specialized components. Consequently, there are "pluggable language readers" that allow specifying the workflow in external workflow languages, such as WS-BPEL [AAA+07]. The execution of workflows itself is decentralized, and parts of the workflow can be executed on different machines. The current implementation of RelCCS is centralized in the sense that the execution of the query workflows itself is done on one machine, which has proven to be adequate so far. However, given the tree-like nature of the language, (operator) subtrees could easily be evaluated on different machines operating on a global state (assuming globally synchronized access to the data of the query workflow). Also, where Triana has no built-in control constructs (and thus no associated semantics either), RelCCS consists of a set of well-defined operators with a concise semantics.

Calvin [HBW09] is a workflow system geared towards casual users who have no extensive background in computer science. The workflow models in Calvin are trees, where all nodes in the tree represent BioMoby [WL02] activities. This intermediate workflow model is then automatically transformed into WS-BPEL [AAA+07]; as a consequence of the simplicity of the workflow model, the mapping is not bidirectional. More complex workflows can be realized by resorting to modeling the workflow directly in WS-BPEL, or by extending the automatically transformed workflow. In contrast, RelCCS is intended to be used by domain experts in combination with process designers to ensure accurate modeling, without constraining the expressive power of the language.

X-CSR ("X-Scissor") [ZBML09] presents an optimization technique for minimizing the amount of data that needs to be shipped to the actual components that operate on the data. This is similar in spirit to the way the Generic Request Handler (GRH) first asks the respective processor for the relevant input variables, then initially projects on these, and afterwards joins the results again to obtain the new active variables (cf. Section 3.4 and Section 5.3.2). While in X-CSR this is done at compile time as an optimization step, in MARS/RelCCS it is done at run time during the execution of the query workflow.

Workflow-as-a-Service on top of Taverna [OGA+06] workflows is proposed in [TCS+09] as a means for easy reuse of workflows and improved execution performance – among other benefits – in a Grid environment. Process definitions can be used for a similar purpose in RelCCS; they offer a clear interface and are syntactically represented as XML subtrees (cf. Section 7.4.1). Dataflow nets [HKS+05, HKS+08] have already been discussed above in the context of Petri Net-based formalisms.
Lastly, in [SKDN05] the authors argue for a tight coupling of database and workflow management systems and introduce the notion of active tables, which are associated with programs, and active views, which represent the workflows. The introduction of the Configurable Graph Data Type (CGDT) in Chapter 8 will allow for a clear separation of data management issues (in the graph) and the exploration of the search space (in RelCCS), while the management of RelCCS' variable bindings in a database has been presented in [May09]. For a more comprehensive survey of existing scientific workflow approaches, the interested reader is referred to [BvH07].

Workflow management systems and related technologies have also found great resonance in industry standards and commercial systems. WS-BPEL [AAA+07] is the de facto standard for Web Service composition over WSDL-style services. The data flow in WS-BPEL is described by variables, which can – using appropriate database products like, e.g., IBM WebSphere® (a trademark of IBM Corp.) – reference database tables, and thus be made set-valued. Datatypes like the CGDT can also be embedded into WS-BPEL processes. In [VSS+07], optimization strategies of such approaches are discussed, and [VSRM08] gives a general overview of SQL support in commercial workflow systems. Since the standard lacks a "formal" semantics and is only informally defined alongside its XML serialization, it is no surprise that different attempts have been made to provide it with a more formal grounding [SBS04, BP06, OVvdA+07]. The idea underlying all these attempts is to provide a mapping to an existing formalism: in [OVvdA+07] these are Petri Nets, in [BP06] a mapping to YAWL [vdAADtH04, vdAtH05] is proposed, and in [SBS04] the "target" language is CCS [Mil83]. The suitability of WS-BPEL for modeling a collaborative procurement process is studied in [JFGK03], while [WvdADtH03] provides a comparison of WS-BPEL with other industry standards for Web Service composition and orchestration – there, WS-BPEL is also found to be Turing-complete under the assumption of perfect technology, e.g. unlimited storage. RelCCS has a clear, formal semantics; and in contrast to WS-BPEL, where XML is used merely as a serialization format, in RelCCS/MARS the XML markup carries important language information that enables the processing of embedded language fragments.

While the main focus of WS-BPEL is on specifying executable processes, the motivation behind the Business Process Modeling Notation (BPMN, http://www.bpmn.org) is the graphical modeling of processes; it borrows ideas from the former industry process exchange standard, the XML Process Definition Language (XPDL, http://www.wfmc.org/xpdl.html) (see [HKM06] for a comparison of XPDL and WS-BPEL). An analysis of its formal semantics can be found in [DDO08], where BPMN is mapped to Petri Nets. The Enterprise Mashup Markup Language (EMML, http://www.openmashup.org/omadocs/v1.0) is (yet another) emerging industry standard governing the "development, interoperability and compatibility of enterprise mashups", i.e. combinations of different services that are already available inside a company. Similarly to WS-BPEL, its semantics is mainly
provided informally alongside its XML serialization, and its data model supports simple and complex types (objects).

Furthermore, systems for data-oriented workflows in general can be applied to query answering tasks, which usually have a set-oriented dataflow. The Lixto Suite [GKB+04] is an integrated system for implementing data-oriented workflows with a focus on data acquisition and integration. Its process model is less explicit, and the workflows are built solely upon Lixto's own modules.

Typically, different formalisms that try to solve a similar problem are related to each other in terms of what they can encode, which is usually referred to as their expressive power. Since most of the formalisms/standards in this section are Turing-complete, i.e. can encode anything, this notion is not sufficient for appropriately distinguishing them. Thus, the notion of suitability introduced in [vdAtH05] has proven to be the more adequate distinguishing factor in this section. In this respect, the advantage of RelCCS is that it provides the primitives for both control structures and data flow on the same level of the language. A further feature of the language is provided by its embedding in the MARS meta model: RelCCS fragments can be used, e.g., as the action part in MARS' ECA rules, and fragments in other languages for specifying complex events, queries, and actions can be embedded in RelCCS processes without having to resort to Web Services as intermediate wrappers.
Chapter 8. Support for Graph-Structured Domains

Elegance is not a dispensable luxury but a factor that decides between success and failure.
– Edsger Wybe Dijkstra (May 11, 1930 – August 6, 2002)
Contents
8.1. Introduction
8.2. Conceptual Overview
8.3. Signature and Operations
     8.3.1. Data Definition Language (DDL)
     8.3.2. Data Manipulation Language (DML)
8.4. Configurability of the Exploration Process
8.5. Technical Realization
     8.5.1. Implementation Details
     8.5.2. CGDT as MARS Action and Query Language
8.6. Use Case: Bacon Numbers + CGDT
8.7. Related Work
8.1. Introduction

A recurring motif when designing query workflows is the computation of (parts of) transitive closures in graphs. Graph algorithms in general are a traditional research topic; they usually assume a given graph, and the focus is on employing additional suitable data structures for efficient algorithms. In the context of the Web, the focus is more on online algorithms [Alb03, EGI99]. There, the graph is neither known nor materialized a priori to run algorithms on it, but is explored only at runtime, using one or more Web data sources. Often, as in the case of the travel planning application scenario, even the graph data
itself is dynamic, which does not allow for materialization or caching. These characteristics require completely different algorithms, where the exploration and expansion strategy for the graph itself is the central issue. Most algorithms basically follow best-first search like A∗ [RN03], breadth-first search, or depth-first search for exploration.

This chapter presents the Configurable Graph Data Type (CGDT), which provides an ontology and an API – realized as a MARS query and action language – for configurable graphs. The design of the CGDT combines generic graph behavior (insertion of edges, creation of paths, support for algorithms, etc.) with application-specific configurability. The CGDT allows the maintenance of the stored graph data to be encoded inside the graph by (i) assigning properties (in the remainder also referred to as attributes) to vertices, edges, and paths, and (ii) specifying how paths are obtained from existing edges and paths during the exploration process. This separates the (also declarative) specification of the actual exploration process, and the basic acquisition of data from the Web, from the maintenance of the graph itself. The declarative specification of query workflows can be done in RelCCS (cf. Chapter 7), where the CGDT can be embedded as an external data type via the MARS support for heterogeneous languages.

Comparison to Classical Graph Algorithms

At first sight, the employment of classical graph algorithms, such as the minimum spanning tree algorithms of Prim [Pri57, CLR89] and Kruskal [Kru56, CLR89], or shortest-path algorithms like Dijkstra's [CLR89], seems like a good fit for scenarios like travel planning. Yet, a more detailed analysis shows that even under some optimistic assumptions, this would not be an appropriate solution:

• dynamics: the complete graph is not available, and it is continuously changing, e.g. the availability of flights and their prices changes every moment,
• constraints: the paths are further constrained by the requirement that the departure time must be after the arrival time at intermediate airports (cf. Section 7.4.2),
• completeness: the airline connections' graph (which is, neglecting the availability issue, of a size that could efficiently be processed by breadth-first or A∗ search) is not sufficient. Additionally, the connections between airports and the final destination must be considered. Thus, finding the solution depends on further information, since below a certain remaining distance the process continues outside this graph.

Generalization

The above considerations show that in such cases, a large search space has to be explored, and application-specific properties of the paths, like price and duration, have to be maintained incrementally. The stepwise exploration corresponds to inductive characterizations of these properties, which are in fact inherent to the idea of properties of paths in a graph. The CGDT ontology provides generic notions to specify how this information is combined from the actual input (i.e., information about edges obtained from Web data sources). The relevant features of the actual domain ontology are mapped to and expressed in terms of the generic ontology. Overall, the CGDT supports the following generic functionality:
• materializing the relevant graph fragment (including the inductively defined properties) based on the explored edges,
• creating paths according to specified criteria,
• accessing those recently added vertices that serve for continuing the exploration,
• finally, querying the result graph.

The design of the domain-specific workflow can then be separated into three issues:

1. describe the domain-specific characteristics of the graph in terms of the CGDT ontology. This consists of the basic schema of the graph, the constructive specification of how to extend the graph with relevant newly obtained knowledge, and constraints determining when newly obtained knowledge is relevant for expanding the graph,

2. the actual acquisition of the data from the Web (including Deep Web sources). This means identifying appropriate data sources and encoding the access to them. Potentially, two or more sources must be accessed for each step – for instance, one to identify potential edges, e.g. which airports can be reached from a given one, and a second one to query for the actual existence of the edges, e.g. the actual availability and departure/arrival times of that connection for a given date,

3. fill in a common breadth-first-search or A∗-search workflow pattern as a RelCCS process with case splits and Web queries. After configuring the graph once in the course of the initialization, the process will only submit edges to the graph and query it for the vertices where the exploration should be continued. The compilation of the information in the graph itself, and the choice of the vertices for the next step, are done automatically by the graph.

Chapter Structure

The next section takes up the identified general requirements and gives a conceptual overview of the CGDT. Section 8.3 then presents the Data Definition Language (DDL) and Data Manipulation Language (DML) of the CGDT. In Section 8.4, generic notions are added to specify how the graph develops during the evaluation of an online algorithm. Afterwards, Section 8.5 discusses implementation-specific issues and the embedding of the CGDT in MARS as a query and action language. Finally, Section 8.6 improves upon the Kevin Bacon numbers use case of Section 7.4.1 by leveraging an instance of the CGDT for keeping track of the explicit connections between actors, and Section 8.7 concludes the chapter with an overview of related work.
8.2. Conceptual Overview

The basic notions of any graph ontology are vertices, edges, and paths. In the following, only directed, labeled graphs of the form G := (V, E, P) are considered, where V is the set of vertices, E ⊆ V × V is the set of directed edges between these vertices, and P is a set of paths. While in the usual notion of graphs, the set of paths is defined as the transitive closure of the edges (i.e. the set of paths is {(v1, ..., vn) | (v1, v2), ..., (vn−1, vn) ∈ E}), the
set P of relevant paths in a configurable graph is a certain subset of all existing paths in the graph that satisfy additional constraints. Nevertheless, each path p ∈ P is a path in the traditional sense, consisting of multiple connected edges. A path p that ends in a vertex x can be extended by an edge (x, y), denoted by p ◦ (x, y). The set P will contain paths that are obtained by such extension steps according to configurable criteria. Note that paths are handled internally by the CGDT instance; each time a set of new edges is added (the CGDT operates in a set-oriented manner), the graph instance computes a "fixpoint" of all relevant, newly reachable paths.

Properties

A central feature of the CGDT is that vertices, edges, and paths can be adorned with sets VP, EP, and PP of (typed) properties. Each vertex property vp is associated with a literal type type(vp), taken from the XML Schema datatypes [PGM+09] (e.g. string, numeric, or date), and can thus be seen as a mapping vp : V → type(vp) that assigns a value of the given type to each vertex; edge and path properties are defined analogously. The properties can optionally be specified in terms of view definitions over other properties, or by external queries. For instance, given a vertex with its airport code, the timezone can be obtained by querying a Web data view. In the travel planning scenario, the distance of a flight from A to B is given by the geographical distance between A's and B's coordinates, and the price of a path is the sum of the prices of its edges. Note that the duration of a direct flight from A to B is the difference between the local times at A and B (which are properties of the edge) plus the difference of the time zones (which are properties of the vertices), while the duration of a connected flight is not the sum of the durations of the individual edges, but is also calculated from the total departure and arrival times (including the connecting times). As this example indicates, a powerful mechanism for accessing the relevant properties at instantiation time of new vertices, edges, and paths is required. This is accomplished by providing XML views on the context vertices, edges, or paths, as discussed in Section 8.3.1.

For properties of vertices and edges where no definition is given, the value must be provided when adding the edge to the graph. Often, vertices are added only with their key (when found by exploring edges), and their additional properties must be obtained by external queries that are executed automatically when the vertex is inserted. As paths are not inserted manually, but only by extending an existing path with an edge, all path properties must be derived properties. Here, an inductive definition over the length of the paths is often used.

Constraints

The CGDT allows for fine-grained control over which vertices, edges, and paths are added to the graph. On a "local" level, conditions can be specified for vertices, edges, and paths that are checked after all properties have been initialized; the respective vertex, edge, or path is only added if all these conditions are satisfied. They are checked for each single vertex, edge, or path in isolation.

The CGDT also supports the comparison of multiple paths, where it can be checked
whether paths subsume each other, in order to avoid adding redundant paths to the graph. The actual comparison comprises two steps: first, it is checked for each pair of paths whether they are comparable (e.g. by recursively checking whether they encounter the same vertices), and only if they are comparable (i.e. the first test is satisfied) is it determined in a second step whether one path subsumes the other. After the subsumption has been determined for all (comparable) paths, only the significant paths that are not subsumed are added to the graph. The conditions on single vertices, edges, and paths can be configured using the respective vertex, edge, or path properties. The path subsumption is also configurable; both types of constraints are discussed further in Section 8.4.
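The pairwise two-step comparison can be pictured as in the following minimal Python sketch; here, same_route and path_prunable stand in for the configurable comparison functions (in the CGDT, XQuery functions, cf. Section 8.4), and representing the candidate paths as a plain list is an illustrative assumption:

    def significant_paths(paths, same_route, path_prunable):
        """Keep only paths that are not subsumed by a comparable path.

        same_route(p, q):    symmetric test whether p and q are comparable.
        path_prunable(p, q): True if p is subsumed (made redundant) by q.
        """
        keep = [True] * len(paths)
        for i in range(len(paths)):
            for j in range(i + 1, len(paths)):     # each unordered pair once
                # Step 1: only comparable paths enter the subsumption test.
                if not same_route(paths[i], paths[j]):
                    continue
                # Step 2: determine subsumption; if two paths mutually
                # subsume each other, only one of them is dropped.
                if path_prunable(paths[i], paths[j]):
                    keep[i] = False
                elif path_prunable(paths[j], paths[i]):
                    keep[j] = False
        return [p for p, k in zip(paths, keep) if k]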
Intensional Edges

While edges usually need to be added to the graph explicitly, the CGDT also supports the concept of intensional edges, i.e. edges that are (theoretically) always available between two specific vertices. Consequently, this type of edge is handled by the graph directly. Intensional edges are registered with a graph instance once, with an associated precondition (over the properties of the intensional edge) that governs under which circumstances they can be applied. Then, each time a new path is determined that ends in a source vertex of an intensional edge, the graph itself automatically determines, based on the intensional edge definition, whether an additional edge needs to be added to the graph. Intensional edges are dealt with in Section 8.3.2.
Exploration Strategy

The CGDT allows for the declarative specification of the exploration strategy. One aspect here is that the number of desired result paths can be configured, along with a specification of the result paths. By providing an additional goal criterion, which can be stated over all properties of an intermediate or result path, the CGDT instance can rank all paths available so far and determine the "cheapest" (with respect to the goal criterion) open path, which enables an A∗-like exploration of the search space. A path (v1, ..., vn) is considered open when the "destination" vertex vn has not been explored before. Another implemented option is based on breadth-first search, which returns the set of all reachable vertices that have not been explored so far.
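Selecting the next vertex under a goal criterion can be sketched as follows in Python; the representation of paths as dicts with a "to" entry, and the function name, are illustrative assumptions, not the CGDT's actual interface:

    def next_vertex_a_star(paths, goal_criterion, explored):
        """Return the end vertex of the cheapest open path (A*-like step).

        paths:          all paths materialized so far
        goal_criterion: maps a path to a numeric cost (lower is better)
        explored:       set of vertex ids that have already been expanded
        """
        # A path is "open" if its destination vertex is still unexplored.
        open_paths = [p for p in paths if p["to"] not in explored]
        if not open_paths:
            return None
        best = min(open_paths, key=goal_criterion)  # rank by goal criterion
        explored.add(best["to"])
        return best["to"]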
8.3. Signature and Operations

The operations of the CGDT are divided into a Data Definition Language (DDL) (introduced in Section 8.3.1), where the properties and the constraints are defined, and a Data Manipulation Language (DML) (introduced in Section 8.3.2), which provides generic operations for updating and querying the graph that are used during the actual process of exploration.
8.3.1. Data Definition Language (DDL)

While in SQL and related languages the DDL has its own syntax, the DDL of the CGDT is actually the ontology language RDF [MM04], which declaratively specifies which properties exist, together with their definitions, and the constraints governing how to expand the graph. In contrast to SQL, where the main notion of the schema is the table, the CGDT is based on three subschemas, i.e. a VertexSchema, an EdgeSchema, and a PathSchema. Each of them defines some properties (i.e. VertexProperties, EdgeProperties, and PathProperties) and optionally some constraints (to be discussed in Section 8.4) that guide the exploration process. Each of the subschemas can be regarded (and stored) as a table (as will be illustrated later in Section 8.6). The notions of the generic graph ontology itself (i.e. the DDL notions) are depicted in UML [Gog09b] in Figure 8.1.

Figure 8.1: Basic Notions of the CGDT Ontology
[UML class diagram: a cgdt:Graph has a cgdt:Schema, which is specialized into cgdt:VertexSchema, cgdt:EdgeSchema, and cgdt:PathSchema. Each schema has, via cgdt:hasProperty, any number (*) of cgdt:Property instances (with cgdt:label and cgdt:range), each of which optionally (0,1) carries a cgdt:definition given as a cgdt:Expression (with cgdt:language and cgdt:specification). Via cgdt:hasV.I.C, cgdt:hasE.I.C, and cgdt:hasP.E.C, the schemas carry any number (*) of cgdt:VertexInsertionConditions, cgdt:EdgeInsertionConditions, and cgdt:PathExtensionConditions, respectively, all of which are cgdt:Conditions; conditions and definitions are expressed based on properties.]
The three subschemas contain some mandatory, built-in properties:

• vertex schema: id serves as key,
• edge schema: id (key, internally generated and used), from, and to, referring to the vertices x and y for an edge (x, y). Note that from and to are not keys, in order to allow different edges between the same vertices (e.g. several flights on the same day),
• path schema: id (key, internally generated and used), from, to, front, and last, referring to a, y, p, and (x, y), respectively (the latter two via the ids of a path and an edge), for the path p ◦ (x, y), where a is the first vertex of p.
A concrete application-specific CGDT specification then defines:
• the names and datatypes of the additional application-specific properties of each subschema,
• the definitions of the derived properties,
• conditions to configure the exploration process (cf. Section 8.4).

XML Views

The definition of derived properties, as well as of conditions, relies on an XML "view" of the affected vertices, edges, and paths. Here, two parts can be discerned: a generic part, which reflects the built-in properties and the structural aspects of vertices, edges, and paths, and an application-specific (or configured) part, which is based on the additional properties of the concrete CGDT specification. The generic part of XML views on paths is captured by this nearly-DTD (note that this nearly-DTD specifies the generic part of edges and vertices as well):
    <!ELEMENT cgdt:path     (cgdt:front, cgdt:last, cgdt:pathFrom, cgdt:pathTo)>
    <!ELEMENT cgdt:front    (cgdt:path?)>
    <!ELEMENT cgdt:last     (cgdt:edge)>
    <!ELEMENT cgdt:pathFrom (#PCDATA)>
    <!ELEMENT cgdt:pathTo   (#PCDATA)>
    <!ELEMENT cgdt:edge     (cgdt:from, cgdt:to)>
    <!ELEMENT cgdt:from     (cgdt:vertex)>
    <!ELEMENT cgdt:to       (cgdt:vertex)>
    <!ELEMENT cgdt:vertex   (cgdt:id)>
    <!ELEMENT cgdt:id       (#PCDATA)>
XML views are provided on single paths, edges, and vertices (i.e. the complete graph is not serialized). Therefore, it is sufficient to consider only the edges and vertices encountered on the specific path, which is done here in a recursive fashion. The dynamic part of the XML view depends on the actual configuration of the graph: every (vertex, edge, or path) property results in an XML element that is a child element of the respective parent (i.e. vertex, edge, or path) element. Note that cgdt:pathFrom and cgdt:pathTo are represented here as the string representations of the corresponding vertex ids.

This is an example of the XML view on a path that starts in "Göttingen" (identified as "GOE") and ends in "Paris Charles de Gaulle" (identified as "CDG"):
    <cgdt:path xmlns:cgdt="http://www.semwebtech.org/languages/2008/cgdt#"
               xmlns:tg="foo://bla/thetravelgraph">
      <cgdt:front>
        <cgdt:path>
          <cgdt:front/>
          <cgdt:last>
            <cgdt:edge>
              <cgdt:from>
                <cgdt:vertex>
                  <cgdt:id>GOE</cgdt:id>
                  <tg:timezone>1.0</tg:timezone>
                </cgdt:vertex>
              </cgdt:from>
              <cgdt:to>
                <cgdt:vertex>
                  <cgdt:id>FRA</cgdt:id>
                  <tg:timezone>1.0</tg:timezone>
                </cgdt:vertex>
              </cgdt:to>
              <tg:dept>2009-10-27T07:17:00</tg:dept>
              <tg:arr>2009-10-27T09:00:00</tg:arr>
              <tg:code>Deutsche Bahn (DB)</tg:code>
              <tg:duration>PT1H43M</tg:duration>
              <tg:price>45.0</tg:price>
            </cgdt:edge>
          </cgdt:last>
          <cgdt:pathFrom>GOE</cgdt:pathFrom>
          <cgdt:pathTo>FRA</cgdt:pathTo>
          <tg:dept>2009-10-27T07:17:00</tg:dept>
          <tg:arr>2009-10-27T09:00:00</tg:arr>
          <tg:duration>PT1H43M</tg:duration>
          <tg:price>45.0</tg:price>
        </cgdt:path>
      </cgdt:front>
      <cgdt:last>
        <cgdt:edge>
          <cgdt:from>
            <cgdt:vertex>
              <cgdt:id>FRA</cgdt:id>
              <tg:timezone>1.0</tg:timezone>
            </cgdt:vertex>
          </cgdt:from>
          <cgdt:to>
            <cgdt:vertex>
              <cgdt:id>CDG</cgdt:id>
              <tg:timezone>1.0</tg:timezone>
            </cgdt:vertex>
          </cgdt:to>
          <tg:dept>2009-10-27T10:15:00</tg:dept>
          <tg:arr>2009-10-27T13:55:00</tg:arr>
          <tg:code>LX</tg:code>
          <tg:duration>PT3H40M</tg:duration>
          <tg:price>86.69</tg:price>
        </cgdt:edge>
      </cgdt:last>
      <cgdt:pathFrom>GOE</cgdt:pathFrom>
      <cgdt:pathTo>CDG</cgdt:pathTo>
      <tg:dept>2009-10-27T07:17:00</tg:dept>
      <tg:arr>2009-10-27T13:55:00</tg:arr>
      <tg:duration>PT6H38M</tg:duration>
      <tg:price>131.69</tg:price>
    </cgdt:path>

The above XML view is used (in parts) in the remainder to illustrate the definition of derived properties and conditions. Note that it also embeds XML views of edges and vertices.

Derived Properties

Derived properties are either views over other properties, or are initialized via Web data sources. For accessing Web data sources, the mechanisms discussed so far are used. Properties of paths are often defined inductively. For these, the specification of the base case (which is an edge, and thus builds upon the edge's properties) and of the inductive step (potentially using the path and the extending edge) have to be given. Instead of giving an inductive definition, path properties can optionally be specified to be SumProperties, CountProperties, or {Min|Max}Properties, which are defined as the aggregation of the values of a specified edge property. Example 43 demonstrates the definition of derived properties in pseudo syntax to facilitate an easier understanding of the basic ideas behind derived properties, while Example 44 explains the XML view-based version for one of these properties.

Example 43. Consider a concrete instantiation of the CGDT tailored to the travel application scenario, where the vertices (which are the train stations and airports) have two properties, i.e. the id (which is, e.g., the airport code) and the timezone. The timezone is defined by a Web data view

    (timezone) ← getTimezone(id)

that is evaluated dynamically for each newly inserted vertex, based on its id, if it has not been provided (explicitly), i.e. in the case of train stations. Edges, which are the direct connections, e.g. "FRA → CDG" (Frankfurt to Paris Charles de Gaulle), have the domain-specific properties code (the flight number), dept, arr (departure and arrival time wrt. the local timezone), and price. The duration is a derived property:
    duration = arr – dept + from.timezone – to.timezone
The properties of the paths – from, to, dept, arr, price, and duration – are defined inductively. For the base case, where a path is just a single edge, they have the same values as for the edge. For paths of length > 1, they are defined as follows:

    from     = front.from (built-in)          dept = front.dept
    to       = last.to (built-in)             arr  = last.arr
    price    = front.price + last.price
               or equivalently as a SumProperty: price = sum[e:edge](e.price)
               (= the sum of the prices of all edges of the path)
    duration = front.duration + last.duration + last.dept – front.arr
               which equals last.arr – front.dept + from.timezone – to.timezone
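Read operationally, these definitions can be evaluated along the front/last structure of a path. The following minimal Python sketch illustrates this for price and duration; representing a path as a dict with already evaluated properties, and encoding times as decimal hours in a single timezone (so the timezone correction can be omitted), are simplifying assumptions for illustration:

    def eval_path_properties(front, last):
        """Compute the derived properties of the path front ◦ last.

        front: the prefix path (a dict with already evaluated properties),
               or None for the base case (path of length 1)
        last:  the extending edge with its evaluated properties
        """
        if front is None:                 # base case: same values as the edge
            return dict(last)
        return {                          # inductive step
            "from": front["from"], "to": last["to"],
            "dept": front["dept"], "arr": last["arr"],
            # price is a SumProperty over the edges of the path
            "price": front["price"] + last["price"],
            # duration includes the connecting time, hence it equals
            # last.arr - front.dept (timezone correction omitted here)
            "duration": last["arr"] - front["dept"],
        }

    # GOE -> FRA by train, then FRA -> CDG by plane (same timezone):
    e1 = {"from": "GOE", "to": "FRA", "dept": 7.28, "arr": 9.00,
          "price": 45.00, "duration": 1.72}
    e2 = {"from": "FRA", "to": "CDG", "dept": 10.25, "arr": 13.92,
          "price": 86.69, "duration": 3.67}
    p = eval_path_properties(eval_path_properties(None, e1), e2)
    print(p["price"], round(p["duration"], 2))   # 131.69 6.64 (≈ PT6H38M)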
Example 44. The XML view of the edge from "Frankfurt" (identified by "FRA") to "Paris Charles de Gaulle", taken before the derived properties are evaluated, already contains all information that has been provided so far:
    <cgdt:edge xmlns:cgdt="http://www.semwebtech.org/languages/2008/cgdt#"
               xmlns:tg="foo://bla/thetravelgraph">
      <cgdt:from>
        <cgdt:vertex>
          <cgdt:id>FRA</cgdt:id>
          <tg:timezone>1.0</tg:timezone>
        </cgdt:vertex>
      </cgdt:from>
      <cgdt:to>
        <cgdt:vertex>
          <cgdt:id>CDG</cgdt:id>
          <tg:timezone>1.0</tg:timezone>
        </cgdt:vertex>
      </cgdt:to>
      <tg:dept>2009-10-27T10:15:00</tg:dept>
      <tg:arr>2009-10-27T13:55:00</tg:arr>
      <tg:price>86.69</tg:price>
      <tg:code>LX</tg:code>
    </cgdt:edge>
Note that at this point, all derived properties of the from and to vertices have been evaluated (e.g. the timezones of both Frankfurt and Paris Charles de Gaulle are initialized) and are available during the evaluation phase of the derived edge properties. Now, the edge duration that was defined in Example 43 as

    duration = arr – dept + from.timezone – to.timezone
translates into this XQuery expression (namespace declarations have been omitted):
     1  let $arr := xs:dateTime(//cgdt:edge/tg:arr)
     2  let $dept := xs:dateTime(//cgdt:edge/tg:dept)
     3  let $fromTZ :=
     4    if (//cgdt:from//tg:timezone < 0) then
     5      xs:dayTimeDuration(
     6        fn:concat("-PT", fn:abs(fn:round(//cgdt:from//tg:timezone)), "H"))
     7    else
     8      xs:dayTimeDuration(
     9        fn:concat("PT", fn:abs(fn:round(//cgdt:from//tg:timezone)), "H"))
    10  let $toTZ :=
    11    if (//cgdt:to//tg:timezone < 0) then
    12      xs:dayTimeDuration(
    13        fn:concat("-PT", fn:abs(fn:round(//cgdt:to//tg:timezone)), "H"))
    14    else
    15      xs:dayTimeDuration(
    16        fn:concat("PT", fn:abs(fn:round(//cgdt:to//tg:timezone)), "H"))
    17  let $duration := $arr - $dept + $fromTZ - $toTZ
    18  return $duration
After the initialization of the context variables from the XML view (Lines 1 – 16), the actual computation of the duration (Line 17) is identical to the version in pseudo code. Recall that the element names of vertex, edge, and path properties may coincide; e.g. in the XML view of the path from "GOE" to "CDG", tg:dept and tg:arr are properties of both edges and paths. This poses no problem, since the properties can always be properly discerned by prepending the respective parent element to the XPath, as in Example 44, Lines 1 and 2 (note that there, an XPath of the form "//tg:dept" would have returned the same result, since it only operates on a single edge).

Constructor

The constructor gid ← newGraph(rdf-spec) initializes a new CGDT instance with a given specification rdf-spec (which is an RDF specification of the desired instance; cf. Section 8.6) and returns a unique graph id.
8.3.2. Data Manipulation Language (DML)

The DML is also independent of the actual application domain (similar to, e.g., SQL as the DML for relational databases). The modifiers allow items to be added to the graph:

• addVertex(id, vertex-property-name-value-pairs) adds a new vertex id with the given vertex property values,
• addEdge(from, to, edge-property-name-value-pairs) adds a new edge (from, to) with the given edge property values (and adds the source and target vertices if not yet present). In the pseudo code, a slot-based notation is used, e.g. addEdge("FRA", "CDG", [dept ← "10:30", arr ← "11:50", code ← "LH123", price ← "185.00"]),

• addIntensionalEdge(from, to, edge-property-name-value-pairs, rdf-edge-spec) adds an intensional edge (from, to) with the given edge property values (and adds the source and target vertices if not yet present). The rdf-edge-spec describes how the dynamic properties of a new edge can be derived automatically from the context path, as well as potential conditions that have to be met in order for an intensional edge to be applicable, as demonstrated in Example 45 using a pseudo-syntax notation to facilitate an intuitive understanding of the salient concepts.

The accessors include the following:

• (var, [v1, ..., vn]) ← newVerticesBFS([v1 ← attr1, ..., vn ← attrn]) supports breadth-first exploration and binds var to the ids of the new vertices that have been added since the previous call of newVerticesBFS() and that are reachable by a path. Additionally, it is possible to specify optional attributes attr1, ..., attrn that are then bound to the corresponding variables (v1, ..., vn) for each vertex,

• (var, [v1, ..., vn]) ← nextVertexAStar([v1 ← attr1, ..., vn ← attrn]) binds var to the id of the next vertex that must be extended according to A∗ best-first search (and a given valuation function, cf. Section 8.4). Similarly to newVerticesBFS(), vertex attributes can additionally be bound,

• finished() returns true if all open paths are more expensive than the top-k result paths already found, and false otherwise (the actual determination of whether the top-k results have already been found relies on the result path specification and termination condition presented in Section 8.4),

• (v1, ..., vn) ← getResultPaths(v1 ← attr1, ..., vn ← attrn) returns a binding of the variables (v1, ..., vn) to the corresponding attributes of each path that is considered a result. The specification of the intended result paths is presented in Section 8.4,

• dumpResultPathsAsHTMLTable() is a convenience method that generates a (sorted) presentation of the result paths as an HTML table. For each result path p = (e1, ..., ek), all properties are shown, as well as the edges e1, ..., ek and their properties.

Parts of the DDL (i.e., the constructor) and the complete DML comprise the public interface of the CGDT, which is modeled as a MARS action and query language in Section 8.5. The action part of the CGDT language deals with updates of the graph, e.g. adding vertices or edges to the graph, while the query part consists of the accessors.

Example 45. During travel planning, the case may occur that an intermediate location is found that is close to the destination, but no means of public transportation (e.g. buses or trains) is available. In this case, an intensional edge can be added that captures the availability of taxis from this intermediate location to the final destination (illustrated here for the case of the airport in "Faro" (IATA code "FAO") with the final destination
"Vilamoura", the venue of CoopIS 2009):

    addIntensionalEdge("FAO", "Vilamoura",
        [price ← "20.00", ...],
        [dept(15, 45), cnd(arr < "22:00" and arr > "6:00"), ...])
The pseudo-syntax specifies that an intensional (taxi) edge between "FAO" and "Vilamoura" is available every day at either x:15 or x:45 o'clock (x ∈ {6, 7, ..., 20, 21}), and that the price for the taxi is 20 €. The corresponding RDF specification is rather lengthy, does not provide additional insights into the usage scenarios for intensional edges, and is thus omitted. Now, each time a connection to "FAO" is added to the graph, a respective edge from "FAO" to "Vilamoura" (where the departure time is chosen dynamically depending on the arrival of the flight) is added as well. Note that, if the plane arrived at, e.g., "22:01", a "taxi" edge on the next day at "6:15" is added to the graph. (This behavior could also have been encoded in the surrounding query workflow that drives the exploration of the search space, but this solution allows for a more declarative separation of concerns.)
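To illustrate how the DML is used from a surrounding query workflow (step 3 of the design method in Section 8.1), the following self-contained Python sketch emulates the breadth-first pattern around a toy stand-in for a CGDT instance; the class ToyGraph and the hard-coded connection table are illustrative assumptions that omit all properties, conditions, and path maintenance a real CGDT instance handles:

    class ToyGraph:
        """Minimal in-memory stand-in for the CGDT public interface."""
        def __init__(self):
            self.vertices, self.edges, self.fresh = set(), [], set()
        def addVertex(self, vid, props=None):
            if vid not in self.vertices:
                self.vertices.add(vid)
                self.fresh.add(vid)           # candidate for the next round
        def addEdge(self, frm, to, props=None):
            self.addVertex(frm); self.addVertex(to)
            self.edges.append((frm, to, props or {}))
        def newVerticesBFS(self):
            out, self.fresh = self.fresh, set()
            return out
        def finished(self):
            return False                       # no termination condition here
        def getResultPaths(self):
            return self.edges                  # toy result: the explored edges

    # Stand-in for the Web data views that deliver outgoing connections.
    CONNECTIONS = {"GOE": [("FRA", {}), ("HHN", {})], "FRA": [("LIS", {})],
                   "HHN": [("FAO", {})], "LIS": [("FAO", {})],
                   "FAO": [("Lagos", {})]}

    def bfs_workflow(graph, start_id, max_rounds=10):
        graph.addVertex(start_id)              # seed the exploration
        for _ in range(max_rounds):
            frontier = graph.newVerticesBFS()  # vertices to expand next
            if not frontier or graph.finished():
                break
            for v in frontier:                 # acquire edges from the "Web"
                for to, props in CONNECTIONS.get(v, []):
                    graph.addEdge(v, to, props)
        return graph.getResultPaths()

    print(bfs_workflow(ToyGraph(), "GOE"))

In a real query workflow, the loop body would be a RelCCS process whose Web queries fill in the edges, while insertion conditions, path extension, and termination are configured into the graph itself.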
8.4. Configurability of the Exploration Process

Although breadth-first search, best-first search, and depth-first search proceed differently in the large, the configuration of the exploration process can be specified by the same notions. Therefore, it is sufficient to exemplify it for the use in breadth-first search, which illustrates the set-oriented features best by doing the expansion in parallel.

Breadth-First Search

The underlying principle of breadth-first search is simple, which makes the strategy well-suited for graph exploration in online algorithms: starting with a set of one or more known vertices (e.g. airports), consider all edges from these vertices to any other (known or yet unknown) vertex. These edges are added to the graph, and (i) can be used to extend existing paths, and (ii) result in newly known vertices that can be used in the next step. The difference between the internal search patterns (e.g. breadth-first vs. best-first) does not affect the configuration of the behavior of the graph itself, which for all search patterns consists of conditions that specify the following:

1. insertion conditions: when a new edge is found, add it to the graph or discard it (e.g. when certain airlines or intermediate airports should be excluded),

2. path extension conditions: when a new edge is inserted, under which conditions can it be used to extend an existing path p (e.g. its departure time must obviously be later than the arrival time of p),

3. result path specification and termination: how many results are required, what constitutes a result path, and how should result paths be ranked?
By this, the CGDT separates the acquisition of edges (which must be programmed explicitly in query workflows) from the actual handling of their contributions to the graph (which is configured into the graph).

Insertion Conditions

For vertices and edges, conditions can be stated that need to be satisfied for the insertion of the item into the graph. Vertex insertion conditions are only concerned with properties of the vertex itself (e.g. the exclusion of flights via Paris Charles de Gaulle (IATA code: CDG) because of luggage handling problems can be expressed by the condition id ≠ "CDG"). Edge insertion conditions are only concerned with properties of the edge itself (e.g. duration < "10:00") and of its start and end vertices. An edge is also not inserted if one of its vertices does not satisfy the vertex insertion conditions. The RDF specification of the conditions also relies on the XML view of the vertex or edge after all properties have been initialized, as illustrated in Example 46.

Example 46. The condition that no flight over CDG is desired can be captured by this XPath expression:

    //cgdt:id != xs:string("CDG")
If the edge from Example 44 between FRA and CDG is now to be entered into the graph, first the derived properties for FRA are computed, resulting in this XML view:

    <cgdt:vertex xmlns:cgdt="http://www.semwebtech.org/languages/2008/cgdt#"
                 xmlns:tg="foo://bla/thetravelgraph">
      <cgdt:id>FRA</cgdt:id>
      <tg:timezone>1.0</tg:timezone>
    </cgdt:vertex>

For this vertex, the insertion condition is satisfied, and the vertex is added to the graph. Afterwards, the derived properties for CDG are computed, resulting in this XML view:

    <cgdt:vertex xmlns:cgdt="http://www.semwebtech.org/languages/2008/cgdt#"
                 xmlns:tg="foo://bla/thetravelgraph">
      <cgdt:id>CDG</cgdt:id>
      <tg:timezone>1.0</tg:timezone>
    </cgdt:vertex>

This vertex clearly violates the insertion condition (i.e. the test evaluates to "false"), so the vertex is rejected, as well as the edge from FRA to CDG, and both are not inserted into the graph. If another edge originating from or arriving at FRA is encountered, the already initialized vertex is re-used.
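Operationally, checking an insertion condition amounts to evaluating the configured expression against the serialized XML view. A minimal Python sketch of this step is shown below; it approximates the XPath test with a direct comparison on the single cgdt:id element of a vertex view (a simplification, not the CGDT's actual XQuery/XPath evaluation):

    import xml.etree.ElementTree as ET

    NS = {"cgdt": "http://www.semwebtech.org/languages/2008/cgdt#",
          "tg": "foo://bla/thetravelgraph"}

    def satisfies_insertion_condition(vertex_xml, banned_id="CDG"):
        """Approximates //cgdt:id != xs:string("CDG") on a vertex view
        that contains exactly one cgdt:id element."""
        root = ET.fromstring(vertex_xml)
        return root.find("cgdt:id", NS).text != banned_id

    fra = ('<cgdt:vertex xmlns:cgdt='
           '"http://www.semwebtech.org/languages/2008/cgdt#" '
           'xmlns:tg="foo://bla/thetravelgraph">'
           '<cgdt:id>FRA</cgdt:id><tg:timezone>1.0</tg:timezone>'
           '</cgdt:vertex>')

    print(satisfies_insertion_condition(fra))                        # True
    print(satisfies_insertion_condition(fra.replace("FRA", "CDG")))  # False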
Path Extension Conditions

Unlike edge insertion conditions, where the question is whether an edge should be inserted into the graph at all, path extension conditions allow stating application-specific constraints on when a new edge (x, y) can be used for extending a path p that ends in x to p ◦ (x, y). They are formulated in terms of the properties of the edge and of the path, where the base case of a path extension condition is used for paths of length 1, and the composite case governs paths of length > 1.

Example 47. In a graph without adorned properties, a path between two vertices v1 and vn is admissible if there exists a series of connected vertices (v1, ..., vn). This is clearly not sufficient in the travel planning scenario: here, paths are only admissible if there is enough time to change from one means of transport to the next, e.g. for a path ((s, ..., x), [arr = t1]) and a new edge (x, y, [dept = t2]), the new path ((s, ..., x, y), [...]) is only added if t2 – t1 > 1h, which is captured by the following path extension condition (in the base case, all paths satisfy the condition, since no changes are involved):

    "true"                       if length(p) = 1 (base case)
    edge.dept – path.arr > 1h    if length(p) > 1 (composite case)
Consider an invocation of addEdge(x, y, [...]) (i.e. a direct connection). If the destination airport y is not yet contained in the graph, it is added as a vertex (automatically retrieving its timezone property via a Web data view). The connection itself is added as an edge with its properties, and for all paths p = (s, ..., x), the path p′ = (s, ..., x, y) is a candidate for insertion. If the new edge's departure is more than one hour later than p's arrival, p′ is actually inserted with the appropriately computed property values. The actual specification of the base and composite case is again based on the XML view of the extended new path candidates. The syntactical details are analogous to Example 46, with the additional distinction of a base and a composite case, and are omitted here.

If for such newly added paths ending in a vertex y, edges (y, z) are already stored, the respective extended paths (s, ..., x, y, z) are also candidates for insertion, and so on (this is comparable to the computation of a fixpoint in deductive databases [AHV95]). Note that edges (like, in the above example, an edge (x, y, [dept = t3]) with t3 < t1) that cannot yet be used for extending an (already known) path p can possibly be used later for extending other paths that reach x with an earlier arrival time than p. For this reason, path extension conditions are usually stricter than edge insertion conditions. Vertices should only be considered as "new" to be extended in the next step if they actually became newly reachable by a path.

Example 48. Consider the following situation in a plain breadth-first search. The travel starts in "Göttingen" (Germany), abbreviated as "GOE", and the destination is "Lagos" (Algarve, Portugal). This example uses an intuitive syntax for paths of the form (s, ..., x, [attp]), where "s" is the start of the path, "x" is the end of the path, and [attp] is a list of properties of the path. When a path is extended, the notation (s, ..., x, [attp]; x, y, [atte]) is used to indicate
that the path (s, ..., x, [attp]) is extended with the edge (x, y, [atte]) to (s, ..., x, y), where [atte] is a list of properties of the edge used to extend the path. The connections are as follows:

• "GOE" → "HHN" (Frankfurt Hahn) by train, 8:00 – 12:30,
• "HHN" → "FAO" (Faro/Portugal), 15:10 – 18:10,
• "FAO" → "Lagos" (by train), 18:50 – 19:40,

and

• "GOE" → "FRA" (Frankfurt Airport), 8:30 – 10:30,
• "FRA" → "LIS" (Lisbon), 12:40 – 14:40,
• "LIS" → "FAO", 16:10 – 17:10, and
• "FAO" → "Lagos" (by train), 17:50 – 18:40.
Breadth-first exploration starts at Göttingen (the problem of actually finding the nearest airports is ignored at the moment) and adds the edges ("GOE", "HHN") and ("GOE", "FRA") to the graph. The new vertices are "FRA" and "HHN". The second round will expand these, find (among many others) the connections ("HHN", "FAO", [arr = "18:10"]) and ("FRA", "LIS", [arr = "14:40"]), and extend the paths accordingly. The next round expands the new vertices "FAO" and "LIS". For "FAO", it finds the connections ("FAO", "Lagos", [dept = "17:50"]) and ("FAO", "Lagos", [dept = "18:50"]). Only the latter of these can be used for extending the already existing path to ("GOE", "HHN", "FAO", [arr = "18:10"]; "FAO", "Lagos", [dept = "18:50"]). Expansion via "LIS" leads to the path ("GOE", "FRA", "LIS", "FAO", [arr = "17:10"]), again reaching "FAO", but already at 17:10. Now, if the edge ("FAO", "Lagos", [dept = "17:50"]) has been kept before, it can be used immediately to extend the path to ("GOE", "FRA", "LIS", "FAO", [arr = "17:10"]; "FAO", "Lagos", [dept = "17:50"]). Otherwise, the vertex "FAO" must be extended again, running another round of breadth-first search, requiring another evaluation of a Web data view, yielding the edge ("FAO", "Lagos", [dept = "17:50"]) again, and resulting in the – finally faster – connection.

Example 48 shows that there are two possibilities:

• when an edge is added, store it only if it can be used for extending a path (this keeps the edge ("FAO", "Lagos", [dept = "18:50"]), but at first discards ("FAO", "Lagos", [dept = "17:50"])). In this case, the vertex must potentially be processed again later,
• store all added edges (that satisfy all edge insertion conditions), and extend paths appropriately, keeping unused edges for later use. In this case, newly stored vertices should only be extended in the next step if they are actually reachable by a path.

In the design and implementation of the CGDT, the latter alternative (which costs some additional storage, but avoids running expensive external queries twice) was chosen. The accessors newVerticesBFS() and nextVertexAStar() only return those vertices that are reachable by a path. By this, paths can be extended not only by newly found edges, but afterwards also by already stored ones. As already mentioned above, this amounts to
As already mentioned above, this amounts to computing a fixpoint, i.e. extending paths as long as possible while keeping track of the associated newly reachable vertices.

Regardless of specific properties of the path, generic insertion conditions are also supported. For instance, it can be asserted that only paths that contain no cycles can be inserted into the graph, which is realized as a graph "configuration" on the RDF level – i.e. the graph instance is of type cgdt:CycleFreeGraph (note that this only applies to paths, cf. Section 8.6 and Appendix A.1).

Path Subsumption

The path extension conditions are evaluated locally for a single path. Yet, there is still the problem of multiple paths that are "similar": e.g., in Example 48 the path ("GOE", "FRA", "LIS", "FAO", [arr = "17:10"]) can be extended by both train connections from "FAO" to "Lagos", resulting in two similar paths, which follow exactly the same route but allow for different amounts of time to change from the flight to the train in "FAO". In some scenarios this might be relevant, e.g. when it is uncertain how much time is required to go from one part of the airport to the train station, or how much time it will take to pass customs. Also, the longer connection will be ranked lower by the valuation function. However, the number of (possibly) redundant combinations increases dramatically in a real-life search for connections, and often the user is interested only in the significant results to help her reach a more well-founded decision.

To reduce the returned results to those that are significant to the user, the CGDT supports the notion of a subsumption criterion on paths. This criterion also helps to prune the search space during the exploration and further constrains the set of paths that are considered for extension. The subsumption criterion is based on two (XQuery) functions, which both operate on the XML views of two paths p1 and p2:
• sameRoute(p1, p2): determines whether two paths are comparable (note that this function is "symmetric", i.e. sameRoute(p1, p2) = sameRoute(p2, p1)), and
• pathPrunable(p1, p2): returns true if path p1 is subsumed by path p2, i.e. if path p1 is redundant.

Thus, for a set of n paths, in a naïve approach, both functions are evaluated n · (n−1) times (since it does not make sense to compare a path to itself). Section 8.5.1 discusses some optimization strategies for reducing the number of required calls; e.g., by evaluating both pathPrunable(p1, p2) and pathPrunable(p2, p1) after the initial check of sameRoute(p1, p2), the latter only needs to be invoked n · (n−1)/2 times (cf. Example 49).

Example 49 As mentioned above, the two subsumption functions sameRoute(p1, p2) and pathPrunable(p1, p2) operate on the XML view of paths. Therefore, two variables are initially bound: $d1 is bound to the XML view of path p1, and $d2 is bound to the XML view of p2, accordingly. This way, arbitrarily complex subsumption criteria can be defined; the user only needs to focus on writing the two appropriate functions. The determination of the respective path combinations is handled transparently by the CGDT, as will be elaborated in Section 8.5.1.
This XQuery expression shows a subsumption criterion for the travel planning scenario:
declare function local:pathPrunableByEarlierDeparture($d1, $d2)
    as xs:boolean {
  xs:dateTime($d1/tg:dept) <= xs:dateTime($d2/tg:dept)
};

declare function local:pathPrunableByLaterArrival($d1, $d2)
    as xs:boolean {
  xs:dateTime($d1/cgdt:last/cgdt:edge/tg:arr) >=
    xs:dateTime($d2/cgdt:last/cgdt:edge/tg:arr)
};

declare function local:pathPrunable($d1, $d2) as xs:boolean {
  local:pathPrunableByEarlierDeparture($d1, $d2) and
  local:pathPrunableByLaterArrival($d1, $d2) and
  not ($d1/tg:price < 0.97 * $d2/tg:price)
};

declare function local:sameRouteRec($d1, $d2) as xs:boolean {
  $d1/cgdt:last/cgdt:edge/cgdt:from//cgdt:id =
    $d2/cgdt:last/cgdt:edge/cgdt:from//cgdt:id
  and (($d1/cgdt:front/* and $d2/cgdt:front/* and
        local:sameRouteRec($d1/cgdt:front/cgdt:path,
                           $d2/cgdt:front/cgdt:path))
    or (not ($d1/cgdt:front/*) and not ($d2/cgdt:front/*)))
};

declare function local:sameRoute($d1, $d2) as xs:boolean {
  $d1/cgdt:pathTo = $d2/cgdt:pathTo and local:sameRouteRec($d1, $d2)
};

declare function local:pathLessEqual($d1, $d2) as xs:integer {
  if (local:sameRoute($d1, $d2)) then
    if (local:pathPrunable($d1, $d2)) then -1
    else if (local:pathPrunable($d2, $d1)) then 1
    else 0
  else 0
};

(: -1: the first path is subsumed, +1: the second path is subsumed,
   0: the paths are incomparable :)
return local:pathLessEqual($d1, $d2)
Intuitively, a path is subsumed if it starts earlier, arrives later, and is not significantly cheaper than the other path. Note how this implementation makes use of the recursive nesting of paths to determine whether two paths can be compared to each other. The function local:pathLessEqual demonstrates how the above-mentioned (initial) optimization strategy can be implemented in XQuery. Subsumption criteria are inherently subjective, and thus the criterion shown in Example 49 is just one possibility, not "the" subsumption criterion for travel planning in general. It is just another configuration parameter that can be tweaked to best reflect the user's interests.

From an operational point of view, the CGDT operates in a set-oriented fashion, i.e. a set of edges is sent to the CGDT engine for addition to a graph instance at a time. Those that satisfy the corresponding vertex and edge conditions are used to compute the candidate set of new paths. For each one of those, the path extension conditions are checked, and the set of paths that satisfies them is used as input for the path subsumption check – as a consequence, only paths that are checked in the same "batch" can subsume each other. Section 8.5.1 further elaborates on the practical implications of the way these batches are determined.

Result Path Specification and Termination

The above conditions control how the internal information of the CGDT instance is extended when adding edges. Additionally, it must be specified when the process ends, and preferably, already during the process, only new vertices that are promising to continue the search should be selected for the next step. The end of the process can be captured by the notion of having found a sufficient number of results that match a certain goal criterion. This result specification is expressed via a filter condition that determines which paths can qualify as intended results (e.g. those paths that end in the desired final destination), optionally a valuation function on paths that can be seen as a cost measure and that must be strictly monotonic wrt. path extension, and an integer k, which governs how many results should finally be returned.

Example 50 In the travel planning scenario, a reasonable goal criterion might rely on the duration and the total price of a result path, since these are the properties a human
user would most likely try to minimize (generally, all path properties can be used in the valuation function). Assume the user is willing to accept one more hour of travel in order to save 100 €:

    fgoal = duration + price / 100
Assume that the duration is given in (metric) hours, e.g. 10.5 h. Furthermore, the final destination is to be "St. Malo", and the user is satisfied with the best 10 routes (i.e. k = 10). The valuation function, as well as the final destination specification, are realized on top of the XML view representation of paths; the former leverages the expressive power of XQuery for specifying elaborate functions, while the latter just requires an XPath expression, e.g.:
/cgdt:path/cgdt:pathTo = xs:string("St. Malo")
When breadth-first search is applied, paths that are "above" the limit of the best k results so far (i.e. more expensive) are not further extended, and vertices that are only reachable by such paths are not expanded. This prunes the search space as soon as k paths have been found that satisfy the filter condition, and guarantees termination. In the case of A∗ search, the valuation function is used to choose the next vertex to be extended. The generic RDFS ontology of the CGDT that provides the necessary vocabulary for specifying concrete CGDT instances can be found in Appendix A.1.
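The pruning rule itself is simple; the following minimal Java sketch illustrates it (names are illustrative – in the CGDT, the valuation function is an XQuery function over the XML view of a path, and the bookkeeping is part of the engine). It is valid precisely because the valuation is strictly monotonic wrt. path extension: extending a path can only make its valuation worse.

import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative top-k pruning, assuming a strictly monotonic valuation.
class TopKPruner {
    private final int k;
    private final PriorityQueue<Double> bestK =         // max-heap of the k best (lowest) valuations
        new PriorityQueue<>(Comparator.reverseOrder());

    TopKPruner(int k) { this.k = k; }

    // Called when a path satisfies the result filter condition.
    void reportResult(double valuation) {
        bestK.add(valuation);
        if (bestK.size() > k) bestK.poll();             // drop the worst of the k+1
    }

    // A path may still be extended only if it can beat the current bound.
    boolean mayExtend(double pathValuation) {
        if (bestK.size() < k) return true;              // fewer than k results: no bound yet
        return pathValuation < bestK.peek();            // extensions only get more expensive
    }
}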
8.5. Technical Realization

So far, the declarative semantics of the CGDT has been discussed. Section 8.5.1 sheds some light on the implementation of the "operational" semantics of the CGDT, while Section 8.5.2 presents the seamless integration of the CGDT into the MARS framework as an action and query language.
8.5.1. Implementation Details

The CGDT is implemented as a Web Service that consists of two components: (i) an underlying database for storing the actual graphs, and (ii) the CGDT engine that translates graph specifications into database tables and that processes the abstract datatype's DML operations according to the conditions stated in the configuration. The engine is mainly based on a Graph class. Each graph is represented by an instance of this class. Since each graph potentially has a different schema, each graph is stored separately. For each subschema, there is a database table that holds the data, and a Vertex/Edge/PathHandler instance on the Java level that manages the access operations on the table (cf. Table 8.1).
subschema       database table   Java classes
Vertex schema   vertex table     VertexHandler
Edge schema     edge table       EdgeHandler
Path schema     path table       PathHandler
(all three handlers are managed by one Graph instance)

Table 8.1.: CGDT Implementation: Schematic Overview

Processing of the DDL: Graph Specifications

The Graph constructor is invoked with the graph specification that contains the three subschemata and the constraints as introduced in Section 8.3.1. A Graph instance is created with the appropriate handlers that store the definitions of the derived attributes, the insertion conditions, etc., and the tables are created with the attributes of the respective subschema.

Processing of the DML

DML operations are submitted by the Web Service to the respective Graph instance and further delegated to the appropriate methods of the Handler instances. Upon insertion of a new item, first its derived attributes are computed, and then, if it satisfies all respective insertion conditions, it is inserted into the respective database table. Upon insertion of an edge, additionally, the source and target vertices are added to the vertex table if not yet present, and the paths that are candidates for extension are determined. For each of these paths, all path extension conditions are first checked in isolation, and the set of remaining paths is the input to the path subsumption check. Finally, the set of paths that "survive" both the path extension conditions and the path subsumption check is inserted into the path table.

The accessors newVerticesBFS() and nextVertexAStar() (implemented in the VertexHandler) select the ids of the appropriate items from the database and return them (additional internal columns supporting the chosen search variant are maintained in the vertex table). Access to properties of certain vertices, edges, or paths via their id is also supported (implemented through their mapping to the XML view). The accessor getResultPaths(slot-selection-spec) (implemented in the PathHandler) selects the paths from the path table that satisfy the ResultSpecification and returns the specified set of variable bindings. In contrast, the accessor dumpResultPathsAsHTMLTable() directly generates an HTML document and stores it on the server, so the results can be viewed in a Web browser.

XML Views & Path Subsumption

The support for powerful access and manipulation mechanisms for graph items (i.e. vertices, edges, and paths) comes at the cost of transforming the (currently relational) database-backed storage of the internal data structures into a main-memory XML document representation. To alleviate these conversion costs to some extent, a (main-memory) cache is maintained that stores the associated XML view for each graph item that has already been requested. This is especially important for checking the subsumption criteria, because here the same paths are required for many different checks.
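Such a cache essentially amounts to memoization by item id; a minimal, hypothetical Java sketch (the XML view is simplified to a string, and the database access is abstracted by a loader function):

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative main-memory cache for XML views of graph items.
class XmlViewCache {
    private final Map<String, String> cache = new HashMap<>(); // item id -> XML view

    // Returns the cached XML view for the given item id, materializing it
    // from the database via the supplied loader on the first request only.
    String viewOf(String itemId, Function<String, String> loadFromDatabase) {
        return cache.computeIfAbsent(itemId, loadFromDatabase);
    }

    // Views of mutable items must be invalidated when the item changes.
    void invalidate(String itemId) {
        cache.remove(itemId);
    }
}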
An initial optimization strategy for the path subsumption check was already presented in Section 8.4: by evaluating both pathPrunable(p1, p2) and pathPrunable(p2, p1) whenever sameRoute(p1, p2) is satisfied, the execution time is decreased, since the latter function only needs to be evaluated n · (n−1)/2 times (for n candidate paths). A further optimization that enormously reduces the number of invocations of both the sameRoute(p1, p2) and the pathPrunable(p1, p2) function is to keep track of each path that is reported as subsumed. For example, assume the set of paths {p1, . . . , p5}, where the subsumption check is currently being performed for path p1. Now, p1 is compared to the paths p2, . . . , p5, and the paths p3 and p5 are subsumed by it. Consequently, no further checks for p3 and p5 need to be computed in the future, since it is not relevant by which path they are subsumed, only that they are subsumed. This way, many checks can be skipped at runtime.

As mentioned above, the path subsumption check is based on a set of candidate paths, which in turn depends on the set of edges that is sent to the CGDT instance for addition. When the RelCCS process that "feeds" the graph with new edges becomes asynchronous, i.e. different branches are running in parallel, it is possible that edges that would subsume each other arrive in different sets for addition to the graph. As a result, paths may be added although they would not pass the subsumption check if they were in the same "batch". This phenomenon is discussed further in Chapter 9, where the implementation of the travel planning scenario with an accompanying evaluation is presented. Note that by this, relevant paths are never discarded; only more paths than strictly necessary are retained. To ensure that no unintended result paths are finally reported, for each invocation of getResultPaths(slot-selection-spec) and dumpResultPathsAsHTMLTable(), the path subsumption check is executed for the set of result paths before the paths are actually returned.
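Combining both optimizations, the pairwise check performed for a batch of candidate paths can be sketched as follows (illustrative Java; sameRoute and pathPrunable stand for the XQuery functions of Example 49, evaluated over the cached XML views):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;

// Illustrative driver for the pairwise path subsumption check.
class SubsumptionCheck {
    // Returns the paths of the batch that are not subsumed by any other path.
    // sameRoute is evaluated at most once per unordered pair (n*(n-1)/2 calls),
    // and paths already reported as subsumed are skipped in later comparisons.
    static <P> List<P> surviving(List<P> batch,
                                 BiPredicate<P, P> sameRoute,
                                 BiPredicate<P, P> pathPrunable) {
        Set<Integer> subsumed = new HashSet<>();
        for (int i = 0; i < batch.size(); i++) {
            if (subsumed.contains(i)) continue;          // already subsumed: skip entirely
            for (int j = i + 1; j < batch.size(); j++) {
                if (subsumed.contains(j)) continue;
                P p1 = batch.get(i), p2 = batch.get(j);
                if (!sameRoute.test(p1, p2)) continue;   // not comparable
                if (pathPrunable.test(p1, p2)) { subsumed.add(i); break; }
                if (pathPrunable.test(p2, p1)) subsumed.add(j);
            }
        }
        List<P> result = new ArrayList<>();
        for (int i = 0; i < batch.size(); i++)
            if (!subsumed.contains(i)) result.add(batch.get(i));
        return result;
    }
}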
8.5.2. CGDT as MARS Action and Query Language

To employ the CGDT in the MARS framework, two different modes can be discerned:

1. opaque: in this mode, MARS is oblivious to the underlying language, the operation is executed tuple-at-a-time (i.e. MARS internally loops over the set of tuples), and arbitrary language fragments can be used to interoperate with the CGDT,
2. language-aware: in this mode, the CGDT needs to be wrapped as a MARS action/query language with an accompanying entry in the Language and Service Registry (LSR, cf. Chapter 3) and an XML serialization of the accessors and modifiers. As an upside, the CGDT becomes a "full member" of the MARS language family, allowing for a tight integration, i.e. (among others) the ability to automatically analyze the input and output variable characteristics.

For the embedding of the CGDT, the language-aware approach was chosen, i.e. for each accessor (≡ MARS query) and modifier (≡ MARS action), the CGDT XML DTD has a corresponding XML element definition. For the sake of completeness, this is the LSR entry of the CGDT MARS language:
@prefix lsr: <...> .
@prefix mars: <...> .

<...> a mars:QueryLanguage;
    mars:name "Configurable Graph DataType Service" ;
    mars:shortname "CGDT" ;
    mars:isa mars:ActionLanguage;
    mars:is-implemented-by <...> .

<...> a mars:QueryService, mars:ActionService;
    lsr:has-task-description [ a lsr:TaskDescription;
        lsr:describes-task <...> ;
        lsr:provided-at <...> ;
        lsr:Reply-To "body" ;
        lsr:Subject "body" ;
        lsr:input "element request" ;
        lsr:variables "*" ;
        lsr:mode "asynchronous" ];
    lsr:has-task-description [ a lsr:TaskDescription;
        lsr:describes-task <...> ;
        lsr:provided-at <...> ;
        lsr:Reply-To "n.a." ;
        lsr:Subject "n.a." ;
        lsr:input "item" ;
        lsr:variables "n.a." ];
    lsr:has-task-description [ a lsr:TaskDescription;
        lsr:describes-task <...> ;
        lsr:provided-at <...> ;
        lsr:Reply-To "body" ;
        lsr:Subject "body" ;
        lsr:input "element execute" ;
        lsr:variables "*" ];
    lsr:has-task-description [ a lsr:TaskDescription;
        lsr:describes-task <...> ;
        lsr:provided-at <...> ;
        lsr:Reply-To "body" ;
        lsr:Subject "body" ;
        lsr:input "element response" ;
        lsr:variables "n.a." ].
Essentially, it is a mixture of the common methods of a MARS action and query language. The XML syntax of the accessors and modifiers, which can be used as atomic constituents of RelCCS processes (cf. Section 7.3.1), is omitted here and will be introduced in the next section with the help of a use case.
8.6. Use Case: Bacon Numbers + CGDT

The phenomenon of the Bacon Numbers has already been introduced in Section 7.4.1. There, only the numbers, i.e. the shortest distances in the co-star graph, were computed, without regard for the actual connections behind the results. With the help of the CGDT, the query workflow proposed there can be adapted to also record the explicit movie connections between actors. The most relevant fragments of the resulting 138-line process are reproduced below (elisions are marked by "..."; the line numbers on the left refer to the complete listing):
  1   ...                                  <!-- from is bound to "Kevin Bacon",
                                                to is bound to "Samuel L. Jackson" -->
 19   <cgdt:Query cgdt:name="NewGraph">
 20     <cgdt:specification><![CDATA[
          kb:graph a cgdt:CycleFreeGraph;
            cgdt:hasEdgeSchema kb:es;
            cgdt:hasPathSchema kb:ps;
            cgdt:hasResultSpecification kb:rs.

          kb:es a cgdt:EdgeSchema;
            cgdt:hasProperty kb:connection.
          kb:connection a cgdt:Property;
            cgdt:label "connectingMovie";
            cgdt:range xs:string.

          kb:ps a cgdt:PathSchema;
            cgdt:hasProperty kb:steps.
          kb:steps a cgdt:CountProperty;
            cgdt:label "steps";
            cgdt:range xs:integer;
            cgdt:baseProperty "//cgdt:edge".

          kb:rs a cgdt:ResultSpecification;
            cgdt:filterCondition [
              cgdt:language <...> ;
              cgdt:specification
                "/cgdt:path/cgdt:pathTo = xs:string(\"Samuel L. Jackson\")" ].
        ]]></cgdt:specification>
 53   </cgdt:Query>
      ...
      <cgdt:Action cgdt:name="DumpResultPathsAsHTMLTable" />
      ...
105   <cgdt:Action cgdt:name="AddEdge">
        ...
109     <cgdt:properties>
110       <cgdt:property cgdt:name="connectingMovie" cgdt:useVariable="movie" />
        </cgdt:properties>
        ...
112   </cgdt:Action>
      ...
120   <cgdt:Query cgdt:name="NewVerticesBFS" />
      ...
Like in Section 7.4.1, the whole process is a sequence, but this time only the input variables from (the starting actor – "Kevin Bacon") and to (the destination actor – "Samuel L. Jackson") are bound. The helper variable steps is not required here, but is modeled as a path property in the CGDT instance. Next, a query is issued that initializes the CGDT instance (<cgdt:Query cgdt:name="NewGraph">. . .</cgdt:Query>, Lines 19 to 53). The specification of the graph is provided wrapped inside <cgdt:specification>. . .</cgdt:specification> (Lines 20 to 52) as an RDF graph that uses the notions of the CGDT ontology (cf. Section 8.3 and Appendix A.1). This "blueprint" is used to initialize the respective database tables as a side effect, and the query returns a graphID that is
used to uniquely identify the instance later in the process. The created CGDT instance will only insert those paths that contain no cycles (indicated by the type of the graph: cgdt:CycleFreeGraph), and has one edge property ("connectingMovie") that stores the name of the movie two actors acted in together, and one automatically derived path property ("steps") that counts the number of edges in the path (i.e. determines the Bacon number). The associated result specification – the path has to end with the "destination" actor "Samuel L. Jackson" – is here used only to determine the relevant result paths.

After the graph initialization "query", the definition of the recursive process fragment that drives the exploration process starts on Line 55, and then – similar to the variant without the CGDT – the actual execution starts by calling foo:nextStep (Line 135) with the local variable inActorl bound to the value of the variable from (here "Kevin Bacon"). The other local variables of the recursive definition will be bound during later steps.

The execution of the process continues on Line 60 with an Alternative and the initial tuple t1 := (inActorl/"Kevin Bacon", from/"Kevin Bacon", to/"Samuel L. Jackson", graphID/"23"). The variables local to the recursive process definition are again depicted with the subscript "l" to distinguish them from the "globally" bound variables. The first alternative branch (Line 61) checks whether – as indicated by the quantifier exists – the variable inActorl has the same value as the variable to in at least one tuple, which is not the case at the beginning. Note that while in Section 7.4.1 the test was evaluated "opaquely" in XPath, here the test is evaluated by using the internally provided comparators in the ops namespace (Lines 63 to 66, also cf. Section 3.4).

For the second branch (starting on Line 71), which requires that no tuple satisfies the condition of the first branch, the test is satisfied, and the process continues with the evaluation of the two "chained" Deep Web views as described in Section 7.4.1 (Lines 78 to 95). Afterwards, as an optimization, a projection onto the variables (inActorl, outActorl, moviel, graphID, to) is computed and duplicates are eliminated (Lines 96 to 103). Note that the sequence at the beginning also removes duplicates, as indicated by the modifier ccs:duplicates="no" in Line 71. Since the "persistent" data is kept in the graph itself, the process can be relieved from additional variable bindings as soon as they are no longer necessary.

The <cgdt:Action cgdt:name="AddEdge">. . .</cgdt:Action> action (Lines 105 to 112) adds the newly found starting edges to the graph (the graphID is also passed as an input variable, but this can be determined automatically, since the CGDT MARS language supports the analyze-variables task as highlighted in Section 8.5.2), resulting in the new edges and paths that are shown in Table 8.2. There, for the edge table, the property "connectingMovie" has been passed as a parameter explicitly (cf. Lines 109 and 110), and the derived property "steps" has been computed automatically. Afterwards, another projection onto the variables (graphID, to) is computed (Lines 111 to 115), and the graph is queried for new vertices according to the breadth-first search pattern (Lines 120 to 122); the A∗ search pattern would just non-deterministically return an arbitrary vertex here, since no valuation function has been specified. After another projection with duplicate removal (Lines 123 to 128), the recursive definition is invoked (Line 129), with the local variable inActorl initialized with the vertices obtained from the graph (denoted as newActorsl), while the local variables moviel, outActorl, and newActorsl remain unbound at the beginning of the recursion – resulting in the bound variables (inActorl, to, graphID), which are not the same bound variables that were present for t1, since the "source" actor, identified by the variable from, has been projected away in Lines 96 to 103.
id    from            to                  connectingMovie
e1    "Kevin Bacon"   "Sarah M. Gellar"   "The Air I Breathe"
e2    "Kevin Bacon"   "Forest Whitaker"   "The Air I Breathe"
e3    "Kevin Bacon"   "Kyra Sedgwick"     "Loverboy"
e4    "Kevin Bacon"   "Marisa Tomei"      "Loverboy"
...   ...             ...                 ...

id    from            to                  front   last   steps
p1    "Kevin Bacon"   "Sarah M. Gellar"   nil     e1     1
p2    "Kevin Bacon"   "Forest Whitaker"   nil     e2     1
p3    "Kevin Bacon"   "Kyra Sedgwick"     nil     e3     1
p4    "Kevin Bacon"   "Marisa Tomei"      nil     e4     1
...   ...             ...                 ...     ...    ...

Table 8.2.: CGDT contents after 1st round, edge table (top) – path table (bottom)
In this round, the test whether there is a tuple where inActorl = to fails as well, and the same extension step as above is performed. The details of this step are a straightforward variation of what has already been presented in Section 7.4.1 and are thus omitted. The contents of the CGDT after the second round are shown in Table 8.3. Among the new vertices that are available is now "Samuel L. Jackson", and the next iteration is thus invoked with – amongst others – the tuple (inActorl/"Samuel L. Jackson", to/"Samuel L. Jackson", graphID/"23"). This time, the test of the second branch (Line 72) fails, while the test of the first branch (Line 62) succeeds, since there exists some (i.e. the above) tuple whose inActorl is the intended destination actor to. Then, the <cgdt:Action cgdt:name="DumpResultPathsAsHTMLTable"/> action is executed, which generates and stores an HTML result table (cf. Table 8.4), and execution stops.

Clearly, it would have been possible to use the graph itself to decide whether the result has already been found (i.e. by calling the accessor finished() that returns "true" once the desired goal criterion is met). Yet, for the benefit of a better comparison with the "original" query workflow of Section 7.4.1, the termination was handled in the query workflow by an explicit test. Nonetheless, the proposed solution to the travel planning scenario in Chapter 9 makes heavy use of (almost) all capabilities of the CGDT that have been discussed in this chapter and relies on the CGDT to determine the termination of the associated query workflow.
id     from                to                    connectingMovie
e1     "Kevin Bacon"       "Sarah M. Gellar"     "The Air I Breathe"
e2     "Kevin Bacon"       "Forest Whitaker"     "The Air I Breathe"
e3     "Kevin Bacon"       "Kyra Sedgwick"       "Loverboy"
e4     "Kevin Bacon"       "Marisa Tomei"        "Loverboy"
...    ...                 ...                   ...
e234   "Sarah M. Gellar"   "Samuel L. Jackson"   "Quantum Quest: A Cassini Space Odyssey"
...    ...                 ...                   ...

id     from                to                    front   last   steps
p1     "Kevin Bacon"       "Sarah M. Gellar"     nil     e1     1
p2     "Kevin Bacon"       "Forest Whitaker"     nil     e2     1
p3     "Kevin Bacon"       "Kyra Sedgwick"       nil     e3     1
p4     "Kevin Bacon"       "Marisa Tomei"        nil     e4     1
...    ...                 ...                   ...     ...    ...
p345   "Kevin Bacon"       "Samuel L. Jackson"   p1      e234   2
...    ...                 ...                   ...     ...    ...

Table 8.3.: CGDT contents after 2nd round, edge table (top) – path table (bottom)
8.7. Related Work

The notions of online algorithms [Alb03] in general and dynamic graph algorithms [EGI99] cover a broad spectrum of aspects. This includes scenarios where the current situation is completely known but changes, as well as situations where the underlying situation is actually static but not completely known and is processed incrementally, as in dynamic search algorithms. The CGDT is tailored to the special, but still very common, case where exploration is dynamic but monotonic: vertices and edges, once added to the graph, remain unchanged for the time the algorithm runs. The underlying graph is also dynamic, but every run is based on a (non-transactional) snapshot that is explored dynamically.

Online algorithms over unknown graphs are investigated by many authors under different aspects (total exploration [DP90], search, etc.). For path search, breadth-first search and best-first search by A∗ (see e.g. [RN03] for an overview) are the most prominent ones. There are several variations and improvements on A∗; e.g., while the classic A∗ algorithm finds solutions that are simple paths, the LAO∗ [HZ01] algorithm can find solutions with loops. So far, for the considered application scenarios, breadth-first and best-first search have proven sufficient. Yet, since the concrete search algorithm is not hard-wired into the CGDT but is provided as an accessor (e.g. newVerticesBFS()), the CGDT can easily be extended with support for other graph search algorithms, if required (by adding a corresponding new accessor to the graph, e.g. nextVertexLAOStar()).
#   Result Path
    from              to                    steps
1   "Kevin Bacon"     "Samuel L. Jackson"   2
    Path consists of the following edges:
      #   from                to                    connectingMovie
      1   "Kevin Bacon"       "Sarah M. Gellar"     "The Air I Breathe"
      2   "Sarah M. Gellar"   "Samuel L. Jackson"   "Quantum Quest: A Cassini Space Odyssey"
...

Table 8.4.: CGDT HTML result table
Algorithmic aspects of searching in a completely known graph can sometimes be transferred to online algorithms; e.g., [GH05] consider shortest path search using a bounding box based on a multiple of the Manhattan distance to prune the search space. Concerning the travel planning case, such pruning does not help much in the general case: for short distances, the best solution may use a hub outside the scope (e.g., traveling from Japan to Eastern Russia in some cases involves a stop in Moscow), and for long distances (or appropriately high overestimations of the bounding box), the search space amounts to the whole earth. Yet, by using the TopK operator over the distance between cities (as illustrated in Chapter 9), a more dynamic pruning of the search space can be achieved that actually depends on the runtime data and that does not require a fixed radius to be pre-specified up front.

Also, research on the composition of Web Services like [RK07, MS02] is a related area, but it generally deals with a higher level of abstraction, where the concrete modeling and algorithmic handling of the data is not described. Such approaches can be complemented with the use of the CGDT, since it declaratively covers the data-oriented aspects, as shown in the sample processes in Section 8.6 and Chapter 9.

Most works on graph schemas have a different goal, namely to describe a graph-based data model in the sense of semistructured data like RDF on the schema level by the labels of its vertices and edges. In these languages, the graph is not part of the domain and is not used for applying graph algorithms; rather, the domain is modeled as a graph which is updated and queried. Many approaches in this area are built upon the idea of graph transformations [Sch98, SWZ99, EH00, NNZ00, BH02, Tae03, FM07, FMRS07]: there, the underlying domain, e.g. a bank account, is modeled as a graph, and the queries are modeled as graph patterns, while the updates are represented as graph transformations. Graph transformations have a left-hand side (the pre-conditions, i.e. a graph pattern) and a right-hand side (the post-conditions, i.e. the replacement pattern that can address the attributes occurring in the left-hand side), which allows for an intuitive visual representation of transformations. The Progres language [SWZ99] allows – like the
CGDT – to assign (optionally derived) attributes not only to vertices, but also to edges and paths. Paths are seen as derived edges that are declared in a rule-based way. Other prominent graph transformation languages are AGG [Tae03], which allows Java objects as value types and also features Java expressions, and Fujaba (From UML to Java and back again) [NNZ00], where the focus is on providing users with a visual programming language for round-trip engineering support for UML and Java. A more comprehensive comparison of these three languages can be found in [FMRS07]. Graph transformation languages are used for many different applications, e.g. for modeling knowledge representation languages [Sch98], for system modeling and model evolution [EH00], as well as for the refactoring of UML [FM07]. Overall, graph transformation is an expressive general-purpose formalism for modeling complex application domains with an intuitive visual representation. The CGDT is more narrow in scope; at its heart is the modeling and management of (graph-structured) instance data in query workflows. For this, different notions are required; e.g., graph transformation languages are not concerned with finding a set of new vertices guided by a valuation function over (a subset of) materialized paths, but are designed for retrieving whole subgraphs via graph search patterns.

From a more practical point of view, the question arises whether existing libraries, such as LEDA [MN89], could be used as a basis for an efficient implementation. The current implementation is divided into two parts: the controller logic is implemented in Java, while the actual graph instance data is kept in a relational database (cf. Section 8.5.1), which so far has proven to be an appropriate choice. This is one of the strong points of the declarative semantics of the CGDT: it is not bound to a specific implementation. Consequently, it can easily be implemented in an arbitrary (reasonably expressive) programming language without needing to change a single line of an existing query workflow.

In line with the above argumentation, the CGDT can be seen as a domain-specific language (DSL) [vDKV00]. In [vDKV00], a DSL is defined as follows: "A domain-specific language (DSL) is a programming language or executable specification language that offers, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain." In this respect, the CGDT is a DSL for the declarative modeling of graph-structured domains, which leverages the expressive power of XPath and XQuery as sublanguages, e.g. for the definition of derived properties. As such, it shares their major benefits:

1. a concrete graph definition can be expressed using common graph terminology and in terms of the notions of the problem domain (e.g., in the travel planning scenario, notions such as the price of a transitive connection, i.e. a path, can be captured intuitively), and
2. the resulting CGDT specifications are concise, mostly self-documenting, and can be reused (to some extent) in different scenarios.

As a downside, it encounters the same cardinal challenges (cf. [vDKV00]):

1. new users are not familiar with the notions of the CGDT and need to be educated
in order to use it correctly, and
2. the CGDT is currently interpreted at runtime, i.e. once the CGDT engine reads a specification, it generates the relevant database tables, and the XPath and XQuery expressions are evaluated by an external engine, which is less efficient than hard-coding a solution for a specific scenario.

The first challenge is addressed by designing the query workflow and the associated CGDT specification in cooperation between a domain expert and a skilled process designer. Besides, with the growing proliferation of the Semantic Web, it is to be expected that the number of users comfortable with modeling applications in RDF will increase, which is one of the major prerequisites for translating the informal domain notions into the CGDT ontology terminology. The second challenge can be met incrementally – currently, the focus is on a proof of concept, and the assumption is that the user is willing to wait a reasonable amount of time for high-quality results. But initial experiences have already led to some improvements, such as the caching of XML views (cf. Section 8.5.1). A hard lower barrier for the online exploration of graphs is the response time of the queried Web data sources, which cannot be influenced (even local caching has limits for sources with high volatility). Finally, the CGDT offers a generic solution for graph-structured domains, while a hard-coded solution only applies to a specific setting.
Chapter 9. Application Scenario: Travel Planning

I love to travel, but I hate to arrive.
– Albert Einstein (March 14, 1879 – April 18, 1955)
Contents
9.1. Introduction
9.2. Web Data Source Overview
9.3. The Graph Configuration
9.4. The Exploration Process
9.5. The Exploration Process (Revisited)
9.6. Experimental Evaluation
9.1. Introduction

Consider the problem of finding either the cheapest or the shortest (in terms of total time spent traveling) route to a given location (e.g. for a conference travel), or a combination of both. Human, manual search usually employs some kind of intuitive strategy. Roughly, the strategy is to start with a known set of airports near the hometown and to try to cover as much distance as possible by plane (assuming the distance is above a certain threshold), and then to bridge the remaining distance by train or bus; if this fails, to backtrack. This shows that, although human problem solving usually considers one possibility (= tuple) at a time, in this case it is inherently based on a set-oriented model. With the means of the presented approach, such tasks can be formulated as query workflows. The backtracking is here replaced by a search strategy where the search space is explored stepwise and pruned based on intermediate results. While sources for train connections are usually able to return transitive connections, flight portals only return transitive connections over the flights of the same airline; thus, an actual graph exploration is required here. An important aspect that is typical for this application scenario (and many others) is that the search is subject to additional constraints, like arrival and departure times and the time required for changing.
Pitfalls

Experiences concerning conference travels reveal that travel agencies are often challenged with finding optimal connections to less standard destinations (e.g. St. Malo/France, as for ICLP 2004). Figure 9.1 shows a set of possible connections from the then-valid starting point "Göttingen" (denoted by a star) to the destination "St. Malo" (also denoted by a star) – solid lines indicate train connections, and dashed lines indicate flights:

1. the obvious choice would be to take the train connection "Göttingen" → "Frankfurt" ("FRA") → "Cologne" ("CGN") → "Brussels" ("BRU") → "Paris Est" → "Paris Montparnasse" → "Rennes" ("RNS") → "St. Malo". While being a valid connection, it is far from optimal, involves many changes, and runs the risk of missing the last connection from "RNS" to "St. Malo",
2. another choice would be to check (in parallel) flights leaving from "Hanover" ("HAJ") or "FRA" (two of the nearest airports to the starting point) to "Paris Charles de Gaulle" ("CDG") and then fly to "RNS" (or alternatively take the TGV),
3. the travel agency also seemed to follow the heuristics "choose an arbitrary well-known airport in France" and suggested a flight to "Bordeaux" ("BOD"),
4. a fourth choice, which cannot be found by a pure "tunnel search" strategy, is to use the unexpected intermediate airport "London Stansted" ("STN") to reach the rather unknown small airport at "Dinard" ("DNR"), which is less than 10 km from St. Malo and only reachable via the UK, or even
5. to fly to "Jersey Island" ("JER"), which is one of the (British) Channel Islands, and take the ferry to "St. Malo".

The above connections are only an excerpt of the whole search space, which theoretically grows exponentially (at each newly reached intermediate stop, a number of new choices are available, resembling a graph rooted at the starting point). As connections (4) and (5) suggest, it would not even be advantageous to try to save time by letting the user predefine the set of destination airports; instead, a fully algorithmic search that is not biased in any way should be used. To demonstrate that the approach proposed in this thesis outperforms the common travel agency (which at that time was, e.g., not able to come up with connections (4) and (5) at all), the evaluation in Section 9.6 – as well as the concrete configuration of the CGDT and the associated process – addresses the same setting. The expected answer is the set of the k best alternatives (wrt. a weighted function of price and duration), where each solution contains the actual connection data (e.g. departure/arrival times).

Chapter Structure

At the beginning of designing a query workflow, the required Web data views need to be acquired and selected, which is discussed in Section 9.2. The next step is to model the search space by providing an appropriate specification of the CGDT, potentially using Web data views for the automatic initialization of (vertex) properties, which is dealt with in Section 9.3. Finally, the query workflow needs to be specified (as a RelCCS process) that drives the exploration process, which is explained in Section 9.4 assuming an ideal world, i.e. a world in which Web data sources provide complete coverage of available connections.
[Figure 9.1: Overview of possible connections between Göttingen and St. Malo. The map shows the locations Göttingen, HAJ, FRA, CGN, BRU, STN, JER, DNR, CDG, Paris Est, Paris Montparnasse, Rennes, BOD, and St. Malo; the underlying base map is taken from http://maps.google.com.]
Section 9.5 discusses the applicability of the approach in a real-world setting, and Section 9.6 concludes this chapter with a presentation of experimental results.
9.2. Web Data Source Overview

For the travel planning scenario, the following six views are relevant:

1. (deptTime, arrTime, duration, price) ← getConnectionByDate(start, dest, date), which is the Deep Web view that has been introduced in Example 23 on page 74. It returns all (train) connections from start to dest on a specific date,
2. (connectionTo) ← getConnectedAirports(iataCode) is a derived view based on the SPARQL view getAirportInfoByIATACode that was introduced in Section 6.2 and that encompasses all available data (including the length of the runway) pertaining to a specific airport. The view getConnectedAirports returns the IATA codes of all airports
that can be reached from the provided input iataCode,
3. (gmtOffset) ← getTimezone(iataCode), which is also a derived view relying on the same base view getAirportInfoByIATACode; given an IATA code, it returns the associated GMT offset,
4. (ap) ← getAirports() is realized as a SPARQL view that returns the IATA codes of all airports world-wide,
5. (departureTime, arrivalTime, totalFare) ← getFlights(departure, arrival, goingDate), which is a Web Service view based on http://www.travenjoy.com that can be queried for available flights between the airports departure and arrival on a given date, and reports departure time, arrival time, and price,
6. (dist) ← distance(a, b) is not strictly a Web data view: it is realized as a pseudo Web Service, where the actual implementation is not provided remotely but locally as a Java method, which is transparent from the viewpoint of the framework. It computes the spherical distance between two points a and b (assuming that both a and b are unique identifiers, e.g. IATA codes); a sketch of such a method is shown below.

The upcoming sections leverage these six Web data views for the specification of the CGDT instance and the query workflow that drives the graph exploration.
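A minimal sketch of such a distance method (assuming the identifiers have already been resolved to geographic coordinates – the actual resolution step, e.g. from IATA codes, is not shown – and using the haversine formula on a spherical earth model):

// Illustrative great-circle distance via the haversine formula.
class Distance {
    private static final double EARTH_RADIUS_KM = 6371.0;

    // Distance in km between two points given as (latitude, longitude) in degrees.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }
}

Such an approximation is sufficient here, since the distances are only used for threshold tests and for ranking candidate airports, not for exact routing.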
9.3. The Graph Configuration

The CGDT distinguishes three different schemata: the VertexSchema that governs the properties and insertion conditions of vertices, the EdgeSchema that defines the properties and insertion conditions of edges, and the PathSchema that specifies the properties and extension conditions of paths. Additionally, the ResultSpecification needs to be provided in order for the graph to be able to automatically determine result paths. All definitions in this section are given in the pseudo syntax introduced in Chapter 8, for reasons of easier legibility and concise presentation.

Vertex Schema

The vertex schema has no insertion conditions and one user-defined property timezone of type xs:double that is dynamically set after the vertex is inserted, by evaluating the Web data view (3) from Section 9.2 (cf. Example 43 on page 177). Additionally, it has the predefined property id.

Edge Schema

The edge schema consists of seven properties (cf. Example 43 on page 177), of which four are user-defined:

1. id: the id of the edge (only used internally; predefined),
2. from: the origin of the connection (predefined),
3. to: the destination of the connection (predefined),
4. dept: the departure time wrt. the local timezone (of type xs:dateTime),
5. arr: the arrival time of the connection wrt. the local timezone (of type xs:dateTime),
6. price: the price of the connection (of type xs:double, in the currency €), and
7. duration: the duration of the connection, which is a derived property defined as duration = arr − dept + from.timezone − to.timezone. The − and + operations are here interpreted for the datatype of the duration, which is xs:duration.

The edge schema also has no insertion conditions in this case.

Path Schema

The path schema has nine properties, of which four are user-defined (cf. Example 43 on page 177):

1. id: the id of the path (only used internally; predefined),
2. from: the origin of the path (predefined),
3. to: the destination of the path (predefined),
4. front: a reference to the path that has been extended to this path (predefined),
5. last: a reference to the "last" edge in the path (predefined),
6. dept: the departure time of the whole path, defined as dept = front.dept; the data type is the same as for the edge departure: xs:dateTime,
7. arr: the arrival time of the whole path, defined analogously as arr = last.arr, also of type xs:dateTime. Note that while the departure time is passed along from the path of length n−1 to the path of length n (since it does not change), the arrival time of the path is determined by looking at its last edge,
8. price: the price, defined as an aggregation property of the form sum[e:edge](e.price) (= the sum of the prices of all edges of the path), of type xs:double, and
9. duration: the duration of the path, defined as duration = front.duration + last.duration + last.dept − front.arr, of type xs:duration. The intuition here is that the duration of a path of length n is the duration of the path of length n−1 plus the duration of the latest edge added to the path plus the time spent waiting for the departure of the last edge (note that last.dept and front.arr are both of type xs:dateTime, so even connections taking longer than 24 hours do not lead to erroneous results).

Additionally, the path schema has one path extension condition that makes sure that there is enough time to catch the next connection (e.g. to change between flights, or from a flight to a train, etc.); here, one hour is chosen (the details have already been highlighted in Example 47 on page 183 and are omitted here). To avoid result paths that are "similar", the subsumption criterion of Example 49 (page 185) is used for the path subsumption check. The graph is configured such that only acyclic paths are added (cf. Section 8.4) – from a theoretical point of view, the process would still terminate, since cyclic paths are ranked lower by the graph-internal top-k criteria (cf. the result specification below), but this configuration speeds up processing and avoids meaningless paths.

Result Path Specification and Termination

The result specification is identical to the one presented in Example 50 on page 187 (note that the configuration used for the benchmarks – cf. Section 9.6 – searches for the top-35 results instead of the top-10).
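To illustrate the timezone arithmetic behind the derived duration properties defined above, consider the following hypothetical Java fragment (in the CGDT itself, these properties are declared declaratively in the graph specification; the method and parameter names are illustrative):

import java.time.Duration;
import java.time.LocalDateTime;

// Illustrative computation of the derived duration properties.
class DurationProps {
    // Edge duration: arr - dept + from.timezone - to.timezone.
    // dept and arr are local times; the offsets (in hours) normalize the
    // difference to an absolute duration, e.g. dept 15:10 at offset +1 and
    // arr 18:10 at offset 0 yield PT4H.
    static Duration edgeDuration(LocalDateTime dept, double fromTz,
                                 LocalDateTime arr, double toTz) {
        return Duration.between(dept, arr)
                       .plusMinutes(Math.round((fromTz - toTz) * 60));
    }

    // Path duration: front.duration + last.duration + (last.dept - front.arr),
    // i.e. the waiting time at the change point is added; both timestamps are
    // local times at the same place, so no offset correction is needed here.
    static Duration pathDuration(Duration frontDuration, LocalDateTime frontArr,
                                 Duration lastDuration, LocalDateTime lastDept) {
        return frontDuration.plus(lastDuration)
                            .plus(Duration.between(frontArr, lastDept));
    }
}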
9.4. The Exploration Process

For illustration purposes, the evaluation of the process is described in this section assuming an ideal-world situation in which the Web data views (1) and (5) from Section 9.2 provide complete coverage of the available connections; Section 9.5 then describes the actual, refined process that underlies the experimental evaluation (cf. Section 9.6).

The workflow is started for given start, dest, and date, and proceeds as follows, implementing a breadth-first search for the top-k shortest paths: if the overall distance is less than 400 km, only train connections are searched for. Otherwise, train connections (for less than 1000 km) and flights are investigated. For the latter, the 10 nearest – i.e. the top-10 – airports to the starting point are selected by first querying view (4) for all airports, then determining the distance of each airport to the starting point with view (6), and retaining the 10 nearest airports by using the TopK operator. For these, the train connections are queried and added to the graph (with departure time, arrival time, and price).

With this situation, the iterative exploration starts: the remaining distance is bridged by connecting flights and, if necessary, a final train connection. For all vertices x (airports) that are relevant for the next exploration step (i.e. that became reachable by a possible connection in the previous step), the following is done:

• if the remaining distance from x to the final destination is below 100 km, an intensional edge is added that captures the availability of taxis (cf. Example 45 on page 180),
• if the remaining distance from x to the final destination is below 500 km, train connections from x to dest are searched and also added as edges to the graph,
• if the remaining distance from x to the final destination is more than 200 km, all airports y connected by direct flights from x (obtained from a query against view (2)) are selected. For each such connected pair (x, y), view (5) is queried for available flights. Each connection is added as an edge to the graph. The graph will only use such an edge to extend those paths that satisfy the path extension condition that the connection leaves at least one hour after the arrival at x. Note that if y is already in the graph and outgoing edges from y are already stored, further path extension is immediately applied to these.

With this, new vertices are added to the graph, and the iteration continues. The result specification guarantees that the process terminates when 10 connections to the destination have been found and all "open" paths are already more expensive than the best 10 known paths. This ensures that, e.g., better paths over four steps are still considered even if a more expensive connection has already been found over three steps.

The RelCCS Process

The above strategy is implemented as a query workflow in RelCCS. While in Sections 7.4.1 and 8.6 the XML syntax
of RelCCS was used for the Bacon Number use case, here an intuitive pseudocode notation is preferred. Also, the process shown below focuses on the main aspects of the graph exploration; the main reason is that the complete RelCCS process that was used for the benchmarks in Section 9.6 has approximately 1450 lines (vs. the 138 lines of the Bacon Numbers + CGDT use case), resulting in 34 pages of XML code. The process uses the variables start and dest (destination) and date (all initialized when calling the process), ap (relevant near airports), dist (distance to an airport), rd (remaining distance), dt and at (departure and arrival time), pr (price), and gid (graph id). The prefixes ccs and cgdt are used to indicate the respective languages. The iteration is encoded into a recursive process definition runGraph with local variables il, jl, rdl, dtl, atl, and prl. The main aspects of the RelCCS process are shown here:

 1  # process input: (start, dest, date)
 2  ccs:Seq(ccs:Query(rd ← distance(start, dest)),
 3    ccs:Query(gid ← cgdt:newGraph(rdf-spec)),  # the config spec discussed above
 4    ccs:Union(
 5      ccs:Seq(ccs:Test(rd < 1000),  # consider to go by train
 6        ccs:Query((dt, at, pr) ← getConnectionByDate(start, dest, date)),  # view (1)
 7        cgdt:addEdge(gid, start, dest, [dept ← dt, arr ← at, price ← pr])),
 8      ccs:Seq(ccs:Test(rd ≥ 400),  # consider also to use flights
 9        ccs:Query(ap ← getAirports()),  # all known airports, view (4)
10        ccs:Query(dist ← distance(start, ap)),  # compute distance, view (6)
11        ccs:TopK(10, 100, 100, dist, xs:double, asc, true),  # top-10 nearest airports
12        ccs:Query((dt, at, pr) ← getConnectionByDate(start, ap, date)),
13        cgdt:addEdge(gid, start, ap, [dept ← dt, arr ← at, price ← pr]),
14        ccs:Projection(gid, start, dest, date), ccs:Distinct,
15        ccs:CallProcess(runGraph[ ]))),
16    cgdt:dumpResultPathsAsHTMLTable()),
17  ccs:ProcessDefinition(runGraph[local : il, jl, rdl, dtl, atl, prl]) :=
18    # global vars gid, start, dest, date are known
19    ccs:Seq(
20      ccs:Query(il ← cgdt:newVerticesBFS(gid)),  # consider newly reached places
21      ccs:Query(rdl ← distance(il, dest)),
22      ccs:Union(
23        ccs:Seq(ccs:Test(il = dest)),  # no recursive call; in this case → return
24        ccs:Seq(ccs:Test(rdl < 100 ∧ il ≠ dest),  # reach destination by taxi
25          cgdt:addIntensionalEdge(il, dest, [. . .])),
26        ccs:Seq(ccs:Test(rdl < 500 ∧ il ≠ dest),  # reach destination by train
27          ccs:Query((dtl, atl, prl) ← getConnectionByDate(il, dest, date)),
28          cgdt:addEdge(gid, il, dest, [dept ← dtl, arr ← atl, price ← prl]),
29          ccs:Projection(gid, start, dest, date), ccs:Distinct,
30          ccs:CallProcess(runGraph[ ])),
31        ccs:Seq(ccs:Test(rdl ≥ 200),  # try to get even nearer by flight
32          ccs:Query(jl ← getConnectedAirports(il)),  # view (2)
33          ccs:Query((dtl, atl, prl) ← getFlights(il, jl, date)),  # view (5)
34          cgdt:addEdge(gid, il, jl, [dept ← dtl, arr ← atl, price ← prl]),
35          ccs:Projection(gid, start, dest, date), ccs:Distinct,
36          ccs:CallProcess(runGraph[ ])))))
37  # postcondition: connections to the destination, either by train or
38  # train + flight+, or train + flight+ + (train|taxi)
The workflow proceeds stepwise and set-oriented: consider starting the workflow with the single tuple (start/"Göttingen", dest/"St. Malo", date/"27.10.2009"). The query for the remaining distance (Line 2) extends the tuple with the fresh variable rd to (start/"Göttingen", dest/"St.Malo", date/"27.10.2009", rd/"813"). It then starts two branches, one for a train-only travel (since rd < 1000, Line 5), and one that also includes the consideration of flight connections (since rd is also ≥ 400, Line 8); the results of both are stored in the graph and will be reported at the end.

The first branch adds multiple direct train connections from "Göttingen" to "St. Malo" to the graph. Consider the edges e4 – e8 that are shown in Table 9.1. Each one of them results in a candidate path of length 1. Since all of these edges (and others as well) were added in the same "batch", they are subject to the path subsumption check. Let pe4, . . . , pe8 be the associated path candidates for the edges e4, . . . , e8. Since these paths are all comparable to each other, the description of the calls to sameRoute(p1, p2) is omitted, and the discussion starts with the check pathPrunable(pe4, pe5). Recall that the subsumption criterion from Example 49 (page 185) is used, which intuitively states that a path is subsumed if it starts earlier, arrives later, and is not significantly cheaper; pe4 starts earlier than pe5 but also arrives earlier than pe5, and pe4 is thus not subsumed by pe5 (note that pe5 is also not subsumed by pe4, since it, e.g., starts later). Eventually, the check pathPrunable(pe6, pe7) is evaluated: pe6 starts earlier and arrives later than pe7 and is only marginally cheaper; pe6 is thus subsumed by pe7 and not added to the graph. It can easily be verified that none of the other paths is subsumed, and they are all inserted into the path table accordingly (cf. Table 9.2 on page 217).
Now, observe the second branch: it will first evaluate a query for all known airports (Line 9), which results in a set of (hundreds of) tuples, each of which is extended in the subsequent step (Line 10) with the distance between start and the respective airport, i.e.

{(start/"Gö.", dest/"St.Malo", date/". . . ", rd/"813", ap/"FRA", dist/"250"),
 (start/"Gö.", dest/"St.Malo", date/". . . ", rd/"813", ap/"HAJ", dist/"100"),
 (start/"Gö.", dest/"St.Malo", date/". . . ", rd/"813", ap/"MLH", dist/"550"),
 . . .
 (start/"Gö.", dest/"St.Malo", date/". . . ", rd/"813", ap/"JFK", dist/"6230"), . . . }
The following TopK step (Line 11) keeps the 10 nearest ones, amongst them "FRA" (Frankfurt), "HAJ" (Hanover), and "MLH" (Basel-Mulhouse), but e.g. not "JFK" (New York). For
these, the next step (Line 12) looks up (multiple) train connections for each of them, binding the variables dt, at, and pr, and for each tuple, the connection is put into the graph (Line 13 – the destinations are also added to the vertex table, including the lookup of the timezones), and the tuples are projected back down to (start, dest, date) and duplicates are removed (Line 14), i.e. only the single tuple (start/"Gö.", dest/"St.Malo", date/"27.10.2009") remains. Then, the process definition for runGraph is invoked (Line 15). In its first step, the new vertices, which are the 10 nearest airports, are retrieved from the graph and bound to the variable il (= intermediate, Line 20):

{(start/"Gö.", dest/"St.Malo", date/"27.10.2009", il/"FRA"),
 (start/"Gö.", dest/"St.Malo", date/"27.10.2009", il/"HAJ"), . . . }   (10 tuples)
The subsequent iterations extend the graph in parallel (i.e., set-oriented for all tuples) by breadth-first search until connections to dest are found. This process is strongly based on the configured functionality of the graph service. Recall that when a reached airport is less than 500km away from the destination (these are e.g. the airports "RNS" (Rennes, 60km), "DNR" (Dinard, 10km), "JER" (Jersey, 60km), and "CDG" (Paris CDG, 320km), cf. Figure 9.1), train connections to St. Malo are investigated; when it is closer than 100km to the destination (as is e.g. the case for "RNS", "DNR", or "JER"), taxi connections are also considered (as intensional edges).

Tables 9.1 and 9.2 (see page 217) illustrate some of the vertices, edges, and paths stored in the graph and their evolution. The sample shows that an edge can be used to extend multiple existing paths ("CDG" [14:40] – "StM" ("St. Malo") [19:20] by train (edge e74), "CDG" [16:40] – "RNS" [17:50] by plane (edge e75)) (i). Amongst the first solutions found (after three steps) there is the fastest one, but not the cheapest one (ii). (iii) illustrates that when "RNS" is expanded for the first time, the respective (intensional) taxi edges ("RNS" [15:30] – "StM" [16:20] (edge e97) and "RNS" [19:30] – "StM" [20:20] (edge e98)) are immediately appended to all paths that end in "RNS". So, these four-step paths are already created in the third round. To guarantee completeness, paths all around the world are explored until they are more expensive than the top-k best results.

Surprisingly, the fastest connection is via the quite distant airport Basel-Mulhouse ("MLH"), which features a direct connection to Rennes by the regional French carrier "Airlinair". The cheapest (and still reasonably fast) one is by Ryanair via London-Stansted ("STN") to the rather unknown Dinard ("DNR", which only has connections to the UK). The straightforward connections via Hanover ("HAJ") or Frankfurt ("FRA") and Paris ("CDG") are expensive and time-consuming.

For this application scenario, breadth-first search investigates more edges than A* would do, but for longer travels, first results are returned earlier (A* would tend to first combine short, cheap steps without really bridging distance). For changing to A*, the only change to the process definition is that instead of the call to cgdt:newVerticesBFS(gid) (Line 20), the accessor cgdt:nextVertexAStar(gid) has to be invoked.
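The two accessors differ only in how the next vertices to expand are drawn from the graph. The following Python sketch contrasts them under strong simplifying assumptions (the graph service is modeled as a plain in-memory class; insertion conditions, path bookkeeping, and the set-oriented batching across branches are omitted):

import heapq
from collections import deque

class Frontier:
    """Illustration of the two exploration accessors."""

    def __init__(self):
        self.batch = deque()  # BFS: newly reached vertices of the last round
        self.heap = []        # A*: vertices ordered by cost so far + estimate

    def add(self, vertex, cost_so_far=0.0, estimate_to_dest=0.0):
        self.batch.append(vertex)
        heapq.heappush(self.heap, (cost_so_far + estimate_to_dest, vertex))

    # cgdt:newVerticesBFS-style: hand out the whole batch, so all branches
    # can extend the graph set-oriented and in parallel.
    def new_vertices_bfs(self):
        out, self.batch = list(self.batch), deque()
        return out

    # cgdt:nextVertexAStar-style: hand out the single most promising vertex.
    def next_vertex_astar(self):
        _, vertex = heapq.heappop(self.heap)
        return vertex

In both cases the exploration can stop once every remaining frontier path is already more expensive than the current top-k results, which is the completeness bound described above.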
id      timezone
"Gö"    "+1"
"HAJ"   "+1"
"FRA"   "+1"
"MLH"   "+1"
"CDG"   "+1"
"JER"   "0"
...     ...

id    from    to      dept     arr      price     duration
e1    "Gö"    "HAJ"   "7:50"   "8:50"   "49.00"   "01:00"
e2    "Gö"    "FRA"   "7:30"   "9:30"   "89.00"   "02:00"
e4    "Gö"    "StM"   "6:55"   "17:30"  "225.00"  "10:35"
e5    "Gö"    "StM"   "10:05"  "20:30"  "240.00"  "10:25"
e6    "Gö"    "StM"   "13:00"  "23:55"  "248.00"  "10:55"
e7    "Gö"    "StM"   "13:20"  "23:40"  "250.00"  "10:20"
e8    "Gö"    "StM"   "5:30"   "15:40"  "280.00"  "10:10"
e12   "Gö"    "MLH"   "6:30"   "11:20"  "129.00"  "04:50"
e13   "HAJ"   "CDG"   "10:00"  "11:20"  "229.00"  "01:20"
e15   "HAJ"   "JER"   "13:50"  "14:30"  "536.00"  "01:40"
e43   "FRA"   "CDG"   "11:50"  "13:00"  "239.00"  "01:10"
e74   "CDG"   "StM"   "14:40"  "19:20"  "97.00"   "04:40"
e75   "CDG"   "RNS"   "16:40"  "17:50"  "241.00"  "01:10"
e97   "RNS"   "StM"   "15:30"  "16:20"  "60.00"   "00:50"
e98   "RNS"   "StM"   "19:30"  "20:20"  "60.00"   "00:50"
...   ...     ...     ...      ...      ...       ...

Table 9.1.: CGDT contents, vertex table (top) – edge table (bottom)
9.5. The Exploration Process (Revisited)

The previous section illustrated the key points of the exploration of the search space based on the assumption that the Web data sources are complete with respect to the available connections. Note that although these assumptions are not met exactly in real life, this is not a flaw in the approach itself; the analogy would be that the employee in the travel agency also needs to rely on the completeness of her information sources. Overall, the query workflow (and the graph configuration) that has actually been implemented and is used to gain insights into the applicability of the approach to a real-world situation differs in the following points from the ideal-world scenario that was assumed in Section 9.4 (the reason for this separation was the desire to show the strength of the approach itself without "external influences" that bias the exploration, as well as to show the applicability of the solution to real-world situations):
• error handling: when the German railways portal (http://www.bahn.de, view (1)) is queried with ambiguous starts or destinations, it shows an intermediate page with a list of possible choices. The list is sorted by some internal relevance criteria, i.e., it is not sorted alphabetically, and the DWQL engine uses the support for intermediate pages (cf. Section 6.3.2) to access the result page with the provided default selection. Obviously, this can lead to unexpected or erroneous results, which are handled by a heuristic (realized as an edge insertion condition in the graph) that takes into account the means of transport, the distance between source and destination, and the expected travel time. If the heuristic detects that an edge is likely to be erroneous, it is rejected by the edge insertion condition,

• missing values: some values are mandatory in the sense that they are used to define a derived property or a condition. An example is the price of a connection, which is e.g. relevant for the top-k valuation function of the graph. The German railways portal (view (1)) only returns pricing information for trains that start and end in Germany, but nonetheless returns connection information for most European countries. In these cases, the according graph property price is initialized with a heuristic estimate based on the distance (in km) and an average price of 0.25 €/km (a sketch of both heuristics follows after this list),

• considered airports: as an optimization, not only are the top-10 airports closest to the start considered, but also only the top-10 airports closest to the destination. Note that the "k" has been chosen by iterative refinement with different values. The idea is to avoid the blow-up caused by airports with many outgoing edges, such as Frankfurt ("FRA"), which has world-wide connections to > 200 destinations. Another reason for this choice was to avoid "unnecessary" calls to the external service provider http://www.travenjoy.com (view (5)). This can be included straightforwardly in the process fragment in Section 9.4 by adding a call to view (6) after Line 32 that computes the distances of all relevant airports to the destination and then applying the TopK operator on the result. Note that the existing workflow does not need to be modified; only additional "lines" are added to the workflow,

• intensional edges: the discussion in Section 9.4 expected every airport to have an associated train station. While this is true for some airports, e.g. "FRA", many airports are outside of cities and can only be reached by shuttle buses or taxis, e.g. Basel-Mulhouse ("MLH"). To cover these cases, for each such airport the closest train station is determined via the AirDB (cf. Example 33 on page 110), and an intensional edge is added that represents a taxi from the train station to the airport or vice versa. Finally, as already mentioned in the last section, an intensional edge is also added if the so-far reached destination is closer than 100km to the final destination. The price for taxis is determined similarly to missing prices for train connections (but with a more expensive estimate of 1.0 €/km).
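Both the edge-rejection heuristic and the price initialization reduce to simple distance-based rules. A Python sketch, where the per-km rates are the ones named above, but the speed bounds and all function names are illustrative assumptions:

# Per-km price estimates for edges where the source returns no price
# (trains outside Germany) and for intensional taxi edges.
RATES_EUR_PER_KM = {"train": 0.25, "taxi": 1.00}

def estimate_price(distance_km: float, means: str) -> float:
    """Initialize the graph property `price` when the source omits it."""
    return round(distance_km * RATES_EUR_PER_KM[means], 2)

def plausible_edge(means: str, distance_km: float, travel_hours: float) -> bool:
    """Edge insertion condition against erroneous default selections:
    reject edges whose implied speed is impossible for the given means
    of transport (the bounds are illustrative, not the thesis's values)."""
    max_kmh = {"train": 300.0, "plane": 1000.0, "taxi": 130.0}[means]
    return travel_hours > 0 and distance_km / travel_hours <= max_kmh

print(estimate_price(813.0, "train"))  # 203.25, e.g. for a cross-border leg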
9.6. Experimental Evaluation

System Configuration

The MARS system itself is highly decoupled, and all language processors can run on different machines if necessary. For the conducted experiments, a configuration with three machines was chosen, where all MARS-specific services run on the same computer (calvin). Figure 9.2 further illustrates the flow of communication between the machines. They are configured as follows (the hard drive sizes are only mentioned for reasons of completeness; the computers are connected via a Gigabit Ethernet connection):

1. calvin: the main part runs on this PC with a Pentium Dual Core E2180 processor (2.0 GHz), 6GB main memory, and a 250GB hard disk drive,

2. gemma: the Deep Web navigation component of the DWQL engine runs on a separate machine (cf. Section 6.3.4). The main reason is that the browser needs to be running for the navigation to work, and the management of several tabs for parallel navigation requires a considerable amount of CPU time. It is a PC with a Pentium Core 2 Duo E6400 processor (2.13 GHz), 3GB main memory, and a 500GB hard drive,

3. bali: the Oracle DB stores the graph instance data and runs on a Dell PowerEdge 1850 server with two XEON processors (3.2 GHz), 4GB main memory, and a 73GB hard disk drive.

[Figure 9.2: System configuration (overview) – bali hosts the Oracle DB; calvin hosts RelCCS (MARS), the CGDT engine, the DWQL engine (master), and the WSQL engine; gemma hosts the DWQL engine (DWN).]
Evaluation Results

The results discussed in the following are not meant to be understood as benchmark results in the classical sense (i.e., how well does the implementation perform in different settings, how does it scale with respect to different input sizes, etc.). Instead, they are used to gain insights into the critical aspects during the evaluation, and they are intended as a proof of concept of the applicability of the approach. The overall runtime (averaged over three runs – cf. Table 9.3) is roughly 14.5 h, i.e., if the query workflow is started in the evening, the results will be available the next morning.
Compared to the alternative, which involves manually collecting and assembling the data, this is a great improvement.

Table 9.4 shows that view (1) takes on average ∼20 seconds to return results (including navigation and Web data extraction time). In total, ∼70 train connections are requested, whereas ∼575 flight connections are tried (cf. Table 9.5). The views (2 – 4) are implemented as SPARQL views, and consequently their invocations are summed up. In this respect, the evaluation time is a measure of the speed of the underlying SPARQL engine (plus additional overhead for marshalling and unmarshalling inputs and results as MARS variable bindings), which explains the response time of ∼0.5 seconds per query. Where the execution time of view (1) is a good indicator for the evaluation of a Deep Web view, the times reported for view (5) (∼4 seconds) can be seen as a baseline for evaluating Web Service views (there are no results for view (6) – the spheric distance computation – since it was realized natively). Taken together, the DWQL and WSQL engine results form the "socket" execution time of ∼2 hours, which can only be marginally improved upon, because the external Web data sources (views (1) and (5)) cannot be influenced. This entails that real-time answers (i.e., in the range of ∼30 seconds) cannot be guaranteed for query workflows that rely on Web data sources if the exploration is to be done online.

Table 9.6 depicts the salient key numbers for the CGDT implementation. The most interesting insights are that ∼145 places are encountered during the exploration (= #(vertices)) and that ∼3100 possible connections (= #(ext. edges) + #(int. edges)) between these places are investigated, resulting in ∼32000 paths (= #(paths)) that "survive" the subsumption criteria and satisfy the path extension condition. Even processing a subset of these possibilities would take a travel agency employee at least a week and would require a profound knowledge of the available means of transport. Another lesson that can be learnt from Table 9.6 is that the major part of the evaluation time (> 60%) is spent on updating the paths, which is mainly due to the costly subsumption check that is computed after a batch of edges is inserted. Note that the overall time spent on managing the CGDT instance is not the sum of all total times in Table 9.6 (e.g., the time for adding new vertices is contained in the sum of adding internal and external edges); the closest approximation is the sum of the total time spent on adding edges and updating paths (i.e., > 70% of the overall execution time is due to graph manipulations).
A Note on Top-k and Path Subsumption

At first sight it is surprising to see that the numbers of encountered vertices, edges, and paths differ among the three runs more than expected if the only reason were the changing availability of flights (cf. Table 9.6). The main reason for the differing numbers of vertices and edges is that the process is running asynchronously, i.e., several branches are running in different stages of the recursion.
Consequently, each branch adds new edges to the graph in parallel, which in turn results in newly available vertices. If now another branch enters the beginning of the recursion and queries for newly available vertices (Line 20 in Section 9.4), the set of returned vertices contains vertices that have become newly reachable by this branch, as well as by all other branches. Note that this is highly dependent on the response times of the Web data sources and on the execution speed of the different branches in general. This non-determinism results in a different set of tuples that is considered for the top-10 nearest airports to the destination (cf. Section 9.5) in each run – the TopK operator is evaluated locally to this set of tuples – and thus the number of vertices (e.g., new airports) that are considered differs between runs, and so do the number of totally investigated vertices and the number of resulting edges.

The above-mentioned observations regarding the number of vertices and edges also directly influence the path subsumption computation. Because different numbers of edges are added in each run by each branch, the path subsumption operates on different sets of path candidates each time, which in turn yields different subsumed paths, i.e., a different number of paths is considered in each run. Note that the final results only depend on the availability of flights and are almost identical for each run, including the ranking of the found solutions.

Beyond the Numbers

In Section 9.1 the claim was made that the query workflow discussed in this chapter would "outperform" a travel agency in terms of diligence and precision, and find the best possible solutions under the (closed-world) assumption that the selected Web data sources provide complete coverage of the available connections. This claim has been substantiated in Section 9.4 with a walkthrough of the travel planning workflow. Table 9.7 shows an excerpt of two of the top-35 results automatically obtained by running the described query workflow. The best solution found is indeed one of those the travel agency was not aware of at all. Also, the intuitive solution via the intermediate airport "CDG" (choice (2) in Section 9.1) is not among the top-10 options. Clearly, the one-time experience is not representative of all available travel agencies, and another agency might have been able to find similarly acceptable solutions. Yet, the application scenario illustrates that query workflows offer a powerful and expressive formalism with a concise declarative semantics that is well suited for solving challenging real-world problems, such as travel planning. In cases where the search space is graph-structured and some kind of exploration strategy is required, the CGDT abstraction additionally helps to define the information need in a natural way by describing the properties of the result via CGDT concepts, such as insertion conditions and inductive property definitions.
id    front  last  (comment)                               ...  dept     arr      price     #
p1    nil    e1    "Gö" → "HAJ"                            ...  "7:50"   "8:50"   "49.00"   –
p3    nil    e2    "Gö" → "FRA"                            ...  "7:30"   "9:30"   "89.00"   –
p5    nil    e4    "Gö" → "StM"                            ...  "6:55"   "17:30"  "225.00"  –
p7    nil    e5    "Gö" → "StM"                            ...  "10:05"  "20:30"  "240.00"  –
p10   nil    e7    "Gö" → "StM"                            ...  "13:20"  "23:40"  "250.00"  –
p11   nil    e8    "Gö" → "StM"                            ...  "5:30"   "15:40"  "280.00"  –
p12   nil    e12   "Gö" → "MLH"                            ...  "6:30"   "11:20"  "129.00"  –
p26   p1     e13   "Gö" → "HAJ" → "CDG"                    ...  "7:50"   "13:30"  "278.00"  –
p28   p1     e15   "Gö" → "HAJ" → "JER"                    ...  "7:50"   "14:30"  "585.00"  –
p33   p1     e20   "Gö" → "HAJ" → "STN"                    ...  "7:50"   "11:50"  "129.00"  –
p48   p3     e43   "Gö" → "FRA" → "CDG"                    ...  "7:30"   "13:00"  "328.00"  –
p53   p3     e48   "Gö" → "FRA" → "JFK"                    ...  "7:30"   "17:00"  "628.00"  –
p63   p12    e61   "Gö" → "MLH" → "RNS"                    ...  "7:30"   "14:10"  "568.00"  –
p84   p26    e74   "Gö" → "HAJ" → "CDG" → "StM"            ...  "7:50"   "19:20"  "375.00"  (i)
p85   p48    e74   "Gö" → "FRA" → "CDG" → "StM"            ...  "7:30"   "19:20"  "425.00"  (i)
p86   p26    e75   "Gö" → "HAJ" → "CDG" → "RNS"            ...  "7:50"   "17:50"  "519.00"  (i)
p87   p48    e75   "Gö" → "FRA" → "CDG" → "RNS"            ...  "7:30"   "17:50"  "569.00"  (i)
p93   p28    e84   "Gö" → "HAJ" → "JER" → "StM"            ...  "7:50"   "17:30"  "628.00"  –
p98   p33    e91   "Gö" → "HAJ" → "STN" → "DNR"            ...  "7:50"   "15:30"  "208.00"  –
p100  p53    e48   "Gö" → "FRA" → "JFK" → "ALB"            ...  "7:30"   "19:30"  "817.00"  –
p101  p63    e97   "Gö" → "MLH" → "RNS" → "StM"            ...  "7:30"   "16:20"  "628.00"  (ii)
p103  p86    e98   "Gö" → "HAJ" → "CDG" → "RNS" → "StM"    ...  "7:50"   "20:20"  "579.00"  (iii)
p104  p87    e98   "Gö" → "FRA" → "CDG" → "RNS" → "StM"    ...  "7:30"   "20:20"  "629.00"  (iii)
p116  p98    e104  "Gö" → "HAJ" → "STN" → "DNR" → "StM"    ...  "7:50"   "17:00"  "218.00"  –
...   ...    ...   ...                                     ...  ...      ...      ...       ...

Table 9.2.: CGDT contents, path table
                       Run 1               Run 2               Run 3
Total time (walltime)  14h 6min 6.566sec   11h 52min 11.35sec  17h 8min 16.59sec

Table 9.3.: Evaluation results (overview)

                       Run 1               Run 2               Run 3
#(view (1))            75                  68                  79
Time (total)           31min 46.483sec     21min 28.246sec     25min 37.773sec
Time (avg.)            25419.77ms          18944.79ms          19465.48ms

Table 9.4.: Evaluation results (DWQL engine)

                           Run 1              Run 2              Run 3
Time total                 1h 5min 4.265sec   1h 6min 16.313sec  1h 3min 32.12sec
#(views (2 – 4))           449                474                474
Time avg. (views (2 – 4))  589.15ms           595.5ms            586.34ms
#(view (5))                569                560                616
Time avg. (view (5))       4353.91ms          3998.76ms          4397.61ms

Table 9.5.: Evaluation results (WSQL engine)

                 Run 1               Run 2              Run 3
#(vertices)      142                 153                153
Time (total)     1h 6min 6.391sec    44min 30.588sec    40min 14.484sec
Time (avg.)      27932.33ms          17454.82ms         15780.94ms
#(bfs)           5                   5                  7
Time (total)     117ms               94ms               103ms
Time (avg.)      23.4ms              18.8ms             14.7ms

                 Run 1               Run 2              Run 3
#(ext. edges)    3257                2966               3323
Time (total)     1h 7min 44.437sec   45min 59.609sec    41min 40.373sec
Time (avg.)      1247.91ms           930.41ms           752.44ms
#(int. edges)    15                  15                 15
Time (total)     2h 10min 4.851sec   1h 10min 1.61sec   1h 28min 32.204sec
Time (avg.)      520323.4ms          280107.33ms        354146.93ms

                 Run 1               Run 2              Run 3
#(paths)         32492               29297              34032
Time (total)     9h 13min 21.168sec  7h 25min 6.423sec  9h 37min 59.117sec
Time (avg.)      1021.83ms           911.58ms           1019.01ms

Table 9.6.: Evaluation results (CGDT engine) – vertex (top), edge (middle), path (bottom)
#   from          to           dept                    arr                     price     duration
1   "Göttingen"   "St. Malo"   "2009-10-27T09:56:00"   "2009-10-27T19:25:00"   "289.90"  "PT9H29M"

    Path consists of the following edges:
    #  from              to                dept                    arr                     price     ...
    1  "Göttingen"       "Hannover (Hbf)"  "2009-10-27T09:56:00"   "2009-10-27T10:32:00"   "32.00"   ...
    2  "Hannover (Hbf)"  "HAJ"             "2009-10-27T10:47:00"   "2009-10-27T10:57:00"   "10.00"   ...
    3  "HAJ"             "JER"             "2009-10-27T13:55:00"   "2009-10-27T18:00:00"   "177.90"  ...
    4  "JER"             "St. Malo"        "2009-10-27T18:15:00"   "2009-10-27T19:25:00"   "70.00"   ...

#   from          to           dept                    arr                     price     duration
12  "Göttingen"   "St. Malo"   "2009-10-27T10:55:00"   "2009-10-27T22:32:00"   "356.02"  "PT11H37M"

    Path consists of the following edges:
    #  from          to           dept                    arr                     price     ...
    1  "Göttingen"   "FRA"        "2009-10-27T10:55:00"   "2009-10-27T12:44:00"   "58.00"   ...
    2  "FRA"         "CDG"        "2009-10-27T14:25:00"   "2009-10-27T18:00:00"   "86.69"   ...
    3  "CDG"         "RNS"        "2009-10-27T20:00:00"   "2009-10-27T21:15:00"   "148.88"  ...
    4  "RNS"         "St. Malo"   "2009-10-27T21:30:00"   "2009-10-27T22:32:00"   "62.45"   ...
...

Table 9.7.: Top-35 connections from Göttingen to St. Malo (excerpt)
Part V. Discussion
Chapter 10. Conclusion

Imagine a world in which every single person on the planet is given free access to the sum of all human knowledge.
– Jimmy "Jimbo" Wales (born August 8, 1966)
Contents
10.1. Summary . . . 223
10.2. Perspectives . . . 224
10.1. Summary

With the constantly evolving nature of the World Wide Web, new solutions that build on top of this infrastructure need to be agile to meet the constantly changing requirements of this technological environment. The strategy chosen in this thesis to address this concern is to build upon the MARS framework, which is itself highly agile in the sense that it embraces heterogeneous languages as a core aspect of its architecture. Another aspect is the highly decoupled nature of the proposed approach.

The non-uniform shape of the Web with respect to how the available data is represented, and the varying degrees to which it is already machine-accessible, motivated the generic notion of a Web data source that hides the lower-level technical details of the underlying source. This abstraction has been specified without a binding to a concrete implementation or style of programming, but in terms of the Web Data Source Description Language (WDSDL) ontology. This way, Deep Web sources, Web Services, but also SPARQL endpoints, among others, are subsumed by this notion, concealing the "bumps" in the shape of the (data) Web from the external caller. If the past is an indicator of the future, the rapid development and adoption of new paradigms will continue on the Web, and new ways of accessing and publishing data will be adopted in the near future. By being agile, these new kinds of data sources can easily be integrated on the conceptual level as "yet another Web data source".

The next layer operating on top of Web data sources is concerned with the combination of information from different, heterogeneous Web data sources.
Here, one key aspect was the extension of the process algebra CCS [Mil83] with relational dataflow, resulting in the versatile RelCCS language. RelCCS not only has a well-founded data model, but it also introduces further operators from the relational realm, such as grouping, the possibility to express negation, and the top-k operator that facilitates declarative data-dependent optimizations at runtime. In terms of expressive power, RelCCS is a Turing-complete language by itself (i.e., without the need for embedding external Turing-complete languages via MARS mechanisms), and it supports the definition of recursive processes, where a part of the variables is defined as local to the process fragment and may be assigned new values in each invocation.

The configurable graph datatype (CGDT) was designed for the modeling of graph-structured search spaces, where a typical application is the (partial) exploration of the transitive closure. Equipped with the possibility to assign properties to vertices, edges, and paths, it has featured prominently in the application scenario in Chapter 9. Another salient aspect of the CGDT is the possibility to define derived properties in an inductive manner and to guide the instantiation of new elements in the graph by providing appropriate insertion and extension conditions.

To conclude, (for most people) the major hurdle to meaningful access to data on the Web is essentially not whether it is freely available (while this is admittedly a necessary prerequisite). There already is a significant amount of data that is freely available on the Web and that theoretically is at the fingertips of the user, but often the combination of even a small amount of this data is beyond a human being. This thesis presented an approach that enables users to make better use of this already available data.
10.2. Perspectives

There are several directions in which this thesis can be extended and that are currently being investigated:

• SPARQL over Web Data Sources: in Chapter 4, Web data sources have been described with respect to the WDSDL ontology. These semantic annotations cover the technical handling of (wrapped) sources and the description of their input and output characteristics in terms of tags. In [HM09c], an extended version of the WDSDL ontology that also supports the connection of these tags to concepts of (RDF(S)/OWL) domain ontologies, as well as initial ideas for evaluating SPARQL [PS08] queries on top of Web data sources, have been presented. The pursued approach is to build on the large body of work on query answering using views [Ull97, Hal01, Len02] and on querying data under access limitations [LC01, Li03, NL04b, YKC06, CM08a, CM10], and to extend the approaches presented therein with the notion of a strategy, in order to facilitate the automatic generation of elaborate query workflows for answering SPARQL queries, such as those required for application areas like travel planning.
• Process Modeling Support: while the modeling of query workflows currently has to be done by a skilled process designer in cooperation with domain experts, a visual representation of the language is envisioned, together with an accompanying development environment. In line with this effort, it is currently being evaluated whether earlier work on providing process designers with recommendations during modeling, reported in [HKL08], can be transferred to RelCCS. Lastly, as another important aspect of any framework, a comprehensive monitoring and debugging environment for RelCCS processes is planned as future work.

• CGDT: as has been noted in the area of graph transformations [BH02], graphs are an extremely expressive data model and can be applied to a whole range of application areas. Consequently, one future direction is to explore the applicability of the CGDT in different settings – an optimized implementation of the CGDT is currently being pursued, since the experimental results of the travel planning scenario revealed graph-related operations as the major bottleneck.

• Exploitation: during the course of this thesis, the travel planning application scenario has proven to be an inspiration for improvements of several different aspects of the proposed approach, and it still continues to do so. Another result of the experimental evaluation of the application scenario was that the lower bound for the execution of query workflows in an online manner is still in the range of approximately two hours, because a significant number of calls to external service providers is required to gather the necessary data. One way to speed up processing would be to "soften" the online criterion by building up a long-term cache of available flights and their prices, under the assumption that the relationship between prices stays roughly constant, i.e., if at a certain date it was cheaper to fly from "Frankfurt Airport" ("FRA") to "London Heathrow" ("LHR") than e.g. to "Paris Charles de Gaulle" ("CDG"), it most likely is still cheaper at present. This way, a classical shortest path algorithm could be used to prune the search space to the top-k most likely connections, and then only those connections would be queried for new pricing information and availability (a sketch of this idea follows below). The ultimate goal of this line of research could be the creation of a commercial travel planning system that could either be used inside travel agencies or be offered as an online service.
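To make the caching idea concrete, consider the following Python sketch: a long-term price cache is explored with a classical Dijkstra search, and only the resulting candidate routes would then be re-queried online. The cache contents and all names are hypothetical; this is an illustration of the envisioned pruning step, not an implemented component:

import heapq

# Hypothetical long-term cache: (from, to) -> last observed price in EUR.
PRICE_CACHE = {
    ("FRA", "LHR"): 79.0,
    ("FRA", "CDG"): 119.0,
    ("LHR", "DNR"): 45.0,
    # ... collected from earlier workflow runs
}

def cheapest_cached(start, dest):
    """Dijkstra over cached prices; assumes relative prices are stable."""
    best = {start: 0.0}
    heap = [(0.0, start, [start])]
    while heap:
        cost, node, route = heapq.heappop(heap)
        if node == dest:
            return cost, route  # candidate to re-query online
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for (u, v), price in PRICE_CACHE.items():
            if u == node and cost + price < best.get(v, float("inf")):
                best[v] = cost + price
                heapq.heappush(heap, (cost + price, v, route + [v]))
    return None

print(cheapest_cached("FRA", "DNR"))  # (124.0, ['FRA', 'LHR', 'DNR'])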
Appendix A. Configurable Graph Data Type

A.1. CGDT RDFS Ontology

@prefix cgdt:  <...> .
@prefix mars:  <...> .
@prefix rdf:   <...> .
@prefix rdfs:  <...> .
@prefix xs:    <...> .

# Classes
cgdt:Graph a rdfs:Class.
cgdt:Schema a rdfs:Class.
cgdt:Expression a rdfs:Class.
cgdt:ResultSpecification a rdfs:Class.

cgdt:CycleFreeGraph rdfs:subClassOf cgdt:Graph.
cgdt:Condition rdfs:subClassOf cgdt:Expression.
cgdt:VertexInsertionCondition rdfs:subClassOf cgdt:Condition.
cgdt:VertexPropertyCondition rdfs:subClassOf cgdt:VertexInsertionCondition.
cgdt:EdgeInsertionCondition rdfs:subClassOf cgdt:Condition.
cgdt:EdgePropertyCondition rdfs:subClassOf cgdt:EdgeInsertionCondition.
cgdt:PathExtensionCondition rdfs:subClassOf cgdt:Condition.
cgdt:PathInductiveExtensionCondition rdfs:subClassOf cgdt:PathExtensionCondition.
cgdt:Property rdfs:subClassOf rdf:Property.
cgdt:AggregateProperty rdfs:subClassOf cgdt:Property.
cgdt:SumProperty rdfs:subClassOf cgdt:AggregateProperty.
cgdt:CountProperty rdfs:subClassOf cgdt:AggregateProperty.
cgdt:MinProperty rdfs:subClassOf cgdt:AggregateProperty.
cgdt:MaxProperty rdfs:subClassOf cgdt:AggregateProperty.
cgdt:InductiveProperty rdfs:subClassOf cgdt:Property.
cgdt:VertexSchema rdfs:subClassOf cgdt:Schema.
cgdt:EdgeSchema rdfs:subClassOf cgdt:Schema.
cgdt:PathSchema rdfs:subClassOf cgdt:Schema.

# Properties
cgdt:hasProperty rdfs:domain cgdt:Schema; rdfs:range cgdt:Property.
cgdt:hasSchema rdfs:domain cgdt:Graph; rdfs:range cgdt:Schema.
cgdt:range rdfs:domain cgdt:Property; rdfs:range xs:anyType.
cgdt:label rdfs:domain cgdt:Property; rdfs:range xs:string.
cgdt:definition rdfs:domain cgdt:Property; rdfs:range cgdt:Expression.
cgdt:language rdfs:domain cgdt:Expression; rdfs:range mars:Language.
cgdt:language rdfs:domain cgdt:Expression; rdfs:range xs:string.
cgdt:baseProperty rdfs:domain cgdt:AggregateProperty; rdfs:range xs:string.
cgdt:baseCase rdfs:domain cgdt:InductiveProperty, cgdt:PathInductiveExtensionCondition; rdfs:range cgdt:Expression.
cgdt:inductiveCase rdfs:domain cgdt:InductiveProperty, cgdt:PathInductiveExtensionCondition; rdfs:range cgdt:Expression.
cgdt:hasVertexSchema rdfs:domain cgdt:Graph; rdfs:range cgdt:VertexSchema; rdfs:subPropertyOf cgdt:hasSchema.
cgdt:hasVertexInsertionCondition rdfs:domain cgdt:VertexSchema; rdfs:range cgdt:VertexInsertionCondition.
cgdt:hasEdgeInsertionCondition rdfs:domain cgdt:EdgeSchema; rdfs:range cgdt:EdgeInsertionCondition.
cgdt:hasPathExtensionCondition rdfs:domain cgdt:PathSchema; rdfs:range cgdt:PathExtensionCondition.
cgdt:hasEdgeSchema rdfs:domain cgdt:Graph; rdfs:range cgdt:EdgeSchema; rdfs:subPropertyOf cgdt:hasSchema.
cgdt:hasPathSchema rdfs:domain cgdt:Graph; rdfs:range cgdt:PathSchema; rdfs:subPropertyOf cgdt:hasSchema.
cgdt:filterCondition rdfs:domain cgdt:ResultSpecification; rdfs:range cgdt:Expression.
cgdt:numberOfResults rdfs:domain cgdt:ResultSpecification; rdfs:range xs:int.
cgdt:hasValuationFunction rdfs:domain cgdt:ResultSpecification; rdfs:range cgdt:Expression.
cgdt:hasSubsumptionCriterion rdfs:domain cgdt:ResultSpecification; rdfs:range xs:string.
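To give an impression of how the ontology can be consumed programmatically, the following Python sketch loads a fragment of it with rdflib and queries the class hierarchy; the namespace URI for cgdt: is a placeholder assumption (the prefix declarations above leave it open):

from rdflib import Graph, Namespace, RDFS

# Placeholder URI -- an assumption, not the namespace actually used by MARS.
CGDT = Namespace("http://example.org/cgdt#")

fragment = """
@prefix cgdt: <http://example.org/cgdt#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
cgdt:AggregateProperty rdfs:subClassOf cgdt:Property .
cgdt:SumProperty   rdfs:subClassOf cgdt:AggregateProperty .
cgdt:CountProperty rdfs:subClassOf cgdt:AggregateProperty .
cgdt:MinProperty   rdfs:subClassOf cgdt:AggregateProperty .
cgdt:MaxProperty   rdfs:subClassOf cgdt:AggregateProperty .
"""

g = Graph()
g.parse(data=fragment, format="turtle")

# Enumerate the declared kinds of aggregate properties.
for cls in sorted(g.subjects(RDFS.subClassOf, CGDT.AggregateProperty)):
    print(cls)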
Bibliography [AAA+ 07]
Alexandre Alves, Assaf Arkin, Sid Askary, Charlton Barreto, Ben Bloch, Francisco Curbera, Mark Ford, Yaron Goland, Alejandro Gu´ızar Neelakantan Kartha, Canyang Kevin Liu, Rania Khalaf, Dieter K¨onig, Mike Marin, Vinkesh Mehta, Satish Thatte, Danny van der Rijn, Prasad Yendluri, and Alex Yiu. Web Services Business Process Execution Language, Version 2.0 (wS-BPEL). http://docs.oasis-open.org/wsbpel/2.0/OS/ wsbpel-v2.0-OS.html, 2007.
[AAB+ 08]
Jos´e J´ ulio Alferes, Ricardo Amador, Erik Behrends, Oliver Fritzen, Wolfgang May, and Franz Schenk. Pre-Standardization of the Language. Technical Report I5-D10, REWERSE EU FP6 NoE, 2008. Available at http://www.rewerse.net.
[AAC+ 01]
Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass, editors. VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy. Morgan Kaufmann, 2001.
[ABJ+ 04]
Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Lud¨ascher, and Steve Mock. Kepler: An Extensible System for Design and Execution of Scientific Workflows. In SSDBM 2004, pages 423–424, 2004.
[ADG+ 09]
Jos´e Luis Ambite, Sirish Darbha, Aman Goel, Craig A. Knoblock, Kristina Lerman, Rahul Parundekar, and Thomas A. Russ. Automatically Constructing Semantic Web Services from Online Sources. In Bernstein et al. [BKH+ 09], pages 17–32.
[AEM09]
Jos´e J´ ulio Alferes, Michael Eckert, and Wolfgang May. Evolution and Reactivity in the Semantic Web. In Fran¸cois Bry and Jan Maluszynski, editors, REWERSE, volume 5500 of Lecture Notes in Computer Science, pages 161–200. Springer, 2009.
[AFKL00]
Vinod Anupam, Juliana Freire, Bharat Kumar, and Daniel F. Lieuwen. Automating Web Navigation with the WebVCR. Computer Networks, 33(1-6):503–517, 2000.
[Aga04]
Sudhir Agarwal. Specification of Invocable Semantic Web Resources. In ICWS [DBL04], pages 124–131.
[AGM08]
Albert Atserias, Martin Grohe, and D´aniel Marx. Size Bounds and Query
229
Plans for Relational Joins. In FOCS, pages 739–748. IEEE Computer Society, 2008. [AHV95]
Serge Abiteboul, Richard Hull, and Victor Vianu. Databases. Addison-Wesley, 1995.
[AL07]
S¨oren Auer and Jens Lehmann. What Have Innsbruck and Leipzig in Common? Extracting Semantics from Wiki Content. In Enrico Franconi, Michael Kifer, and Wolfgang May, editors, ESWC, volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[Alb03]
Susanne Albers. Online Algorithms: A Survey. Mathematical Programming, 97(1-2):3–26, 2003. ¨ ¨ Bernd Amann. XSL/XSLT. In Liu and Ozsu [LO09], pages 3676–3681.
[Ama09]
Foundations of
[Ame97]
American National Standards Institute. ANS X3.4-1986 (R1997): Information Systems: Coded Character Sets – 7-Bit American National Standard Code for Information Interchange, 1997.
[ANC96]
Corporate ACT-NET Consortium. ACT-NET - The Active Database Management System Manifesto: A Rulebase of ADBMS Features. SIGMOD Record, 25(3):40–49, 1996.
[AS06]
Sudhir Agarwal and Rudi Studer. Automatic Matchmaking of Web Services. In ICWS, pages 45–54. IEEE Computer Society, 2006.
[AWS06]
Richard Atterer, Monika Wnuk, and Albrecht Schmidt. Knowing the User’s Every Move: User Activity Tracking for Website Usability Evaluation and Implicit Interaction. In Les Carr, David De Roure, Arun Iyengar, Carole A. Goble, and Michael Dahlin, editors, WWW, pages 203–212. ACM, 2006.
[BBC+ 07]
Anders Berglund, Scott Boag, Don Chamberlin, Mary F. Fern´andez, Michael Kay, Jonathan Robie, and J´erˆome Sim´eon. XML Path Language (XPath) 2.0. http://www.w3.org/TR/xpath20, 2007.
[BBL08]
David Beckett and Tim Berners-Lee. Turtle – Terse RDF Triple Language. http://www.w3.org/TeamSubmission/turtle/, 2008.
[BCDM08]
Daniele Braga, Stefano Ceri, Florian Daniel, and Davide Martinenghi. Optimization of Multi-domain Queries on the Web. PVLDB, 1(1):562– 573, 2008.
[BCF+ 01]
Scott Boag, Don Chamberlin, Mary F. Fern´andez, Daniela Florescu, Jonathan Robie, and J´erˆome Sim´eon. XQuery 1.0: A Query Language for XML. http://www.w3.org/TR/xquery, 2001.
[BCG+ 07]
Boualem Benatallah, Fabio Casati, Dimitrios Georgakopoulos, Claudio Bartolini, Wasim Sadiq, and Claude Godart, editors. Web Information Systems Engineering - WISE 2007, 8th International Conference on Web Information Systems Engineering, Nancy, France, December 3-7, 2007,
230
Proceedings, volume 4831 of Lecture Notes in Computer Science. Springer, 2007. [BCH05]
Dirk Beyer, Arindam Chakrabarti, and Thomas A. Henzinger. Web Service Interfaces. In Allan Ellis and Tatsuya Hagino, editors, WWW, pages 148–159. ACM, 2005.
[BCL05]
Robert Baumgartner, Michal Ceresna, and Gerald Ledermuller. Deep Web Navigation in Web Data Extraction. In CIMCA/IAWTIC, pages 698–703. IEEE Computer Society, 2005.
[BCM+ 03]
Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003.
[Bec04]
Dave Beckett. RDF/XML Syntax Specification (Revised). http://www. w3.org/TR/REC-rdf-syntax/, 2004.
[BF07]
Luciano Barbosa and Juliana Freire. An Adaptive Crawler for Locating Hidden Web Entry Points. In Carey L. Williamson, Mary Ellen Zurko, Peter F. Patel-Schneider, and Prashant J. Shenoy, editors, WWW, pages 441–450. ACM, 2007.
[BFG01]
Robert Baumgartner, Sergio Flesca, and Georg Gottlob. Visual Web Information Extraction with Lixto. In Apers et al. [AAC+ 01], pages 119–128.
[BFMS06]
Erik Behrends, Oliver Fritzen, Wolfgang May, and Franz Schenk. Combining ECA Rules with Process Algebras for the Semantic Web. In Eiter et al. [EFHS06], pages 29–38.
[BFMS08]
Erik Behrends, Oliver Fritzen, Wolfgang May, and Franz Schenk. Embedding Event Algebras and Process Algebras in a Framework for ECA Rules for the Semantic Web. Fundamenta Informaticae, 82:237–263, 2008.
[BG04]
Dan Brickley and R.V. Guha. RDF Vocabulary Description Language 1.0: RDF Schema. http://www.w3.org/TR/rdf-schema/, 2004.
[BGH09]
Robert Baumgartner, Georg Gottlob, and Marcus Herzog. Scalable Web Data Extraction for Online Market Intelligence. PVLDB, 2(2):1512–1523, 2009.
[BH02]
Luciano Baresi and Reiko Heckel. Tutorial Introduction to Graph Transformation: A Software Engineering Perspective. In Andrea Corradini, Hartmut Ehrig, Hans-J¨org Kreowski, and Grzegorz Rozenberg, editors, ICGT, volume 2505 of Lecture Notes in Computer Science, pages 402– 429. Springer, 2002.
[BHL+ 02]
Mark H. Burstein, Jerry R. Hobbs, Ora Lassila, David L. Martin, Drew V. McDermott, Sheila A. McIlraith, Srini Narayanan, Massimo Paolucci, Terry R. Payne, and Katia P. Sycara. DAML-S: Web Service Description for the Semantic Web. In Horrocks and Hendler [HH02], pages 348–363.
231
[BHL+ 09]
Tim Bray, Dave Hollander, Andrew Layman, Richard Tobin, and Henry S. Thompson. Namespaces in XML 1.0 (Third Edition). http://www.w3. org/TR/xml-names, 2009.
[BHLM02]
Joseph Buck, Soonhoi Ha, Edward A. Lee, and David G. Messerschmitt. Ptolemy: A Framework for Simulating and Prototyping Heterogeneous Systems. pages 527–543, 2002.
[BHRT03]
Boualem Benatallah, Mohand-Said Hacid, Christophe Rey, and Farouk Toumani. Request Rewriting-based Web Service Discovery. In Dieter Fensel, Katia P. Sycara, and John Mylopoulos, editors, International Semantic Web Conference, volume 2870 of Lecture Notes in Computer Science, pages 242–257. Springer, 2003.
[Biz03]
Christian Bizer. D2R Map – A Database to RDF Mapping Language. In WWW (Posters), 2003.
[BK94]
A. J. Bonner and M. Kifer. An Overview of Transaction Logic. Theoretical Computer Science, 133(2):205–265, 1994.
[BK08]
Michael Benedikt and Christoph Koch. XPath Leashed. ACM Comput. Surv., 41(1), 2008.
[BKH+ 09]
Abraham Bernstein, David R. Karger, Tom Heath, Lee Feigenbaum, Diana Maynard, Enrico Motta, and Krishnaprasad Thirunarayan, editors. The Semantic Web - ISWC 2009, 8th International Semantic Web Conference, ISWC 2009, Chantilly, VA, USA, October 25-29, 2009. Proceedings, volume 5823 of Lecture Notes in Computer Science. Springer, 2009.
[BL07]
Tim Berners-Lee. Semantic Web Layer http://www.w3.org/2007/03/layerCake.png, MAR 2007.
[BLFM05]
Tim Berners-Lee, Roy Thomas Fielding, and Larry Masinter. RFC 3986, Uniform Resource Identifier (URI): Generic Syntax, 2005.
[BLHL01]
Tim Berners-Lee, James Hendler, and Ora Lassila. The Semantic Web. Scientific American, 2001.
[BLMM94]
Tim Berners-Lee, Larry Masinter, and Mark McCahill. RFC 1738, Uniform Resource Locators (URL), 1994.
[BNZ06]
Mario Bravetti, Manuel N´ un ˜ez, and Gianluigi Zavattaro, editors. Web Services and Formal Methods, Third International Workshop, WS-FM 2006 Vienna, Austria, September 8-9, 2006, Proceedings, volume 4184 of Lecture Notes in Computer Science. Springer, 2006.
[BP06]
Antonio Brogi and Razvan Popescu. From BPEL Processes to YAWL Workflows. In Bravetti et al. [BNZ06], pages 107–122.
[BPSM+ 08]
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and Fran¸cois Yergeau. Extensible Markup Language (XML) 1.0 (Fifth Edition). http: //www.w3.org/TR/xml/, 2008.
232
Cake.
[BvH07]
Adam Barker and Jano I. van Hemert. Scientific Workflow: A Survey and Research Directions. In Roman Wyrzykowski, Jack Dongarra, Konrad Karczewski, and Jerzy Wasniewski, editors, PPAM, volume 4967 of Lecture Notes in Computer Science, pages 746–753. Springer, 2007.
[BWR+ 05]
Michael Bolin, Matthew Webber, Philip Rha, Tom Wilson, and Robert C. Miller. Automation and Customization of Rendered Web Pages. In Patrick Baudisch, Mary Czerwinski, and Dan R. Olsen, editors, UIST, pages 163–172. ACM, 2005.
[CB74]
Donald D. Chamberlin and Raymond F. Boyce. SEQUEL: A Structured English Query Language. In Randall Rustin, editor, SIGMOD Workshop, Vol. 1, pages 249–264. ACM, 1974.
[CCCR+ 90]
Filippo Cacace, Stefano Ceri, Stefano Crespi-Reghizzi, Letizia Tanca, and Roberto Zicari. Integrating Object-Oriented Data Modeling with a RuleBased Programming Paradigm. In Hector Garcia-Molina and H. V. Jagadish, editors, SIGMOD Conference, pages 225–236. ACM Press, 1990.
[CCMW01]
Erik Christensen, Francisco Curbera, Greg Meredith, and Sanjiva Weerawarana. Web Services Description Language (WSDL) 1.1. http://www.w3.org/TR/wsdl, 2001.
[CCPS00]
John Cardiff, Tiziana Catarci, M. Passeri, and Giuseppe Santucci. Querying Multiple Databases Dynamically on the World Wide Web. In Qing Li, ¨ Z. Meral Ozsoyoglu, Roland Wagner, Yahiko Kambayashi, and Yanchun Zhang, editors, WISE, pages 238–245. IEEE Computer Society, 2000.
[CD99]
James Clark and Steve DeRose. XML Path Language (XPath) Version 1.0. http://www.w3.org/TR/xpath, 1999.
[CDA+ 06]
Isabel F. Cruz, Stefan Decker, Dean Allemang, Chris Preist, Daniel Schwabe, Peter Mika, Michael Uschold, and Lora Aroyo, editors. The Semantic Web - ISWC 2006, 5th International Semantic Web Conference, ISWC 2006, Athens, GA, USA, November 5-9, 2006, Proceedings, volume 4273 of Lecture Notes in Computer Science. Springer, 2006.
[Cer02]
Ethan Cerami. Web Services Essentials. O’Reilly & Associates, Inc., Sebastopol, CA, USA, 2002.
[CFT08]
Kendall Grant Clark, Lee Feigenbaum, and Elias Torres. SPARQL Protocol for RDF. http://www.w3.org/TR/rdf-sparql-protocol/, 2008.
[CGH+ 06]
David Churches, Gabor Gomb´as, Andrew Harrison, Jason Maassen, Craig Robinson, Matthew S. Shields, Ian J. Taylor, and Ian Wang. Programming Scientific and Distributed Workflow with Triana Services. Concurrency and Computation: Practice and Experience, 18(10):1021–1037, 2006.
[CGM02]
Junghoo Cho and Hector Garcia-Molina. Parallel Crawlers. In WWW, pages 124–135, 2002.
[CHL+ 07]
Roberto Chinnici, Hugo Haas, Amelia A. Lewis, Jean-Jacques
233
Moreau, David Orchard, and Sanjiva Weerawarana. Web Services Description Language (WSDL) Version 2.0 Part 2: Adjuncts. http://www.w3.org/TR/wsdl20-adjuncts/, 2007. [CHvRR04]
Luc Clement, Andrew Hately, Claus von Riegen, and Tony Rogers. Universal Description Discovery & Integration (UDDI), Version 3.0.2. http: //uddi.org/pubs/uddi_v3.htm, 2004.
[CHZ05]
Kevin Chen-Chuan Chang, Bin He, and Zhen Zhang. Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web. In CIDR, pages 44–55, 2005.
[CKAK94]
Sharma Chakravarthy, V. Krishnaprasad, Eman Anwar, and S.-K. Kim. Composite Events for Active Databases: Semantics, Contexts and Detection. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, VLDB, pages 606–617. Morgan Kaufmann, 1994.
[CKGS06]
Chia-Hui Chang, Mohammed Kayed, Moheb R. Girgis, and Khaled F. Shaalan. A Survey of Web Information Extraction Systems. IEEE Trans. Knowl. Data Eng., 18(10):1411–1428, 2006.
[CLR89]
Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press, 1989.
[CM94]
Sharma Chakravarthy and D. Mishra. Snoop: An Expressive Event Specification Language for Active Databases. Data Knowl. Eng., 14(1):1–26, 1994.
[CM08a]
Andrea Cal`ı and Davide Martinenghi. Conjunctive Query Containment Under Access Limitations. In Li et al. [LSYO08], pages 326–340.
[CM08b]
Andrea Cal`ı and Davide Martinenghi. Querying Data under Access Limitations. In ICDE [DBL08], pages 50–59.
[CM10]
Andrea Cal`ı and Davide Martinenghi. Querying the Deep Web. In Ioana Manolescu, Stefano Spaccapietra, Jens Teubner, Masaru Kitsuregawa, ¨ Alain L´eger, Felix Naumann, Anastasia Ailamaki, and Fatma Ozcan, editors, EDBT, volume 426 of ACM International Conference Proceeding Series, pages 724–727. ACM, 2010.
[CMRW07]
Roberto Chinnici, Jean-Jacques Moreau, Arthur Ryman, and Sanjiva Weerawarana. Web Services Description Language (WSDL) Version 2.0 Part 1: Core Language. http://www.w3.org/TR/wsdl20/, 2007. ¨ ¨ Common Object Request Broker Architecture. In Liu and Ozsu [LO09],
[COR09]
page 401. [Cre09]
Mache Creeger. Cloud Computing: An Overview. ACM Queue, 7(5):2, 2009.
[CRF03]
William W. Cohen, Pradeep D. Ravikumar, and Stephen E. Fienberg. A Comparison of String Distance Metrics for Name-Matching Tasks. In
234
[CSC97] [CWW+ 06]
[DBL01]
[DBL04]
[DBL08] [DBL09a] [DBL09b] [dBLPF06]
[dCFS04]
[DDO08]
[DFF+ 99]
[DFF+ 07]
[DFK+ 09]
Subbarao Kambhampati and Craig A. Knoblock, editors, IIWeb, pages 73–78, 2003. Tiziana Catarci, Giuseppe Santucci, and John Cardiff. Graphical Interaction with Heterogeneous Databases. VLDB J., 6(2):97–120, 1997. Huajun Chen, Yimin Wang, Heng Wang, Yuxin Mao, Jinmin Tang, Chunying Zhou, Aining Yin, and Zhaohui Wu. Towards a Semantic Web of Relational Databases: A Practical Semantic Toolkit and an In-Use Case from Traditional Chinese Medicine. In Cruz et al. [CDA+ 06], pages 750–763. 16th IEEE International Conference on Automated Software Engineering (ASE 2001), 26-29 November 2001, Coronado Island, San Diego, CA, USA. IEEE Computer Society, 2001. Proceedings of the IEEE International Conference on Web Services (ICWS’04), June 6-9, 2004, San Diego, California, USA. IEEE Computer Society, 2004. Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7-12, 2008, Canc´ un, M´exico. IEEE, 2008. IEEE International Conference on Web Services, ICWS 2009, Los Angeles, CA, USA, 6-10 July 2009. IEEE, 2009. Proceedings of the 25th International Conference on Data Engineering, ICDE 2009, March 29 2009 - April 2 2009, Shanghai, China. IEEE, 2009. Jos de Bruijn, Holger Lausen, Axel Polleres, and Dieter Fensel. The Web Service Modeling Language WSML: An Overview. In York Sure and John Domingue, editors, ESWC, volume 4011 of Lecture Notes in Computer Science, pages 590–604. Springer, 2006. Augusto de Carvalho Fontes and F´abio Soares Silva. SmartCrawl: A New Strategy for the Exploration of the Hidden Web. In Alberto H. F. Laender, Dongwon Lee, and Marc Ronthaler, editors, WIDM, pages 9–15. ACM, 2004. Remco M. Dijkman, Marlon Dumas, and Chun Ouyang. Semantics and Analysis of Business Process Models in BPMN. Inf. Softw. Technol., 50(12):1281–1294, 2008. Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu. XML-QL: A Query Language for XML. In 8th. WWW Conference. W3C, 1999. World Wide Web Consortium Technical Report, NOTE-xmlql-19980819, www.w3.org/TR/NOTE-xml-ql. Denise Draper, Peter Fankhauser, Mary Fern´andez, Ashok Malhotra, Kristoffer Rose, Michael Rys, J´erˆome Sim´eon, and Philip Wadler. XQuery 1.0 and XPath 2.0 Formal Semantics. http://www.w3.org/TR/xquerysemantics/, 2007. Cristian Duda, Gianni Frey, Donald Kossmann, Reto Matter, and Chong
235
Zhou. AJAX Crawl: Making AJAX Applications Searchable. In ICDE [DBL09b], pages 78–89. [DFKR99]
Hasan Davulcu, Juliana Freire, Michael Kifer, and I. V. Ramakrishnan. A Layered Architecture for Querying Dynamic Web Content. In Alex Delis, Christos Faloutsos, and Shahram Ghandeharizadeh, editors, SIGMOD Conference, pages 491–502. ACM Press, 1999.
[DG04]
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137–150, 2004.
[DP90]
Xiaotie Deng and Christos H. Papadimitriou. Exploring an Unknown Graph (Extended Abstract). In FOCS, volume I, pages 355–361. IEEE, 1990.
[DS05]
Martin J. D¨ urst and Michel Suignard. RFC 3987, Internationalized Resource Identifiers (IRIs), 2005.
[DVN04]
Hasan Davulcu, Srinivas Vadrevu, and Saravanakumar Nagarajan. OntoMiner: Bootstrapping Ontologies from Overlapping Domain Specific Web Sites. In Stuart I. Feldman, Mike Uretsky, Marc Najork, and Craig E. Wills, editors, WWW (Alternate Track Papers & Posters), pages 500–501. ACM, 2004.
[dVNP08]
Elena del Val Noguera and Miguel Rebollo Pedruelo. A Survey on Web Service Discovering and Composition. In Jos´e Cordeiro, Joaquim Filipe, and Slimane Hammoudi, editors, WEBIST (1), pages 135–142. INSTICC Press, 2008.
[DYKR00]
Hasan Davulcu, Guizhen Yang, Michael Kifer, and I. V. Ramakrishnan. Design and Implementation of the Physical Layer in WebBases: The XRover Experience. In Lloyd et al. [LDF+ 00], pages 1094–1105.
[DYM06]
Eduard C. Dragut, Clement T. Yu, and Weiyi Meng. Meaningful Labeling of Integrated Query Interfaces. In Umeshwar Dayal, Kyu-Young Whang, David B. Lomet, Gustavo Alonso, Guy M. Lohman, Martin L. Kersten, Sang Kyun Cha, and Young-Kuk Kim, editors, VLDB, pages 679–690. ACM, 2006.
[EFHS06]
Thomas Eiter, Enrico Franconi, Ralph Hodgson, and Susie Stephens, editors. Rules and Rule Markup Languages for the Semantic Web, Second International Conference, RuleML 2006, Athens, Georgia, USA, November 10-11, 2006, Proceedings. IEEE Computer Society, 2006.
[EGI99]
David Eppstein, Zvi Galil, and Giuseppe F. Italiano. Dynamic Graph Algorithms. In Mikhail J. Atallah, editor, Algorithms and Theory of Computation Handbook, chapter 8. CRC Press, 1999.
[EH00]
Gregor Engels and Reiko Heckel. Graph Transformation as a Conceptual and Formal Framework for System Modeling and Model Evolution. In
236
[EMK+ 04]
[FGM+ 99]
[FHK+ 97]
[Fie00]
[FK04]
[FKL01]
[Fla98] [FLM98]
[FM07] [FMRS07]
[FMS08]
[Gel85] [GH01]
Ugo Montanari, Jos´e D. P. Rolim, and Emo Welzl, editors, ICALP, volume 1853 of Lecture Notes in Computer Science, pages 127–150. Springer, 2000. Andrew Eisenberg, Jim Melton, Krishna G. Kulkarni, Jan-Eike Michels, and Fred Zemke. SQL: 2003 Has Been Published. SIGMOD Record, 33(1):119–126, 2004. R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. RFC 2616 – Hypertext Transfer Protocol – HTTP/1.1. http://www.faqs.org/rfcs/rfc2616.html, 1999. J¨ urgen Frohn, Rainer Himmer¨oder, Paul-Thomas Kandzia, Georg Lausen, and Christian Schlepphorst. FLORID: A Prototype for F-Logic. In W. A. Gray and Per-˚ Ake Larson, editors, ICDE, page 583. IEEE Computer Society, 1997. Roy Thomas Fielding. Architectural Styles and the Design of Networkbased Software Architectures. PhD thesis, University of California, Irvine, Irvine, California, 2000. Ian Foster and Carl Kesselman, editors. The Grid 2. Morgan Kaufmann Publishers, ISBN 1-55860-933-4, San Francisco, second edition edition, 2004. Juliana Freire, Bharat Kumar, and Daniel F. Lieuwen. WebViews: Accessing Personalized Web Content and Services. In WWW, pages 576–586, 2001. David Flanagan. JavaScript: The Definitive Guide. O’Reilly & Associates, Inc., Sebastopol, CA, USA, 1998. Daniela Florescu, Alon Y. Levy, and Alberto O. Mendelzon. Database Techniques for the World-Wide Web: A Survey. SIGMOD Record, 27(3):59–74, 1998. Alessandro Folli and Tom Mens. Refactoring of UML Models Using AGG. ECEASST, 8, 2007. Christian Fuss, Christof Mosler, Ulrike Ranger, and Erhard Schultchen. The Jury Is Still Out: A Comparison of AGG, Fujaba, and PROGRES. ECEASST, 6, 2007. Oliver Fritzen, Wolfgang May, and Franz Schenk. Markup and Component Interoperability for Active Rules. In Diego Calvanese and Georg Lausen, editors, RR, volume 5341 of Lecture Notes in Computer Science, pages 197–204. Springer, 2008. David Gelernter. Generative Communication in Linda. ACM Trans. Program. Lang. Syst., 7(1):80–112, 1985. Dimitra Giannakopoulou and Klaus Havelund. Automata-Based Verification of Temporal Properties on Running Programs. In ASE [DBL01], pages 412–416.
237
[GH05]
[GHJV95]
[GHM+ 07a]
[GHM+ 07b]
[GHM+ 08]
[GKB+ 04]
[GKP03] [GKP05]
[GL81] [Gog09a] [Gog09b] [Gol91] [GSMT09]
[GZC07]
238
Andrew V. Goldberg and Chris Harrelson. Computing the Shortest Path: A∗ Search Meets Graph Theory. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 156–165, 2005. Erich Gamma, Richard Helm, Ralph E. Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. AddisonWesley, Reading, MA, 1995. Martin Gudgin, Marc Hadley, Noah Mendelsohn, Jean-Jacques Moreau, Henrik Frystyk Nielsen, Anish Karmarkar, and Yves Lafon. SOAP Version 1.2 Part 1: Messaging Framework (Second Edition). http://www.w3.org/ TR/soap12-part1/, 2007. Martin Gudgin, Marc Hadley, Noah Mendelsohn, Jean-Jacques Moreau, Henrik Frystyk Nielsen, Anish Karmarkar, and Yves Lafon. SOAP Version 1.2 Part 2: Adjuncts (Second Edition). http://www.w3.org/TR/ soap12-part2/, 2007. Bernardo Cuenca Grau, Ian Horrocks, Boris Motik, Bijan Parsia, Peter F. Patel-Schneider, and Ulrike Sattler. OWL 2: The Next Step for OWL. J. Web Sem., 6(4):309–322, 2008. Georg Gottlob, Christoph Koch, Robert Baumgartner, Marcus Herzog, and Sergio Flesca. The Lixto Data Extraction Project - Back and Forth Between Theory and Practice. In ACM Symposium on Principles of Database Systems (PODS), pages 1–12, 2004. Georg Gottlob, Christoph Koch, and Reinhard Pichler. The Complexity of XPath Query Evaluation. In PODS, pages 179–190. ACM, 2003. Georg Gottlob, Christoph Koch, and Reinhard Pichler. Efficient Algorithms for Processing XPath Queries. ACM Trans. Database Syst., 30(2):444–491, 2005. Hartmann J. Genrich and Kurt Lautenbach. System Modelling with HighLevel Petri Nets. Theor. Comput. Sci., 13:109–136, 1981. ¨ ¨ Martin Gogolla. Object Constraint Language. In Liu and Ozsu [LO09], pages 1927–1929. ¨ ¨ Martin Gogolla. Unified Modeling Language. In Liu and Ozsu [LO09], pages 3232–3239. Charles F. Goldfarb. The SGML Handbook. Oxford University Press, New York, NY, USA, 1991. Shudi (Sandy) Gao, C. M. Sperberg-McQueen, and Henry S. Thompson. W3C XML Schema Definition Language (XSD) 1.1 Part 1: Structures. http://www.w3.org/TR/xmlschema11-1/, 2009. Max Goebel, Viktor Zigo, and Michal Ceresna. Digging the Wild Web: An Interactive Tool for Web Data Consolidation. In Benatallah et al. [BCG+ 07], pages 613–622.
[Hal01]
Alon Y. Halevy. Answering Queries Using Views: A Survey. VLDB J., 10(4):270–294, 2001.
[Hay04]
Patrick Hayes. RDF Semantics. http://www.w3.org/TR/rdf-mt/, 2004.
[HBW09]
Markus Held, Wolfgang Blochinger, and Moritz Werning. E-Biology Workflows with Calvin. In Gottfried Vossen, Darrell D. E. Long, and Jeffrey Xu Yu, editors, WISE, volume 5802 of Lecture Notes in Computer Science, pages 581–588. Springer, 2009.
[Hen09]
James A. Hendler. Tonight’s Dessert: Semantic Web Layer Cakes. In Lora Aroyo, Paolo Traverso, Fabio Ciravegna, Philipp Cimiano, Tom Heath, Eero Hyv¨onen, Riichiro Mizoguchi, Eyal Oren, Marta Sabou, and Elena Paslaru Bontas Simperl, editors, ESWC, volume 5554 of Lecture Notes in Computer Science, page 1. Springer, 2009.
[HGMN+ 97]
Joachim Hammer, Hector Garcia-Molina, Svetlozar Nestorov, Ramana Yerneni, Markus M. Breunig, and Vasilis Vassalos. Template-Based Wrappers in the TSIMMIS System. In Joan Peckham, editor, SIGMOD Conference, pages 532–535. ACM Press, 1997.
[HH02]
Ian Horrocks and James A. Hendler, editors. The Semantic Web - ISWC 2002, First International Semantic Web Conference, Sardinia, Italy, June 9-12, 2002, Proceedings, volume 2342 of Lecture Notes in Computer Science. Springer, 2002.
[HKB08]
Wolfgang Holzinger, Bernhard Kr¨ upl, and Robert Baumgartner. Exploiting Semantic Web Technologies to Model Web Form Interactions. In Jinpeng Huai, Robin Chen, Hsiao-Wuen Hon, Yunhao Liu, Wei-Ying Ma, Andrew Tomkins, and Xiaodong Zhang, editors, WWW, pages 1145–1146. ACM, 2008.
[HKGM08]
Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina. Can Social Bookmarking Improve Web Search? In Marc Najork, Andrei Z. Broder, and Soumen Chakrabarti, editors, WSDM, pages 195–206. ACM, 2008.
[HKL08]
Thomas Hornung, Agnes Koschmider, and Georg Lausen. Recommendation Based Process Modeling Support: Method and User Experience. In Li et al. [LSYO08], pages 265–278.
[HKM06]
Thomas Hornung, Agnes Koschmider, and Jan Mendling. Integration of Heterogeneous BPM Schemas: The Case of XPDL and BPEL. In Nacer Boudjlida, Dong Cheng, and Nicolas Guelfi, editors, CAiSE Forum, volume 231 of CEUR Workshop Proceedings. CEUR-WS.org, 2006.
[HKS+ 05]
Jan Hidders, Natalia Kwasnikowska, Jacek Sroka, Jerzy Tyszkiewicz, and Jan Van den Bussche. Petri Net + Nested Relational Calculus = Dataflow. In Robert Meersman, Zahir Tari, Mohand-Said Hacid, John Mylopoulos, ¨ Barbara Pernici, Ozalp Babaoglu, Hans-Arno Jacobsen, Joseph P. Loyall, Michael Kifer, and Stefano Spaccapietra, editors, OTM Conferences
239
(1), volume 3760 of Lecture Notes in Computer Science, pages 220–237. Springer, 2005. [HKS06]
Ian Horrocks, Oliver Kutz, and Ulrike Sattler. The Even More Irresistible SROIQ. In Patrick Doherty, John Mylopoulos, and Christopher A. Welty, editors, KR, pages 57–67. AAAI Press, 2006.
[HKS+08]
Jan Hidders, Natalia Kwasnikowska, Jacek Sroka, Jerzy Tyszkiewicz, and Jan Van den Bussche. DFL: A Dataflow Language Based On Petri Nets and Nested Relational Calculus. Inf. Syst., 33(3):261–284, 2008.
[HM07]
Darris Hupp and Robert C. Miller. Smart Bookmarks: Automatic Retroactive Macro Recording on the Web. In Chia Shen, Robert J. K. Jacob, and Ravin Balakrishnan, editors, UIST, pages 81–90. ACM, 2007.
[HM09a]
Thomas Hornung and Wolfgang May. Deep Web Queries in a Semantic Web Environment. In Witold Abramowicz and Dominik Flejter, editors, BIS (Workshops), volume 37 of Lecture Notes in Business Information Processing, pages 39–50. Springer, 2009.
[HM09b]
Thomas Hornung and Wolfgang May. Ontology-Based Support for Graph Algorithms in Online Exploration Workflows. In Robert Meersman, Pilar Herrero, and Tharam S. Dillon, editors, OTM Workshops, volume 5872 of Lecture Notes in Computer Science, pages 13–14. Springer, 2009.
[HM09c]
Thomas Hornung and Wolfgang May. Semantic Annotations and Querying of Web Data Sources. In Robert Meersman, Tharam S. Dillon, and Pilar Herrero, editors, OTM Conferences (1), volume 5870 of Lecture Notes in Computer Science, pages 112–129. Springer, 2009.
[HML09]
Thomas Hornung, Wolfgang May, and Georg Lausen. Process Algebra-Based Query Workflows. In Pascal van Eck, Jaap Gordijn, and Roel Wieringa, editors, CAiSE, volume 5565 of Lecture Notes in Computer Science, pages 440–454. Springer, 2009.
[HMS10]
Thomas Hornung, Wolfgang May, and Daniel Schubert. A Configurable Graph Data Type for Online Exploration of Web Data. In Mirjana Ivanović, Bernhard Thalheim, Barbara Catania, and Zoran Budimac, editors, GraphQ, pages 1–10. University of Novi Sad, Faculty of Sciences, Department of Mathematics and Informatics, 2010.
[HMYW04]
Hai He, Weiyi Meng, Clement T. Yu, and Zonghuan Wu. Automatic Integration of Web Search Interfaces with WISE-integrator. VLDB J., 13(3):256–273, 2004.
[HMYW05]
Hai He, Weiyi Meng, Clement T. Yu, and Zonghuan Wu. Constructing Interface Schemas for Search Interfaces of Web Databases. In Anne H. H. Ngu, Masaru Kitsuregawa, Erich J. Neuhold, Jen-Yao Chung, and Quan Z. Sheng, editors, WISE, volume 3806 of Lecture Notes in Computer Science, pages 29–42. Springer, 2005.
[Hoa85]
C. A. R. Hoare. Communicating Sequential Processes. Prentice Hall, 1985.
[HPPSH05]
Ian Horrocks, Bijan Parsia, Peter F. Patel-Schneider, and James A. Hendler. Semantic Web Architecture: Stack or Two Towers? In François Fages and Sylvain Soliman, editors, PPSWR, volume 3703 of Lecture Notes in Computer Science, pages 37–41. Springer, 2005.
[HPS04]
Ian Horrocks and Peter F. Patel-Schneider. Reducing OWL Entailment to Description Logic Satisfiability. J. Web Sem., 1(4):345–357, 2004.
[HPZC07]
Bin He, Mitesh Patel, Zhen Zhang, and Kevin Chen-Chuan Chang. Accessing the Deep Web. Commun. ACM, 50(5):94–101, 2007.
[HR01]
Klaus Havelund and Grigore Rosu. Monitoring Programs Using Rewriting. In ASE [DBL01], pages 135–143.
[HR02]
Klaus Havelund and Grigore Rosu. Synthesizing Monitors for Safety Properties. In Joost-Pieter Katoen and Perdita Stevens, editors, TACAS, volume 2280 of Lecture Notes in Computer Science, pages 342–356. Springer, 2002.
[HRBA09]
Joshua M. Hailpern, Loretta Guarino Reid, Richard Boardman, and Srinivas Annam. Web 2.0: Blind to an Accessible New World. In Quemada et al. [QLMN09], pages 821–830.
[HS99]
Ian Horrocks and Ulrike Sattler. A Description Logic with Transitive and Inverse Roles and Role Hierarchies. J. Log. Comput., 9(3):385–410, 1999.
[HS10]
Steve Harris and Andy Seaborne. SPARQL 1.1 Query Language. http://www.w3.org/TR/sparql11-query/, 2010.
[HSL06]
Thomas Hornung, Kai Simon, and Georg Lausen. Information Gathering in a Dynamic World. In José Júlio Alferes, James Bailey, Wolfgang May, and Uta Schwertel, editors, PPSWR, volume 4187 of Lecture Notes in Computer Science, pages 237–241. Springer, 2006.
[HSL08a]
Thomas Hornung, Kai Simon, and Georg Lausen. Mashing Up the Deep Web – Research in Progress. In José Cordeiro, Joaquim Filipe, and Slimane Hammoudi, editors, WEBIST (2), pages 58–66. INSTICC Press, 2008.
[HSL08b]
Thomas Hornung, Kai Simon, and Georg Lausen. Mashups over the Deep Web. In José Cordeiro, Slimane Hammoudi, and Joaquim Filipe, editors, WEBIST (Selected Papers), volume 18 of Lecture Notes in Business Information Processing, pages 228–241. Springer, 2008.
[HST99]
Ian Horrocks, Ulrike Sattler, and Stephan Tobies. Practical Reasoning for Expressive Description Logics. In Harald Ganzinger, David A. McAllester, and Andrei Voronkov, editors, LPAR, volume 1705 of Lecture Notes in Computer Science, pages 161–180. Springer, 1999.
[HZ01]
Eric A. Hansen and Shlomo Zilberstein. LAO*: A Heuristic Search Algorithm that Finds Solutions with Loops. Artif. Intell., 129(1-2):35–62, 2001.
[IAE04]
Ihab F. Ilyas, Walid G. Aref, and Ahmed K. Elmagarmid. Supporting Top-K Join Queries in Relational Databases. VLDB J., 13(3):207–221, 2004.
[ICW07]
2007 IEEE International Conference on Web Services (ICWS 2007), July 9-13, 2007, Salt Lake City, Utah, USA. IEEE Computer Society, 2007.
[Jen96]
Kurt Jensen. An Introduction to the Practical Use of Coloured Petri Nets. In Reisig and Rozenberg [RR98b], pages 237–292.
[JFGK03]
Julian Jang, Alan Fekete, Paul Greenfield, and Dean Kuo. Expressiveness of Workflow Description Languages. In Liang-Jie Zhang, editor, ICWS, pages 104–110. CSREA Press, 2003.
[JKL+04]
Nikeeta Julasana, Akshat Khandelwal, Anupama Lolage, Prabhdeep Singh, Priyanka Vasudevan, Hasan Davulcu, and I. V. Ramakrishnan. WinAgent: A System for Creating and Executing Personal Information Assistants Using a Web Browser. In Jean Vanderdonckt, Nuno Jardim Nunes, and Charles Rich, editors, IUI, pages 356–357. ACM, 2004.
[JYL+09]
Lei Ji, Jun Yan, Ning Liu, Wen Zhang, Weiguo Fan, and Zheng Chen. ExSearch: A Novel Vertical Search Engine for Online Barter Business. In David Wai-Lok Cheung, Il-Yeol Song, Wesley W. Chu, Xiaohua Hu, and Jimmy J. Lin, editors, CIKM, pages 1357–1366. ACM, 2009.
[KDYL09]
Thomas Kabisch, Eduard Constantin Dragut, Clement T. Yu, and Ulf Leser. A Hierarchical Approach to Model Web Query Interfaces for Web Source Integration. PVLDB, 2(1):325–336, 2009.
[Kep04]
Stephan Kepser. A Simple Proof for the Turing-Completeness of XSLT and XQuery. In Proceedings of the Extreme Markup Languages 2004 Conference, Montréal, Quebec, Canada, 2004.
[KGG+07]
Christoph Koch, Johannes Gehrke, Minos N. Garofalakis, Divesh Srivastava, Karl Aberer, Anand Deshpande, Daniela Florescu, Chee Yong Chan, Venkatesh Ganti, Carl-Christian Kanne, Wolfgang Klas, and Erich J. Neuhold, editors. Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007. ACM, 2007.
[Kif95]
Michael Kifer. Deductive and Object Data Languages: A Quest for Integration. In Tok Wang Ling, Alberto O. Mendelzon, and Laurent Vieille, editors, DOOD, volume 1013 of Lecture Notes in Computer Science, pages 187–212. Springer, 1995.
[KKZ09]
Matthias Klusch, Patrick Kapahnke, and Ingo Zinnikus. SAWSDL-MX2: A Machine-Learning Approach for Integrating Semantic Web Service Matchmaking Variants. In ICWS [DBL09a], pages 335–342.
[KL89]
Michael Kifer and Georg Lausen. F-Logic: A Higher-Order Language for Reasoning about Objects, Inheritance, and Scheme. In James Clifford, Bruce G. Lindsay, and David Maier, editors, SIGMOD Conference, pages 134–146. ACM Press, 1989.
[KL07]
Il-Woong Kim and Kyong-Ho Lee. Describing Semantic Web Services: From UML to OWL-S. In ICWS07 [ICW07], pages 529–536.
[KLW95]
Michael Kifer, Georg Lausen, and James Wu. Logical Foundations of Object-Oriented and Frame-Based Languages. J. ACM, 42(4):741–843, 1995.
[KR88]
Brian W. Kernighan and Dennis Ritchie. The C Programming Language, Second Edition. Prentice-Hall, 1988.
[Kru56]
Joseph Kruskal. On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.
[LAB+06]
Bertram Ludäscher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew B. Jones, Edward A. Lee, Jing Tao, and Yang Zhao. Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice and Experience, 18(10):1039–1065, 2006.
[Lau05]
Georg Lausen. Datenbanken: Grundlagen und XML-Technologien. Spektrum Akademischer Verlag, 2005.
[LC01]
Chen Li and Edward Y. Chang. Answering Queries with Useful Bindings. ACM Trans. Database Syst., 26(3):313–343, 2001.
[LDF+00]
John W. Lloyd, Verónica Dahl, Ulrich Furbach, Manfred Kerber, Kung-Kiu Lau, Catuscia Palamidessi, Luís Moniz Pereira, Yehoshua Sagiv, and Peter J. Stuckey, editors. Computational Logic - CL 2000, First International Conference, London, UK, 24-28 July, 2000, Proceedings, volume 1861 of Lecture Notes in Computer Science. Springer, 2000.
[LdSGL02]
Juliano Palmieri Lage, Altigran Soares da Silva, Paulo Braz Golgher, and Alberto H. F. Laender. Collecting Hidden Web Pages for Data Extraction. In Roger H. L. Chiang and Ee-Peng Lim, editors, WIDM, pages 69–75. ACM, 2002.
[LdSGL04]
Juliano Palmieri Lage, Altigran Soares da Silva, Paulo Braz Golgher, and Alberto H. F. Laender. Automatic Generation of Agents for Collecting Hidden Web Pages for Data Extraction. Data Knowl. Eng., 49(2):177–196, 2004.
[Len02]
Maurizio Lenzerini. Data Integration: A Theoretical Perspective. In Lucian Popa, editor, PODS, pages 233–246. ACM, 2002.
[LESY02]
Stephen W. Liddle, David W. Embley, Del T. Scott, and Sai Ho Yau. Extracting Data Behind Web Forms. In Stefano Spaccapietra, Salvatore T. March, and Yahiko Kambayashi, editors, ER (Workshops), volume 2503 of Lecture Notes in Computer Science, pages 402–413. Springer, 2002.
[LHML08]
Gilly Leshed, Eben M. Haber, Tara Matthews, and Tessa A. Lau. CoScripter: Automating & Sharing How-To Knowledge in the Enterprise. In Mary Czerwinski, Arnold M. Lund, and Desney S. Tan, editors, CHI, pages 1719–1728. ACM, 2008.
[Li03]
Chen Li. Computing Complete Answers to Queries in the Presence of Limited Access Patterns. VLDB J., 12(3):211–227, 2003.
[LNL+10]
Ian Li, Jeffrey Nichols, Tessa A. Lau, Clemens Drews, and Allen Cypher. Here’s What I Did: Sharing and Reusing Web Activity with ActionShot. In Elizabeth D. Mynatt, Don Schoner, Geraldine Fitzpatrick, Scott E. Hudson, Keith Edwards, and Tom Rodden, editors, CHI, pages 723–732. ACM, 2010.
[LO03]
Kirsten Lenz and Andreas Oberweis. Inter-Organizational Business Process Management with XML Nets. In Hartmut Ehrig, Wolfgang Reisig, Grzegorz Rozenberg, and Herbert Weber, editors, Petri Net Technology for Communication-Based Systems, volume 2472 of Lecture Notes in Computer Science, pages 243–263. Springer, 2003.
[LÖ09]
Ling Liu and M. Tamer Özsu, editors. Encyclopedia of Database Systems. Springer US, 2009.
[LRNdST02]
Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran Soares da Silva, and Juliana S. Teixeira. A Brief Survey of Web Data Extraction Tools. SIGMOD Record, 31(2):84–93, 2002.
[LRO96]
Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Querying Heterogeneous Information Sources Using Source Descriptions. In Vijayaraman et al. [VBMS96], pages 251–262.
[LS04]
Wolfgang Lehner and Harald Schöning. XQuery. dpunkt, 2004.
[LSQC05]
Ying Li, Kewei Sun, Jie Qiu, and Ying Chen. Self-Reconfiguration of Service-Based Systems: A Case Study for Service Level Agreements and Resource Optimization. In ICWS, pages 266–273. IEEE Computer Society, 2005.
[LSYO08]
Qing Li, Stefano Spaccapietra, Eric S. K. Yu, and Antoni Olivé, editors. Conceptual Modeling - ER 2008, 27th International Conference on Conceptual Modeling, Barcelona, Spain, October 20-24, 2008. Proceedings, volume 5231 of Lecture Notes in Computer Science. Springer, 2008.
[LYRS07]
Jianguo Lu, Yijun Yu, Debashis Roy, and Deepa Saha. Web Service Composition: A Reality Check. In Benatallah et al. [BCG+07], pages 523–532.
[LYV+98]
Chen Li, Ramana Yerneni, Vasilis Vassalos, Hector Garcia-Molina, Yannis Papakonstantinou, Jeffrey D. Ullman, and Murty Valiveti. Capability Based Mediation in TSIMMIS. In Laura M. Haas and Ashutosh Tiwary, editors, SIGMOD Conference, pages 564–566. ACM Press, 1998.
[MAA05a]
Wolfgang May, José Júlio Alferes, and Ricardo Amador. Active Rules in the Semantic Web: Dealing with Language Heterogeneity. In Asaf Adi, Suzette Stoutenburg, and Said Tabet, editors, RuleML, volume 3791 of Lecture Notes in Computer Science, pages 30–44. Springer, 2005.
[MAA05b]
Wolfgang May, José Júlio Alferes, and Ricardo Amador. Active Rules in the Semantic Web: Dealing with Language Heterogeneity. In Rule Markup Languages (RuleML), number 3791 in LNCS, pages 30–44. Springer, 2005.
[MAA05c]
Wolfgang May, José Júlio Alferes, and Ricardo Amador. An Ontology- and Resources-Based Approach to Evolution and Reactivity in the Semantic Web. In Robert Meersman, Zahir Tari, Mohand-Said Hacid, John Mylopoulos, Barbara Pernici, Özalp Babaoglu, Hans-Arno Jacobsen, Joseph P. Loyall, Michael Kifer, and Stefano Spaccapietra, editors, OTM Conferences (2), volume 3761 of Lecture Notes in Computer Science, pages 1553–1570. Springer, 2005.
[MAAH09]
Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, and Alon Y. Halevy. Harnessing the Deep Web: Present and Future. In CIDR. www.cidrdb.org, 2009.
[May09]
Wolfgang May. A Database-based Service for Handling Logical Variable Bindings. In Database-as-a-Service Workshop, pages 13–25, 2009.
[MBM+07]
David L. Martin, Mark H. Burstein, Drew V. McDermott, Sheila A. McIlraith, Massimo Paolucci, Katia P. Sycara, Deborah L. McGuinness, Evren Sirin, and Naveen Srinivasan. Bringing Semantics to Web Services with OWL-S. World Wide Web, 10(3):243–277, 2007.
[MGH+09]
Boris Motik, Bernardo Cuenca Grau, Ian Horrocks, Zhe Wu, Achille Fokoue, and Carsten Lutz. OWL 2 Web Ontology Language Profiles. http://www.w3.org/TR/owl2-profiles/, 2009.
[MHS09]
Boris Motik, Ian Horrocks, and Ulrike Sattler. Bridging the Gap Between OWL and Relational Databases. J. Web Sem., 7(2):74–89, 2009.
[Mic08]
Sun Microsystems. Java Simple Date Format. http://java.sun.com/javase/6/docs/api/java/text/SimpleDateFormat.html, 2008.
[Mil83]
Robin Milner. Calculi for Synchrony and Asynchrony. Theor. Comput. Sci., 25:267–310, 1983.
[Mil89]
Robin Milner. Communication and Concurrency. Prentice Hall, 1989.
[ML07]
Nilo Mitra and Yves Lafon. SOAP Version 1.2 Part 0: Primer (Second Edition). http://www.w3.org/TR/soap12-part0/, 2007.
[MM04]
Frank Manola and Eric Miller. RDF Primer. http://www.w3.org/TR/rdf-primer/, 2004.
[MN89]
Kurt Mehlhorn and Stefan Näher. LEDA: A Library of Efficient Data Types and Algorithms. In Antoni Kreczmar and Grazyna Mirkowska, editors, MFCS, volume 379 of Lecture Notes in Computer Science, pages 88–106. Springer, 1989.
[MPR+08a]
Paula Montoto, Alberto Pan, Juan Raposo, José Losada, Fernando Bellas, and Victor Carneiro. A Workflow Language for Web Automation. J. UCS, 14(11):1838–1856, 2008.
[MPR+08b]
Paula Montoto, Alberto Pan, Juan Raposo, José Losada, Fernando Bellas, and Javier López. A Workflow-Based Approach for Creating Complex Web Wrappers. In James Bailey, David Maier, Klaus-Dieter Schewe, Bernhard Thalheim, and Xiaoyang Sean Wang, editors, WISE, volume 5175 of Lecture Notes in Computer Science, pages 396–409. Springer, 2008.
[MPR+09a]
Paula Montoto, Alberto Pan, Juan Raposo, Fernando Bellas, and Javier López. Automating Navigation Sequences in AJAX Websites. In Martin Gaedke, Michael Grossniklaus, and Oscar Díaz, editors, ICWE, volume 5648 of Lecture Notes in Computer Science, pages 166–180. Springer, 2009.
[MPR+09b]
Paula Montoto, Alberto Pan, Juan Raposo, Fernando Bellas, and Javier López. Web Navigation Sequences Automation in Modern Websites. In Sourav S. Bhowmick, Josef Küng, and Roland Wagner, editors, DEXA, volume 5690 of Lecture Notes in Computer Science, pages 302–316. Springer, 2009.
[MPW07]
David Martin, Massimo Paolucci, and Matthias Wagner. Bringing Semantic Annotations to Web Services: OWL-S from the SAWSDL Perspective. In Karl Aberer, Key-Sun Choi, Natasha Fridman Noy, Dean Allemang, Kyung-Il Lee, Lyndon J. B. Nixon, Jennifer Golbeck, Peter Mika, Diana Maynard, Riichiro Mizoguchi, Guus Schreiber, and Philippe Cudré-Mauroux, editors, ISWC/ASWC, volume 4825 of Lecture Notes in Computer Science, pages 340–352. Springer, 2007.
[MS99]
Christopher D. Manning and Hinrich Schuetze. Foundations of Statistical Natural Language Processing. The MIT Press, 1st edition, June 1999.
[MS02]
Sheila A. McIlraith and Tran Cao Son. Adapting Golog for Composition of Semantic Web Services. In Dieter Fensel, Fausto Giunchiglia, Deborah L. McGuinness, and Mary-Anne Williams, editors, KR, pages 482–496. Morgan Kaufmann, 2002.
[MvH04]
Deborah L. McGuinness and Frank van Harmelen. OWL Web Ontology Language. http://www.w3.org/TR/owl-features/, 2004.
[NL04a]
Alan Nash and Bertram Ludäscher. Processing First-Order Queries under Limited Access Patterns. In Alin Deutsch, editor, PODS, pages 307–318. ACM, 2004.
[NL04b]
Alan Nash and Bertram Ludäscher. Processing Unions of Conjunctive Queries with Negation under Limited Access Patterns. In Elisa Bertino, Stavros Christodoulakis, Dimitris Plexousakis, Vassilis Christophides, Manolis Koubarakis, Klemens Böhm, and Elena Ferrari, editors, EDBT, volume 2992 of Lecture Notes in Computer Science, pages 422–440. Springer, 2004.
[NMTM98]
Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom M. Mitchell. Learning to Classify Text from Labeled and Unlabeled Documents. In AAAI/IAAI, pages 792–799, 1998.
[NNZ00]
Ulrich Nickel, Jörg Niere, and Albert Zündorf. The FUJABA Environment. In ICSE, pages 742–745, 2000.
[NWM07]
Zaiqing Nie, Ji-Rong Wen, and Wei-Ying Ma. Object-level Vertical Search. In CIDR, pages 235–246. www.cidrdb.org, 2007.
[OGA+06]
Thomas M. Oinn, R. Mark Greenwood, Matthew Addis, M. Nedim Alpdemir, Justin Ferris, Kevin Glover, Carole A. Goble, Antoon Goderis, Duncan Hull, Darren Marvin, Peter Li, Phillip W. Lord, Matthew R. Pocock, Martin Senger, Robert Stevens, Anil Wipat, and Chris Wroe. Taverna: Lessons in Creating a Workflow Environment for the Life Sciences. Concurrency and Computation: Practice and Experience, 18(10):1067–1100, 2006.
[ON10]
Christopher Olston and Marc Najork. Web Crawling. Foundations and Trends in Information Retrieval, 4(3):175–246, 2010.
[ORS+08]
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: A Not-so-foreign Language for Data Processing. In Jason Tsong-Li Wang, editor, SIGMOD Conference, pages 1099–1110. ACM, 2008.
[OS96]
Andreas Oberweis and Peter Sander. Information System Behavior Specification by High-Level Petri Nets. ACM Trans. Inf. Syst., 14(4):380–420, 1996.
[OVvdA+07]
Chun Ouyang, Eric Verbeek, Wil M. P. van der Aalst, Stephan Breutel, Marlon Dumas, and Arthur H. M. ter Hofstede. Formal Semantics and Analysis of Control Flow in WS-BPEL. Sci. Comput. Program., 67(2-3):162–198, 2007.
[PAA+02]
Steven Pemberton, Daniel Austin, Jonny Axelsson, Tantek Çelik, Doug Dominiak, Herman Elenbaas, Beth Epperson, Masayasu Ishikawa, Shin’ichi Matsui, Shane McCarron, Ann Navarro, Subramanian Peruvemba, Rob Relyea, Sebastian Schnitzenbaumer, and Peter Stark. XHTML 1.0 The Extensible HyperText Markup Language (Second Edition). http://www.w3.org/TR/xhtml1/, 2002.
[PAG06]
Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. Semantics and Complexity of SPARQL. In Cruz et al. [CDA+06], pages 30–43.
[PAG09]
Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. Semantics and Complexity of SPARQL. ACM Trans. Database Syst., 34(3), 2009.
[PAGM96]
Yannis Papakonstantinou, Serge Abiteboul, and Hector Garcia-Molina. Object Fusion in Mediator Systems. In Vijayaraman et al. [VBMS96], pages 413–424.
[Pau05]
Linda Dailey Paulson. Building Rich Web Applications with AJAX. IEEE Computer, 38(10):14–17, 2005.
[Pet62]
C. A. Petri. Fundamentals of a Theory of Asynchronous Information Flow. In IFIP Congress, pages 386–390, 1962.
[PGM+09]
David Peterson, Shudi (Sandy) Gao, Ashok Malhotra, C. M. Sperberg-McQueen, and Henry S. Thompson. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. http://www.w3.org/TR/xmlschema11-2/, 2009.
[PGMW95]
Yannis Papakonstantinou, Hector Garcia-Molina, and Jennifer Widom. Object Exchange Across Heterogeneous Information Sources. In Philip S. Yu and Arbee L. P. Chen, editors, ICDE, pages 251–260. IEEE Computer Society, 1995.
[PH07]
Jeff Z. Pan and Ian Horrocks. RDFS(FA): Connecting RDF(S) and OWL DL. IEEE Trans. Knowl. Data Eng., 19(2):192–206, 2007.
[PKPS02]
Massimo Paolucci, Takahiro Kawamura, Terry R. Payne, and Katia P. Sycara. Semantic Matching of Web Services Capabilities. In Horrocks and Hendler [HH02], pages 333–347.
[Pos82]
Jonathan B. Postel. RFC 821 – Simple Mail Transfer Protocol. http://www.faqs.org/rfcs/rfc821.html, 1982.
[PPH+09]
Danh Le Phuoc, Axel Polleres, Manfred Hauswirth, Giovanni Tummarello, and Christian Morbidoni. Rapid Prototyping of Semantic Mash-Ups Through Semantic Web Pipes. In Quemada et al. [QLMN09], pages 581–590.
[PRÁ+02]
Alberto Pan, Juan Raposo, Manuel Álvarez, Justo Hidalgo, and Ángel Viña. Semi-Automatic Wrapper Generation for Commercial Web Sources. In Colette Rolland, Sjaak Brinkkemper, and Motoshi Saeki, editors, Engineering Information Systems in the Internet Context, volume 231 of IFIP Conference Proceedings, pages 265–283. Kluwer, 2002.
[PRÁ+07]
Alberto Pan, Juan Raposo, Manuel Álvarez, Victor Carneiro, and Fernando Bellas. Automatically Maintaining Navigation Sequences for Querying Semi-structured Web Sources. Data Knowl. Eng., 63(3):795–810, 2007.
[Pri57]
R. C. Prim. Shortest Connection Networks and Some Generalisations. Bell System Technical Journal, 36(6):1389–1401, 1957.
[PS08]
Eric Prud’hommeaux and Andy Seaborne. SPARQL Query Language for RDF. http://www.w3.org/TR/rdf-sparql-query/, 2008.
[PTA06]
Michael Pantazoglou, Aphrodite Tsalgatidou, and George Athanasopoulos. Quantified Matchmaking of Heterogeneous Services. In Karl Aberer, Zhiyong Peng, Elke A. Rundensteiner, Yanchun Zhang, and Xuhui Li, editors, WISE, volume 4255 of Lecture Notes in Computer Science, pages 144–155. Springer, 2006.
[PUHM09]
Héctor Pérez-Urbina, Ian Horrocks, and Boris Motik. Efficient Query Answering for OWL 2. In Bernstein et al. [BKH+09], pages 489–504.
[QLDG09]
Ling Qian, Zhiguo Luo, Yujian Du, and Leitao Guo. Cloud Computing: An Overview. In Martin Gilje Jaatun, Gansen Zhao, and Chunming Rong, editors, CloudCom, volume 5931 of Lecture Notes in Computer Science, pages 626–631. Springer, 2009.
[QLMN09]
Juan Quemada, Gonzalo León, Yoëlle S. Maarek, and Wolfgang Nejdl, editors. Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, April 20-24, 2009. ACM, 2009.
[RÁLP06]
Juan Raposo, Manuel Álvarez, José Losada, and Alberto Pan. Maintaining Web Navigation Flows for Wrappers. In Juhnyoung Lee, Junho Shim, Sang-goo Lee, Christoph Bussler, and Simon S. Y. Shim, editors, DEECS, volume 4055 of Lecture Notes in Computer Science, pages 100–114. Springer, 2006.
[RdBM+06]
Dumitru Roman, Jos de Bruijn, Adrian Mocan, Holger Lausen, John Domingue, Christoph Bussler, and Dieter Fensel. WWW: WSMO, WSML, and WSMX in a Nutshell. In Riichiro Mizoguchi, Zhongzhi Shi, and Fausto Giunchiglia, editors, ASWC, volume 4185 of Lecture Notes in Computer Science, pages 516–522. Springer, 2006.
[RGM00]
Sriram Raghavan and Hector Garcia-Molina. Crawling the Hidden Web. Technical Report 2000-36, Stanford InfoLab, 2000.
[RGM01]
Sriram Raghavan and Hector Garcia-Molina. Crawling the Hidden Web. In Apers et al. [AAC+01], pages 129–138.
[RHJ99]
Dave Raggett, Arnaud Le Hors, and Ian Jacobs. HTML 4.01 Specification. http://www.w3.org/TR/html4/, 1999.
[RK07]
Dumitru Roman and Michael Kifer. Reasoning about the Behavior of Semantic Web Services with Concurrent Transaction Logic. In Koch et al. [KGG+07], pages 627–638.
[RN03]
Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2003.
[RR98a]
Wolfgang Reisig and Grzegorz Rozenberg, editors. Lectures on Petri Nets I: Basic Models, Advances in Petri Nets, the volumes are based on the Advanced Course on Petri Nets, held in Dagstuhl, September 1996, volume 1491 of Lecture Notes in Computer Science. Springer, 1998.
[RR98b]
Wolfgang Reisig and Grzegorz Rozenberg, editors. Lectures on Petri Nets II: Applications, Advances in Petri Nets, the volumes are based on the Advanced Course on Petri Nets, held in Dagstuhl, September 1996, volume 1492 of Lecture Notes in Computer Science. Springer, 1998.
[RR07]
Leonard Richardson and Sam Ruby. RESTful Web Services. O’Reilly, 2007.
[RS04]
Jinghai Rao and Xiaomeng Su. A Survey of Automated Web Service Composition Methods. In Jorge Cardoso and Amit P. Sheth, editors, SWSWPC, volume 3387 of Lecture Notes in Computer Science, pages 43–54. Springer, 2004.
[RtHEvdA05]
Nick Russell, Arthur H. M. ter Hofstede, David Edmond, and Wil M. P. van der Aalst. Workflow Data Patterns: Identification, Representation and Tool Support. In Lois M. L. Delcambre, Christian Kop, Heinrich C. Mayr, John Mylopoulos, and Oscar Pastor, editors, ER, volume 3716 of Lecture Notes in Computer Science, pages 353–368. Springer, 2005.
[SBS04]
Gwen Salaün, Lucas Bordeaux, and Marco Schaerf. Describing and Reasoning on Web Services using Process Algebra. In ICWS [DBL04], pages 43–54.
[Sch98]
Stefan Schuster. Knowledge Representation and Graph Transformation. In Hartmut Ehrig, Gregor Engels, Hans-Jörg Kreowski, and Grzegorz Rozenberg, editors, TAGT, volume 1764 of Lecture Notes in Computer Science, pages 228–237. Springer, 1998.
[Sch09a]
Michael Schmidt. Foundations of SPARQL Query Optimization. PhD thesis, Technische Fakultät, Institut für Informatik, Albert-Ludwigs-Universität Freiburg, Georges-Köhler-Allee, Gebäude 051, Freiburg i. Br., Germany, 2009.
[Sch09b]
Daniel Schubert. RDF Rules in the MARS Framework. Master’s thesis, Institute of Informatics, Georg-August-Universität Göttingen, Goldschmidtstrasse 7, 37077 Göttingen, April 2009.
[SG05]
Andrew Stellman and Jennifer Greene. Applied Software Project Management. O’Reilly Media, 2005.
[SHL06]
Kai Simon, Thomas Hornung, and Georg Lausen. Learning Rules to Preprocess Web Data for Automatic Integration. In Eiter et al. [EFHS06], pages 107–116.
[Sim08]
Kai Simon. ViPER: Visual Perception based Information Extraction of Structured Web Content. PhD thesis, Fakultät für Angewandte Wissenschaften, Institut für Informatik, Albert-Ludwigs-Universität Freiburg, Georges-Köhler-Allee, Gebäude 051, Freiburg i. Br., Germany, 2008.
[SKDN05]
Srinath Shankar, Ameet Kini, David J. DeWitt, and Jeffrey F. Naughton. Integrating Databases and Workflow Systems. SIGMOD Record, 34(3):5–11, 2005.
[SL05]
Kai Simon and Georg Lausen. ViPER: Augmenting Automatic Information Extraction with Visual Perceptions. In Otthein Herzog, Hans-Jörg Schek, Norbert Fuhr, Abdur Chowdhury, and Wilfried Teiken, editors, CIKM, pages 381–388. ACM, 2005.
[SMHY07]
Liangcai Shu, Weiyi Meng, Hai He, and Clement T. Yu. Querying Capability Modeling and Construction of Deep Web Sources. In Benatallah et al. [BCG+07], pages 13–25.
[SMT00]
C. M. Sperberg-McQueen and Henry Thompson. XML Schema. http://www.w3.org/XML/Schema, 2000.
[SPH04]
Evren Sirin, Bijan Parsia, and James A. Hendler. Filtering and Selecting Semantic Web Services with Interactive Composition Techniques. IEEE Intelligent Systems, 19(4):42–49, 2004.
[SS09]
Kenn Scribner and Scott Seely. Effective REST Services via .NET. Addison-Wesley, 2009.
[Ste05]
Christian Stefansen. SMAWL: A SMall Workflow Language Based on CCS. In Orlando Belo, Johann Eder, João Falcão e Cunha, and Oscar Pastor, editors, CAiSE Short Paper Proceedings, volume 161 of CEUR Workshop Proceedings. CEUR-WS.org, 2005.
[SWZ99]
Andy Schürr, Andreas J. Winter, and Albert Zündorf. The PROGRES Approach: Language and Environment. In Handbook on Graph Grammars and Computing by Graph Transformation: Applications, Languages, and Tools, pages 487–550. World Scientific, 1999.
[Tae03]
Gabriele Taentzer. AGG: A Graph Transformation Environment for Modeling and Validation of Software. In John L. Pfaltz, Manfred Nagl, and Boris Böhlen, editors, AGTIVE, volume 3062 of Lecture Notes in Computer Science, pages 446–453. Springer, 2003.
[Tan92]
Andrew S. Tanenbaum. Modern Operating Systems. Prentice-Hall, 1992.
[TCHJ04]
Yu Tang, Luo Chen, Kai-Tao He, and Ning Jing. SRN: An Extended Petri-Net-Based Workflow Model for Web Service Composition. In ICWS [DBL04], pages 591–599.
[TCS+09]
Wei Tan, Kyle Chard, Dinanath Sulakhe, Ravi K. Madduri, Ian T. Foster, Stian Soiland-Reyes, and Carole A. Goble. Scientific Workflows as Services in caGrid: A Taverna and gRAVI Approach. In ICWS [DBL09a], pages 413–420.
[TE09]
Christiane Taras and Thomas Ertl. Interaction with Colored Graphical Representations on Braille Devices. In Constantine Stephanidis, editor, HCI (5), volume 5614 of Lecture Notes in Computer Science, pages 164–173. Springer, 2009.
[TG07]
John T. E. Timm and Gerald C. Gannod. Specifying Semantic Web Service Compositions using UML and OCL. In ICWS07 [ICW07], pages 521–528.
[TS06]
Mike Thelwall and David Stuart. Web Crawling Ethics Revisited: Cost, Privacy, and Denial of Service. JASIST, 57(13):1771–1779, 2006.
[UC06]
The Unicode Consortium. The Unicode Standard, Version 5.0. Addison-Wesley, Reading, MA, USA, 2006.
[Ull97]
Jeffrey D. Ullman. Information Integration Using Logical Views. In Foto N. Afrati and Phokion G. Kolaitis, editors, ICDT, volume 1186 of Lecture Notes in Computer Science, pages 19–40. Springer, 1997.
[VBMS96]
T. M. Vijayaraman, Alejandro P. Buchmann, C. Mohan, and Nandlal L. Sarda, editors. VLDB’96, Proceedings of 22th International Conference on Very Large Data Bases, September 3-6, 1996, Mumbai (Bombay), India. Morgan Kaufmann, 1996.
[VCNS08]
Roman Vaculín, Huajun Chen, Roman Neruda, and Katia P. Sycara. Modeling and Discovery of Data Providing Services. In ICWS, pages 54–61. IEEE Computer Society, 2008.
[vdA98]
Wil M. P. van der Aalst. The Application of Petri Nets to Workflow Management. Journal of Circuits, Systems, and Computers, 8(1):21–66, 1998.
[vdAADtH04]
Wil M. P. van der Aalst, Lachlan Aldred, Marlon Dumas, and Arthur H. M. ter Hofstede. Design and Implementation of the YAWL System. In Anne Persson and Janis Stirna, editors, CAiSE, volume 3084 of Lecture Notes in Computer Science, pages 142–159. Springer, 2004.
[vdAP06]
Wil M. P. van der Aalst and Maja Pesic. DecSerFlow: Towards a Truly Declarative Service Flow Language. In Bravetti et al. [BNZ06], pages 1–23.
[vdAtH05]
Wil M. P. van der Aalst and Arthur H. M. ter Hofstede. YAWL: Yet Another Workflow Language. Inf. Syst., 30(4):245–275, 2005.
[vdAtHKB03]
Wil M. P. van der Aalst, Arthur H. M. ter Hofstede, Bartek Kiepuszewski, and Alistair P. Barros. Workflow Patterns. Distributed and Parallel Databases, 14(1):5–51, 2003.
[vdAvH02]
Wil M. P. van der Aalst and Kees M. van Hee. Workflow Management: Models, Methods, and Systems. MIT Press, 2002.
[vDKV00]
Arie van Deursen, Paul Klint, and Joost Visser. Domain-Specific Languages: An Annotated Bibliography. SIGPLAN Notices, 35(6):26–36, 2000.
[VKVF08]
Tomas Vitvar, Jacek Kopecký, Jana Viskova, and Dieter Fensel. WSMO-Lite Annotations for Web Services. In Sean Bechhofer, Manfred Hauswirth, Jörg Hoffmann, and Manolis Koubarakis, editors, ESWC, volume 5021 of Lecture Notes in Computer Science, pages 674–689. Springer, 2008.
[VSRM08]
Marko Vrhovnik, Holger Schwarz, Sylvia Radeschütz, and Bernhard Mitschang. An Overview of SQL Support in Workflow Products. In ICDE [DBL08], pages 1287–1296.
[VSS+07]
Marko Vrhovnik, Holger Schwarz, Oliver Suhre, Bernhard Mitschang, Volker Markl, Albert Maier, and Tobias Kraft. An Approach to Optimize Data Processing in Business Processes. In Koch et al. [KGG+07], pages 615–626.
[Wan08]
Yang Wang. Deep Web Navigation by Example. Master’s thesis, Institute of Computer Science, Albert-Ludwigs University Freiburg, Georges-Köhler-Allee, Geb. 51, 79110 Freiburg, March 2008.
[Web09]
Web Crawler. In Liu and Özsu [LÖ09], page 3462.
[WH08a]
Yang Wang and Thomas Hornung. Deep Web Navigation by Example. In Dominik Flejter, Slawomir Grzonkowski, Tomasz Kaczmarek, Marek Kowalkiewicz, Tadhg Nagle, and Jonny Parkes, editors, BIS (Workshops), volume 333 of CEUR Workshop Proceedings, pages 131–140. CEUR-WS.org, 2008.
[WH08b]
Yang Wang and Thomas Hornung. Deep Web Navigation by Example. Scalable Computing, Practice and Experience, 9(4):281–292, 2008.
[Wie92]
Gio Wiederhold. Mediators in the Architecture of Future Information Systems. IEEE Computer, 25(3):38–49, 1992.
[Win03]
Dave Winer. XML-RPC Specification. http://www.xmlrpc.com/spec, 2003.
[WKD04]
Gerhard Weikum, Arnd Christian König, and Stefan Deßloch, editors. Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13-18, 2004. ACM, 2004.
[WL02]
Mark D. Wilkinson and Matthew Links. BioMOBY: An Open Source Biological Web Services Proposal. Briefings in Bioinformatics, 3(4):331–341, 2002.
[WMLF98]
Peter Wyckoff, Stephen W. McLaughry, Tobin J. Lehman, and Daniel Alexander Ford. T Spaces. IBM Systems Journal, 37(3):454–474, 1998.
[WvdADtH03]
Petia Wohed, Wil M. P. van der Aalst, Marlon Dumas, and Arthur H. M. ter Hofstede. Analysis of Web Services Composition Languages: The Case of BPEL4WS. In Il-Yeol Song, Stephen W. Liddle, Tok Wang Ling, and Peter Scheuermann, editors, ER, volume 2813 of Lecture Notes in Computer Science, pages 200–215. Springer, 2003.
[WYDM04]
Wensheng Wu, Clement T. Yu, AnHai Doan, and Weiyi Meng. An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web. In Weikum et al. [WKD04], pages 95–106.
[YK00]
Guizhen Yang and Michael Kifer. FLORA: Implementing an Efficient DOOD System Using a Tabling Logic Engine. In Lloyd et al. [LDF+00], pages 1078–1093.
[YK04]
Xiaochuan Yi and Krys Kochut. A CP-Nets-Based Design and Verification Framework for Web Services Composition. In ICWS [DBL04], pages 756–760.
[YKC06]
Guizhen Yang, Michael Kifer, and Vinay K. Chaudhri. Efficiently Ordering Subgoals with Access Constraints. In Stijn Vansummeren, editor, PODS, pages 183–192. ACM, 2006.
[ZBML09]
Daniel Zinn, Shawn Bowers, Timothy M. McPhillips, and Bertram Ludäscher. X-CSR: Dataflow Optimization for Distributed XML Process Pipelines. In ICDE [DBL09b], pages 577–580.
[ZCCK04]
Jia Zhang, Jen-Yao Chung, Carl K. Chang, and Seongwoon Kim. WS-Net: A Petri-Net Based Specification Model for Web Services. In ICWS [DBL04], pages 420–427.
[ZHC04]
Zhen Zhang, Bin He, and Kevin Chen-Chuan Chang. Understanding Web Query Interfaces: Best-effort Parsing with Hidden Syntax. In Weikum et al. [WKD04], pages 107–118.
[Zlo77]
Moshé M. Zloof. Query-by-Example: A Data Base Language. IBM Systems Journal, 16(4):324–343, 1977.