Modeling and Querying Data Warehouses on the Semantic Web using QB4OLAP

Lorena Etcheverry¹, Alejandro Vaisman², Esteban Zimányi³

¹ Universidad de la República, Uruguay
[email protected]
² Instituto Tecnológico de Buenos Aires, Argentina
[email protected]
³ Université Libre de Bruxelles
[email protected]
Abstract. The web is changing the way in which data warehouses are designed and exploited. Nowadays, for many data analysis tasks, data contained in a conventional data warehouse may not suffice, and external data sources, like the web, can provide useful multidimensional information. Large repositories of semantically annotated data are becoming available on the web, opening new opportunities for enhancing current decision-support systems. Data warehousing technology must be prepared to handle semantic web data, and the representation of multidimensional data in RDF is crucial to achieve this goal. In this paper we present a set of rules that, given a conceptual data warehouse model, translate its schema into an RDF-based multidimensional model and populate the result with the corresponding triples. For this, we extend the recently proposed QB4OLAP vocabulary in order to support the representation of the constructs of the multidimensional model. We conclude the paper by showing how complex real-world OLAP queries expressed in SPARQL can be posed to the resulting model.
1 Introduction
Data warehouses (DWs) [12] are represented using the multidimensional model, which views data in an n-dimensional space, usually called a data cube, consisting of facts (the cells of the cube) linked to several dimensions. A fact represents the focus of analysis (for example, analysis of sales in stores) and typically includes attributes, called measures, usually represented as numeric values. Dimensions are organized into hierarchies, which allow users to explore and aggregate measures at various levels of detail. DWs are exploited using online analytical processing (OLAP) tools.

The web is changing the way in which data warehouses are designed, used, and exploited [4]. For some data analysis tasks (like the worldwide price evolution of some product), the data contained in a conventional data warehouse may not suffice. The web can provide useful multidimensional information, although it is usually too volatile to be permanently stored. Further, organizations may want to share their data cubes, for example, among different branches. In the semantic web, domain ontologies expressed in RDF (the basic data representation layer) or in languages defined on top of RDF, like OWL, define a common terminology for the concepts involved in a particular domain. In addition, many applications attach metadata and semantic annotations to the information they produce (e.g., in medical applications, medical imaging, laboratory tests). Thus, large repositories of semantically annotated data are currently available, opening new opportunities for enhancing current decision-support systems.

In this paper we address the analysis of semantic web data using OLAP techniques. A key requisite for this is the logical representation of multidimensional data in RDF, for which a common vocabulary with clear semantics must be used. The first proposal to cover many of the multidimensional model components was the QB4OLAP vocabulary [5,6]. Usually, in a relational DW representation, a conceptual model is translated into a collection of tables organized in specialized structures, basically star and snowflake schemas, which relate a fact table to several dimension tables. In a semantic web DW scenario, the logical model becomes the RDF data model.

In this paper, after a brief introduction to semantic web concepts (Section 2), we present an extension of QB4OLAP that supports the most commonly used model characteristics, like balanced, recursive, ragged, and many-to-many hierarchies (Section 3). We then propose a translation of the multidimensional model into a logical RDF model through a set of rules that map each conceptual schema, and the corresponding DW instances, into a set of RDF triples, using the QB4OLAP vocabulary. This translation is implemented as an R2RML⁴ mapping, which can then be used to generate the triples (Section 4). Finally, we show how the model can be queried using SPARQL, and discuss open challenges and future work (Section 5).
2 Preliminary Concepts
RDF and SPARQL. The Resource Description Framework (RDF)⁵ is a formal language for describing structured information. To uniquely identify resources, RDF uses internationalized resource identifiers (IRIs). RDF expresses assertions over resources as subject-predicate-object triples, where subjects are resources or blank nodes, predicates are resources, and objects are resources, blank nodes, or literals (data values). Blank nodes are used to represent resources without an IRI, typically with a structural function, for example, to group a set of statements. A set of RDF triples can be seen as a directed graph where subjects and objects are nodes, and predicates are arcs. A set of reserved words, called RDF Schema (RDFS),⁶ is used to define properties and represent relationships between resources, adding semantics to the terms in a vocabulary. Examples are rdf:type, rdfs:Class, rdf:Property (which denotes the class of all properties), rdfs:subClassOf, and rdfs:subPropertyOf. Finally, in this paper we use Turtle⁷ to represent RDF graphs. For example, information about an employee is expressed in Turtle as follows (a stands for rdf:type):

    ex:iri ex:hasEmployee ex:employee1 .
    ex:employee1 a ex:employee ;
        ex:firstName "Nancy" ;
        ex:lastName "Davolio" ;
        ex:hireDate "1992-05-01" .
In this example, ‘ex:’ is a namespace prefix whose declaration we omit for the sake of space. Finally, blank nodes are represented either explicitly with a blank node identifier, or with no name using square brackets. For example, the following triples state that ex:employee1 has a supervisor who is an employee called Andrew Fuller.

    ex:employee1 a ex:employee ;
        ex:supervisor [ a ex:employee ; ex:firstName "Andrew" ; ex:lastName "Fuller" ] .

⁴ http://www.w3.org/TR/r2rml/
⁵ http://www.w3.org/TR/rdf11-concepts/
⁶ http://www.w3.org/TR/rdf-schema/
⁷ http://www.w3.org/TR/turtle/
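The view of a set of triples as a directed graph can be made concrete with a small sketch. The Python code below is only an illustration under our own assumptions: it is not an RDF library, it stores literals as quoted strings, and the blank node label "_:b0" is an arbitrary identifier chosen here for the unnamed supervisor. The helper names build_graph and objects are hypothetical.

```python
from collections import defaultdict

# Triples taken from the two Turtle examples in this section. Literals are
# kept as quoted strings; "_:b0" is an arbitrary label for the blank node.
TRIPLES = [
    ("ex:iri", "ex:hasEmployee", "ex:employee1"),
    ("ex:employee1", "rdf:type", "ex:employee"),
    ("ex:employee1", "ex:firstName", '"Nancy"'),
    ("ex:employee1", "ex:lastName", '"Davolio"'),
    ("ex:employee1", "ex:hireDate", '"1992-05-01"'),
    ("ex:employee1", "ex:supervisor", "_:b0"),
    ("_:b0", "rdf:type", "ex:employee"),
    ("_:b0", "ex:firstName", '"Andrew"'),
    ("_:b0", "ex:lastName", '"Fuller"'),
]


def build_graph(triples):
    """Index triples by subject: each subject node maps to its outgoing arcs."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def objects(graph, subject, predicate):
    """Return the nodes reached from `subject` by following `predicate` arcs."""
    return [obj for pred, obj in graph.get(subject, []) if pred == predicate]


graph = build_graph(TRIPLES)
# Follow the ex:supervisor arc to the blank node, then read its last name.
supervisor = objects(graph, "ex:employee1", "ex:supervisor")[0]
print(objects(graph, supervisor, "ex:lastName"))
```

The blank node behaves like any other node of the graph; it simply has no IRI, so it can only be reached by following an arc from a named resource.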
SPARQL⁸ is the standard query language for RDF. The SPARQL query below asks for the names and hire dates of employees.

    SELECT ?firstName ?lastName ?hireDate
    WHERE {
        ?emp a ex:employee ;
            ex:firstName ?firstName ;
            ex:lastName ?lastName ;
            ex:hireDate ?hireDate .
    }
The SELECT clause indicates the format of the result (‘?’ denotes a variable). The WHERE clause contains a graph pattern composed of four triples in Turtle notation. The query is evaluated by instantiating the variables and matching the query graph against the triples in the underlying RDF graph (if there is no FROM clause with named graphs, the graph is the default one). Relevant to OLAP, SPARQL provides the usual SQL aggregate and sorting functions, also using the GROUP BY, HAVING, and ORDER BY keywords. For the sake of space we omit SPARQL details, which can be found in the references.

R2RML Mapping. R2RML is a language for expressing mappings from relational databases to RDF datasets, which allows relational data to be represented in RDF using a customized structure and vocabulary. Both R2RML mapping documents (written in Turtle syntax) and mapping results are RDF graphs. The main object of an R2RML mapping is the triples map, which is a collection of triples composed of a logical table, a subject map, and one or more predicate-object maps. A logical table is either a base table or a view (using the predicate rr:tableName), or an SQL query (using the predicate rr:sqlQuery). A predicate-object map is composed of a predicate map and an object map. Subject maps, predicate maps, and object maps are either constants (rr:constant), column-based maps (rr:column), or template-based maps (rr:template). Templates use brace-enclosed column names as placeholders. Foreign keys are handled through referencing object maps, which use the subjects of another triples map as the objects generated by a predicate-object map. We show examples of R2RML mappings in Section 4.
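To give an intuition of how a triples map turns rows into triples, the following Python sketch mimics the three kinds of term maps described above. It is a simplified illustration and not an R2RML processor: the table rows, the column names (EmployeeID, FirstName, LastName), the IRI template, and the dictionary layout of EMPLOYEE_MAP are all assumptions made for this example. It also skips details of the actual specification, such as IRI-safe encoding of template values and term-type rules, and always renders column values as plain literals.

```python
import re

# Hypothetical rows of a hypothetical Employees table (names are assumptions).
ROWS = [
    {"EmployeeID": 1, "FirstName": "Nancy", "LastName": "Davolio"},
    {"EmployeeID": 2, "FirstName": "Andrew", "LastName": "Fuller"},
]

# A dictionary standing in for a triples map: one subject map plus a list of
# (predicate map, object map) pairs. Each term map is a (kind, value) pair.
EMPLOYEE_MAP = {
    "subject": ("template", "http://example.org/employee/{EmployeeID}"),
    "predicate_object": [
        (("constant", "ex:firstName"), ("column", "FirstName")),
        (("constant", "ex:lastName"), ("column", "LastName")),
    ],
}


def expand_template(template, row):
    """Replace every brace-enclosed column name with that column's value."""
    return re.sub(r"\{([^{}]+)\}", lambda m: str(row[m.group(1)]), template)


def generate_term(term_map, row):
    """Produce one RDF term from a constant, column, or template term map."""
    kind, value = term_map
    if kind == "constant":
        return value
    if kind == "column":
        return '"' + str(row[value]) + '"'  # simplification: always a literal
    if kind == "template":
        return "<" + expand_template(value, row) + ">"
    raise ValueError("unknown term map kind: " + kind)


def apply_triples_map(triples_map, rows):
    """Generate one triple per row and per predicate-object map."""
    triples = []
    for row in rows:
        subject = generate_term(triples_map["subject"], row)
        for predicate_map, object_map in triples_map["predicate_object"]:
            triples.append((subject,
                            generate_term(predicate_map, row),
                            generate_term(object_map, row)))
    return triples


for triple in apply_triples_map(EMPLOYEE_MAP, ROWS):
    print(" ".join(triple), ".")
```

Each row of the logical table yields one subject, and each predicate-object map contributes one triple for that subject, so the two rows and two predicate-object maps above produce four triples.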
3 RDF Representation of Multidimensional Data
The RDF data cube vocabulary,⁹ or QB, is used to publish statistical data in RDF. The QB4OLAP vocabulary¹⁰ extends QB to improve support for the multidimensional model, overcoming several limitations of QB [6]. Figure 1 depicts the QB4OLAP vocabulary, which embeds QB, allowing data cubes already published using QB to be represented using QB4OLAP without affecting existing applications. Original QB terms are prefixed with qb:. Capitalized terms represent RDF classes and noncapitalized terms represent RDF properties. Classes in external vocabularies are depicted in light

⁸ http://www.w3.org/TR/sparql11-query/
⁹ http://www.w3.org/TR/vocab-data-cube/
¹⁰ http://purl.org/qb4olap/cubes
[Figure 1: The QB4OLAP vocabulary (diagram of classes and properties).]