A Model for Linked Open Data Acquisition and SPARQL Query

A Model for Linked Open Data Acquisition and SPARQL Query Generation Céline Alec, Chantal Reynaud-Delaître, and Brigitte Safar LRI, Univ. Paris-Sud, CNRS, Université Paris-Saclay, Orsay, F-91405 {celine.alec,chantal.reynaud,brigitte.safar}@lri.fr

Abstract. Nowadays, Linked Open Data (LOD) represents a promising source for many applications of the Semantic Web. However, appropriate data acquisition techniques have to be developed to overcome both incompleteness and redundancy. Our work addresses this problem in the scenario of populating an OWL ontology with property assertions considering the existence of multiple, equivalent, multi-valued and missing properties in the LOD. Since the correspondences to be built are complex, we propose a model to specify them. We also define a model to specify alternative paths to access properties in case of missing values. We then show how these models are used to automatically generate SPARQL queries and thus, facilitate interrogation. A running example is given. Keywords: Linked data acquisition; Complex correspondences; SPARQL query generation.

1

Introduction

During the last years, new sources have emerged, like Linked Open Data (LOD), which are promising for many Semantic Web applications. However, LOD datasets reveal data quality problems such as incompleteness, redundancy, inconsistency [15]. Given these issues, specific acquisition techniques have to be developed. In this paper, we propose a model to support LOD data acquisition. Our work is done in the settings of an ontology-based method [1] to discover definitions of concepts based on the individuals with known property assertions, using machine learning techniques. The quality of results depends on the inputs, particularly property assertions characterizing individuals. Having complete information is essential in order to achieve the best results as output of the entire process. First, property assertions are extracted from a corpus of texts. Then, as texts that are considered are incomplete for some required property assertions, LOD datasets are exploited. The approach is applicable to various domains. For a given area, the designer selects the most relevant datasets for the involved properties and indicates the properties to be valued. The number of these properties is assumed not to be too large, so this task can be manually done. However, the ontology and the LOD datasets differ in regards to vocabulary and structure. Complex correspondences have to be defined. In this work, we propose a model to define them when datasets include redundant properties, missing values and multi-valuation. This is our first contribution.

2

Céline Alec, Chantal Reynaud-Delaître, and Brigitte Safar

This model is also a support to ease the access to LOD data. Automatic generation of SPARQL [6] queries can be conducted from this model. We assume that the structure of SPARQL queries can be determined by our acquisition model, in a domain independent way. This is our second contribution. Our work is innovative because complex processes needed to access information not explicitly represented in LOD datasets can be defined. Moreover, the querying part to get this information is completely transparent for the user. To our knowledge, those issues have not yet been addressed by the state-of-the-art. Section 2 presents an example of complex data motivating our work. Section 3 reviews related works. The model of acquisition is presented in Section 4. Section 5 explains how it supports automatic generation of SPARQL queries, for which a running example is given in Section 6. We conclude in Section 7.

2

A motivating example

Let us suppose that we are looking for assertions of the property temperature in the ontology myOnto. This property represents the average temperature characterizing the concept Weather. Individuals of Weather correspond to meteorological data of a given place for a given month. Thus, myOnto:WeatherJanuaryNaples is an individual of Weather corresponding to the meteorological data of Naples in January. This individual is characterized by the two following assertions: , . In DBpedia [2], several properties, e.g., dbp:janHighC, dbp:janLowC, dbp:janMeanC, etc, can be used to establish a correspondence with temperature. Each property concerns a particular month and its value may be expressed with particular measure units (Celsius, Fahrenheit). Furthermore, each property may be multi-valued. For example, dbp:janMeanC has two values for the DBpedia resource Naples: 8.1 (xsd:double) and 8.7 (xsd:double). To find assertions of the property temperature for myOnto, a correspondence with DBpedia properties has to be established. An example of the correspondence could be between: (i) the source property temperature with the constraint that its domain concerns January (via the property concernMonth) and (ii) a target expression Avg(Avg(janM eanC), Avg(Conv(janM eanF )), 1 1 2 (Avg(janHighC)+Avg(janLowC)), 2 (Avg(Conv(janHighF ))+Avg(Conv(janLowF ))).

Each Avg deals with multi-valuation by returning a unique value. This correspondence is complex because we have to deal with a source constraint (domain concerns January) and a target property resulting from many calculations and aggregations, therefore not explicitly represented in the dataset. Moreover, some places in DBpedia, like Alaska, have no valuation for any of the above properties. Nevertheless, alternative approximate values may be found, as the meteorological data about Juneau, the capital of Alaska. A close version of this example is used in section 6 to explain the automatic query generation.

3

Related Work

Our work is related to several research fields, such as RDF data mediation, question answering and ontology alignment with complex matches.

A Model for Linked Open Data Acquisition and SPARQL Query Generation

3

Some proposals have been provided for adopting query reformulation to achieve data mediation in the Linked Data cloud and produce flexible mediators. [3] proposes a graph pattern rewriting algorithm that exploits graph rewriting rules and ontology alignments in order to translate SPARQL queries to run the same query on different data sets. It focuses on data manipulation functions to handle information represented and aggregated in different ways such as different unit measures or different representations of properties (i.e., address). These data manipulations that we also consider in our work, remain actually quite simple. A generic method for SPARQL query rewriting is provided in [9]. A set of predefined mappings between ontology schemas represented in OWL is exploited. Note that, in this work, a mapping containing aggregates was considered as meaningless since the current SPARQL version could not represent them. Our work is related since constraints are represented in OWL too, but we consider more complex correspondences and, above all, aggregates. Query answering approaches have been proposed as intuitive ways of accessing RDF data [8]. Questions are translated into triples, which are matched against RDF data relying on some heuristics and similarity measures. Then triple patterns are executed as SPARQL queries. However, sometimes the translation process is not possible. The question answering system described in [14] focuses on complex situations with questions including specific determiners as more than or the most leading to HAVING or ORDER BY clauses in SPARQL. Their strength is to generate queries from SPARQL templates. We use this idea in our work too. In other proposals, such as [13], user queries are limited to keywords, which are mapped to IRIs. Then, graph patterns are generated and executed as SPARQL queries. Once again, we differ because, in addition, we aim at retrieving information that may not be explicitly represented. This implies that accessing first relevant data is needed to compute the required one. Relevant data may be useful in the calculation of the required data without being similar or equivalent to it. In that case, queries must often include sub-queries. Such complex queries can not be derived based on question analysis. Finally, several ontology alignment approaches have emerged in the literature in the last years [4] and some of them involve complex correspondences. In [11,12], correspondence patterns are presented as tools facilitating definition of complex correspondences between ontologies. They can be used to derive complex correspondences from simple previously calculated ones. Very often, techniques proposed for complex matchings exploit simple matches generated beforehand. This is the case, for instance, of the approach proposed by [10] for complex datatype properties from two classes, relying on an instance-based matching technique to identify simple matches. Finally, [5] defines complex correspondences for rewriting query patterns. Indeed query rewriting requires complex 1:n, n:1 or n:m correspondences involving equivalence, more general and more specif ic relations and expressed using constructors of a formal language. From these proposals, we see that complex correspondences are needed for query rewriting. The process to define them is complex, even if they are usually defined from simple ones. We differentiate ourselves on this point. A correspondence between a source entity and a target entity resulting from a complex process and

4


involving several aggregates applied in succession, cannot be derived from a simple correspondence previously acquired. Furthermore, entities involved in such complex treatments cannot be discovered by any existing alignment system.

4

A model for Linked Open Data acquisition

We consider a source ontology Os and a target ontology Ot . Os is an OWL ontology. Ot is an ontology providing an access to RDF sources, such as LOD datasets. Each ontology is defined as a tuple (C, P, I, A) where C is a set of concepts, P a set of (datatype, object and annotation) properties characterizing concepts, I a set of individuals and property assertions, and A a set of axioms. This section defines a model to acquire Os property values from a LOD dataset. This task requires: (1) to find the individual it from Ot corresponding to the individual is from Os to be characterized, (2) to find the properties from Ot , called target properties, corresponding to a property from Os that needs to be valued, called source property. This paper is limited to the second point. It is assumed that a preliminary step resulted in a set Inds,t of couples (is ,it ) where an individual is from the source ontology corresponds to an individual it from the target ontology. In the following, we present a model to acquire LOD property values composed of: (1) a model establishing correspondences between a source property and a target property, and (2) a model defining paths to access property values. 4.1

Model of correspondences

The model of correspondences defines correspondences (P Es , P Et ) between a source Property Expression (P Es ) and a target Property Expression (P Et ). A correspondence applies for each couple (is , it ) ∈ Inds,t such as: is ∈ domain(P Es ) and it ∈ domain(P Et ). Definition 1. A property expression in Os (P Es ) is an object property (op) or a datatype property (dp), possibly with a domain restriction expressed in OWL-DL noted P Es .Restrict(d). P Es = op | dp | P Es .Restrict(d) Example 1. The property temperature.(concernMonth VALUE January) is a P Es . It concerns the datatype property temperature whose domain (Weather) is constrained by (concernMonth VALUE January) expressed in Manchester syntax [7]. This P Es only characterizes is corresponding to January meteorological data. Now, we define the notion of property expression in Ot (P Et ), starting with the notion of elementary property on which the definition of P Et is based. Definition 2. An elementary property pe is a property in Ot or its inverse. pe = op | dp | op−1 Definition 3. A property expression in Ot (P Et ) is an elementary property (pe ) in Ot , or an expression (f ) using one or several property expressions in Ot . A P Et may include domain constraints (P Et .Constr(d)), range constraints


5

(P Et .Constr(r)) or constraints on an element involved in a range constraint (P Et .Constr(f (Constr(r)))). P Et = pe | f(P Et ) | f(P Et , P Et ) | P Et .Constr(d) | P Et .Constr(r) | P Et .Constr(f (Constr(r))) By recursion, a P Et can be a function of n P Et . For example, P Et1 = f (P Et2 , f (P Et3 , P Et4 )) = f (P Et2 , P Et3 , P Et4 ). The function f is specified by the designer. When f is unary, he has to indicate whether it is an operation of aggregation or an operation of transformation and to clarify its nature. For an operation of aggregation, he must choose between a counting operation, a sum, minimum, maximum, average, etc. For an operation that transforms values, the choice is between a mathematical calculation, a calculation of the length of a string, a change from uppercase to lowercase or vice versa, the calculation of the absolute value of a number, etc. If f is n-ary, the designer must indicate whether the function corresponds to a set-theoretic operation (union or difference) or an operation of transformation (mathematical calculation, concatenation, etc.). Constr is any constraint expressible as a SPARQL graph pattern (possibly with a filter constraint). Constr(f (Constr(r))) is a constraint on the aggregation (f ) of a variable involved in a range constraint defined beforehand (cf example 3). Examples of target properties given in the following are extracted from DBpedia. The valuation of a P Et is defined as follows: val(P Et ) = val(pe ) | f (val(P Et )) | f (val(P Et ), val(P Et )) | val(P Et .Constr(d)) | val(P Et .Constr(r)) | val(P Et .Constr(f (Constr(r)))) Example 2. Avg(janPrecipitationMm, janRainMm) is a P Et whose value is the average of all values from the union of janPrecipitationMm and janRainMm. To make sure this average does not take into account negative values, a range constraint might be introduced to reject negative values, e.g., FILTER(r >=0). Example 3. artist is a DBpedia property whose domain is MusicalWork and range is Agent. Thereby, artist−1 is a pe (by extension a P Et ), which, combined with the range constraints r rdf:type dbo:Album and r dbo:genre ?gen, corresponds to album names (with a genre). Adding the constraint COUNT(DISTINCT ?gen)>=3 on the variable ?gen involved within the constraints mentioned above, requires that there are at least three different genres. In summary, such a model is interesting because it specifies correspondences with P Et that may have a very complex definition. However, this model alone is not enough to deal with missing values in the LOD datasets. The next section presents a path model to address this issue. 4.2

Path Model

Accessing paths for a P Et The model of correspondences requires an access to values of P Et . When a P Et has a value, it can be directly accessed without any difficulties. When this is not the case, we propose an alternative solution because we are in a context where values for P Et are essential, even if it is an approximation. Thus, when a P Et of an individual it is not valued, we look for

6


its value(s) for other individuals than it . For example, if we do not find the value of the mean temperature in Alaska, we can search it for Juneau, its capital. This value will be an approximation of what we are looking for. The access to this P Et for it is not direct since another individual than it has to be reached first. This led us to define two types of access to a P Et , a direct access and a composed access. We give in this section the definition of each access type. It is based on the notion of property path defined w.r.t. elementary property paths. Definition 4. An elementary nth -order property path (n ≥ 1), pathelem , is a n composition of object properties (p1 .p2 ...pn ). For a target individual it , it allows an access to a set of target individuals It in Ot . The access to a set of target individuals using an elementary path is explained by properties composing the path that could be multivalued. Example 4. In Figure 1, country.capital is an elementary 2nd -order property path (p1 .p2 ). Applied to the individual Cephalonia in DBpedia, it accesses to the set It = {Athens}. capital is an elementary 1st -order property path (p1 ). Applied to the individual Sri_Lanka in DBpedia, it accesses to the set It = {Colombo; Sri_Jayawardenepura_Kotte}, because the property capital of Sri_Lanka is multivalued. Definition 5. An nth -order property path (n ≥ 1) is an elementary path or a set of elementary paths, possibly with a constraint on It and/or it . ) | pathn .Constr(it ) | pathn .Constr(It ) | Set(pathelem pathn = pathelem n n dbp:janPrecipitationMm

Abu_Dhabi

dbo:capital Alaska dbo:country Cephalonia

Greece dbp:capital

Sri_Lanka

dbp:capital

7 (xsd:integer)

dbp:janPrecipitationInch Juneau, 5.35 (xsd:double) _Alaska 7.98 (xsd:double) dbo:capital dbp:janRainMm Athens

56.9 (xsd:double)

dbp:janPrecipitationMm 58.2 (xsd:double) Colombo

dbp:janPrecipitationMm Sri_Jayawardenepura_Kotte 85 (xsd:integer)

Fig. 1. Examples of paths to access to precipitation in DBpedia

Definition 6. The access to a P Et is composed if it is accessed by an nth -order property path (n≥1), direct otherwise. Example 5. Let P Et = Avg(janPrecipitationMm) correspond to the average of values of janPrecipitationMm. Its access is direct for Abu_Dhabi and composed for Sri_Lanka (cf. Figure 1). As said in Definition 5, a path can be a set of elementary paths. This is most likely to occur when redundant properties belong to the path. For example, part, isP artOf −1 and p are three ways to express the notion of subpart in DBpedia.


7

Moreover, constraints on it and/or It are possible. A path constraint may concern the initial individual or the set of individuals accessed by paths from it. Example 6. Suppose we want to know the country of origin of a film. The P Et will be the union of the corresponding properties in DBpedia, e.g., country and nationality. In case of missing values, the nationality of the director will be a good approximation. Hence, Windy_City_Heat is a film without any country of origin in DBpedia. Nevertheless, knowing that Windy_City_Heat.director.nationality = Americans, its American origin can be deduced. Constraint on it : If the designer believes that, with the globalization, many directors now realize films outside their own countries, he can add a constraint expressing that this approximation is valid only if the film is not recent (e.g., before 2000). This constraint is about the film (it ), and can be expressed by the following graph pattern: it dbp:released ?year. FILTER (?year

A Model for Linked Open Data Acquisition and SPARQL Query

A Model for Linked Open Data Acquisition and SPARQL Query

Suggest Documents

Linked Data Query Wizard: A Novel Interface for Accessing SPARQL

Federated Data Management and Query Optimization for Linked Open ...

SPARQL Query Rewriting for Implementing Data ... - CiteSeerX

A Model for SPARQL Query Processing, optimization and ... - IJRET

SPARQL-LD: A SPARQL Extension for Fetching and Querying Linked ...

Analytics for Citizens: a Linked Open Data Model for ...

PigSPARQL: A SPARQL Query Processing Baseline for Big Data

SPARQL SAHA, a Configurable Linked Data Editor and ... - ESWC 2014

Open linked data model revelation and access for analytical web ...

Open Linked Data and Europeana

Detecting SPARQL Query Templates for Data Prefetching - European ...

Dynamic Linked Data: A SPARQL Event Processing Architecture - MDPI

Linked Open Data Contest

Linked Open Data Contest

MODEL: A SOFTWARE SUITE FOR DATA ACQUISITION

Towards Query Optimization for SPARQL Property Paths

QFed: Query Set For Federated SPARQL Query Benchmark - AKSW

SparqPlug: Generating Linked Data from Legacy HTML, SPARQL and

A Drag-and-block Approach for Linked Open Data Exploration

Acquisition, Representation, Query and Analysis of Spatial Data: A ...

Linked Open Graph: browsing multiple SPARQL entry points to build ...

Linked Data Query Processing Strategies Technical Report

Linked Open Data and Web Corpus Data for ... - LREC Conferences

Linked Data Query Processing Strategies Technical Report