Processing SPARQL Queries Over Linked Data— A Distributed Graph-based Approach Peng Peng1 , Lei Zou1 , M. Tamer Özsu2 , Lei Chen3 , Dongyan Zhao1 1
Peking University, China; University of Waterloo, Canada; 3 Hong Kong University of Science and Technology, China; 2
arXiv:1411.6763v1 [cs.DB] 25 Nov 2014
{ pku09pp,zoulei,zhaodongyan}@pku.edu.cn,
[email protected],
[email protected]
ABSTRACT
different than graph subgraph matching in general graph processing systems since these depend on graph isomorphism while SPARQL query semantics are based on graph homomorphism. There are several ways to address this issue [14]. One method is analogous to data warehousing where all of the data is downloaded into a single RDF database and queries are run over it using centralized techniques, such as Jena [32], RDF-3x [24], SW-store [1] and gStore [36]. This is a reasonable approach, but at web scale, the feasibility of these approaches is a serious concern as is the freshness of query results. A second approach that favors query result freshness is live querying [13]. In this approach, one starts with a set of seed LOD documents and follows the RDF links inside these until no further documents are reachable. These approaches provide up-to-date results, but have high latency that is hard to optimize. The third alternative is to treat the interconnected RDF repositories as a virtually integrated distributed database, where each individual RDF repository is regarded as a fragment of this database. Each fragment Fi resides at its own site S i and each query is executed at the relevant sites. Within this context, there are two approaches. One is to decompose queries and evaluate each subquery at the appropriate sites. The query decomposition strategy requires having full knowledge about the underlying schema of each fragment Fi , which is not always feasible in RDF datasets since their fundamental characteristic is that they are self-describing without requiring an explicit schema. This is the approach followed by most existing works [18, 28, 27]. For example, DARQ [27] needs the service description that describes which triple patterns can be answered in each site, based on which, the SPARQL query is decomposed into several subqueries. The alternative approach is to use the “partial evaluation and assembly” strategy [17]. The basic idea of the partial evaluation is the following: given a function f (s, d), where s is the known input and d is yet unavailable input, the part of f ’s computation that depends only on s generates a partial answer [8]. In our setting, each site S i treats fragment Fi as the known input in the partial evaluation stage. The partial evaluation technique has been used in compiler optimization [17], querying XML trees [4], and evaluating reachability queries [8] and graph simulation [23, 9] over graphs. In this paper, we follow the partial evaluation and assembly strategy to evaluate SPARQL queries over a distributed and interconnected set of RDF repositories. Because of interconnections between datasets, processing of queries requires special care. For example, consider the four RDF datasets at different sites in LOD shown in Figure 1 and its graph representation in Figure 2. Each entity in RDF is represented by a URI (uniform resource identifier), the prefix of which always denotes the location of the dataset. For example, “s1:dir1” has the prefix “s1”, meaning that the entity is located at site s1. There are cross-
We propose techniques for processing SPARQL queries over linked data. We follow a graph-based approach where answering a query Q is equivalent to finding its matches over a distributed RDF data graph G. We adopt a “partial evaluation and assembly” framework. Partial evaluation results of query Q over each repository—called local partial match—are found. In the assembly stage, we propose a centralized and a distributed assembly strategy. We analyze our algorithms both theoretically and the experimentally. Extensive experiments over both real and benchmark RDF repositories with billion triples demonstrate the high performance and scalability of our methods compared with that of the existing solutions.
1.
INTRODUCTION
The semantic web data model, called the “Resource Description Framework”, or RDF, represents data as a collection of triples of the form hsubject, property, objecti. A triple can be naturally seen as a pair of entities connected by a named relationship or an entity associated with a named attribute value. Linked Open Data (LOD), which is the focus of this paper, uses RDF to link web objects through their Uniform Resource Identifiers. LOD forms a directed graph where each vertex represents a Web document that contains data in the form of a set of RDF triples. These documents can be retrieved by looking up HTTP-scheme based URIs. A pair of document vertices is connected by a directed edge if any of the RDF triples in the source document contains a URI in its subject, predicate, or object position such that a lookup of this URI results in retrieving the target document. LOD forms a federated (or virtually integrated) database with the documents stored at autonomous sites some of which have the ability to run SPARQL queries over their own data (these are called “SPARQL endpoints”). In this paper we address the issue of executing SPARQL queries over this type of federated RDF repositories. The approach we take to evaluating SPARQL queries is graph-based, which is consistent with RDF and SPARQL specifications. Consequently, answering a SPARQL query Q is equivalent to finding subgraph matches of query graph Q over an RDF graph G, as employed, for example, in the gStore system [36]. We note that the problem in this context is
1
(a) RDF repository F1 Subject s1:dir1 s1:dir1 s1:dir1 s1:dir1 s3:act2 s1:act4 s1:act4 s1:fil1 s1:fil2 s1:fil5 s1:fil7 s3:act3 s2:act1 s2:act1
Property directed directed hasWonPrize rdfs:label actedIn rdfs:label actedIn rdfs:label rdfs:label rdfs:label rdfs:label actedIn actedIn isMarriedTo
(b) RDF repository F2
Object s1:fil1 s1:fil2 s3:awa1 Woody Allen s1:fil5 Hank Azaria s1:fil7 Sleeper Small Time Crooks A Very Young Lady Mystery Men s1:fil2 s1:fil7 s1:dir1
Subject s2:act1 s2:act1 s2:act1 s2:act1 s2:act1 s2:act1 s4:dir2 s2:fil3 s2:fil4 s2:pla1
Property actedIn actedIn actedIn isMarriedTo livesIn rdfs:label livesIn rdfs:label rdfs:label rdfs:label
(c) RDF repository F3
Object s2:fil3 s2:fil4 s1:fil7 s1:dir1 s2:pla1 Louise Lasser s2:pla1 Mary Hartman Slither New York
Subject s3:act2 s3:act3 s3:act3 s3:act3 s3:act2 s3:awa1 s3:act2
Property rdfs:label actedIn hasWonPrize rdfs:label isMarriedTo rdfs:label actedIn
Object Nancy Kelly s1:fil2 s3:awa1 Hugh Grant s4:dir2 Cesar Award s1:fil5
(d) RDF repository F4 Subject s3:act2 s4:dir2 s4:dir2 s4:dir2 s4:fil6
Property isMarriedTo directed livesIn rdfs:label rdfs:label
Object s4:dir2 s4:fil6 s2:pla1 Edmond O’Brien Man-Trap
Figure 1: Sample RDF Datasets in LOD 029 Hank Azaria 014 Sleeper
rdfs:label
028
027 Mystery Men
s1:act4
Table0091: Sample RDF Datasets in LOD
001, 002, 009 and 018 is a crossing match between fragments F1 and Fqueries 2 (shown in the shaded vertices and red edges). 2 in Figure and propose an efficient algorithm to find them. We This is the focus of thisthepaper. prove theoretically correctness and the optimality of our
s2:fil3
rdfs:label
rdfs:label actedIn that yet unavailable input,actedIn the part of f ’s 002 computation depends on016 rdfs:label Woody Allen answer [13]. In our setting, each 018 site 005 s1:fil7 007 ly on s generates a partial actedIn s2:act1 Mary Hartman s1:fil1 S i treats fragment Fi as the known input in the partial evaluation rdfs:label ismarriedTo stage. The partial evaluation technique hasrdfs:label been used in many ardirected actedIn 015 livesInquerying 001 eas, such as compiler optimisation [19], XML010trees [9] 017 Small Time Crooks s1:dir1 queries over graphs [13].013 and reachability In Louise this Lasser paper, we follow s4:fil4 026 A Veryand Young Lady the partial assembly strategy and directed evaluation s2:pla1 over a distributed rdfs:label rdfs:label interconnected set of RDF repositories. rdfs:label 019 008 011 Q is equivalent Slither rdfs:label Answering a SPARQL query to finding subgraph 020 hasWonPrize s1:fil2 s1:fil5 matches of Q over an RDF graph G [31]. FiguresNew1 York and 2 show an example RDF graph and a query graph,livesIn respectively. A subgraph match canFbe completely included inFragment a fragment,Fwhich we call an Fragment 1 at Site S1 actedIn 2 at Site S 2 inner match. These inner matches can be found locally by existing 003 004at each site. The key issue in the distributed centralized techniques s3:awa1 s4:dir2 s3:act2 ismarriedTo actedIn 012 is how to find subgraph matches that cross multiple environment rdfs:label fragments—these are called as crossing matches. This 021 is the focus hasWonPrize Edmond O'Brien of this paper. For example, rdfs:label given a SPARQL query Q in Figure 006 2, the rdfs:label subgraph induced by vertices 014,directed 007, 001, 002, 009 and 022 023 s3:act3 018 is a crossing match between fragments F 1 and FMan-Trap 2 in Figure 1 Nancy Kelly (shown in the shaded vertices and red edges). rdfs:label 024 ThereCesar areAward two important issues to be012addressed in the partial erdfs:label s4:fil6 025 valuation and assembly framework. The first one is what is the Hugh Grant partial evaluation result in answering SPARQL queries. We refer to thisFas the local partial match (see Definition 4.1), Swhich Fragment Fragment F4 at Site 3 at Site S3 4 is intuitively the overlapping part between a crossing match and a fragment. The second one is how to assemble these local partial Figure 2: An Integrated Distributed Graph matches to form crossing matches. In this work, we consider two different strategies. One is the centralized assembly, where all local partial matches are sent to a single site. The other one is to ing links between twolocal datasets font. For example, assemble the partialidentified matches atina bold number of sites in parallel, “hs2:act1called isMarriedTo s1:dir1i” a crossing link (links we between distributed assembly.is For distributed assembly, design different adatasets), which that act1algorithm. (at site s2) married to synchronous and means an asynchronous Weisexperimentally study dir1 (at site s1). the performance of alternatives in different settings. The experimental are useful for designing an that efficient SPARQL Now consider the results following SPARQL query Q consists of engine over ?a distributed RDF repositories. five triplequery patterns (e.g., isMarriedTo ?d) over these four interTo summarize, we make the following contributions.
algorithm.
?d directed
?f2 rdfs:label
v6 ?n2
3. We propose centralized and distributed methods to assemble the local isMarriedTo partial matches. We experimentally study their perv3 v5 v1 formance advantages and bottlenecks in different settings.
?a
actedIn
?f1 rdfs:label ?n1
4. We compare our method with the state-of-the-art distributed RDF systems over 3: several real andQuery benchmark RDFQreposiFigure SPARQL Graph tories with billion triples. The extensive experiment results confirm the advantages of our approach.
There are two important issues to be addressed in this framework. The first is to compute the partial evaluation results at each site given a query graph Q (i.e., the local partial match), which is intuitively the overlapping part between a crossing match and a fragment. This is discussed in Section 4. The second one is the as+DQN$]DULD 0\VWHU\0HQ VILO UGIVODEHO sembly6OHHSHU of these local partial matches to compute crossing matches. UGIVODEHO UGIVODEHO VDFW DFWHG,Q DFWHG,Q strategies: centralized assembly, where We consider two different UGIVODEHO :RRG\ $OOHQ VILO DFWHG,Q VDFW all local 5.2); and 0DU\ +DUWPDQ VILOpartial matches are sent to a single site (Section UGIVODEHO LVPDUULHG7R distributedGLUHFWHG assembly, where the local partialUGIVODEHO matches are assembled DFWHG,Q OLYHV,Q at6PDOO a number 7LPH &URRNV of sites in parallel (Section 5.3). VILO VGLU /RXLVH /DVVHU This is the first work that considers the LOD repositories (or sim$9HU\