Digital Enterprise Research Institute
www.deri.ie
Linked Data and Live Querying for Enabling Support Platforms for Web Dataspaces
Jürgen Umbrich¹, Marcel Karnstedt¹, Josiane Xavier Parreira¹, Axel Polleres², Manfred Hauswirth¹
¹DERI, National University of Ireland, Galway, Ireland
²Siemens AG Österreich, Vienna, Austria
© Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
Outline
• Web as a set of interlinked Web dataspaces
• Enabling DSSP for Web dataspaces
  - Linked Data
  - Missing components
  - Challenges
• Efficient query processing
  - Challenges
  - Index consistency study
  - Hybrid query processing mechanism
The World Wide Web: Web of Documents
[Figure: documents (HTML, CSV, CSS, PNG, PDF) connected over HTTP by href, rel, and img links]
• Unstructured
• Heterogeneous
• Data integration mostly manual
The World Wide Web: Web of Data
[Figure: the same documents (HTML, CSV, CSS, PNG, PDF) plus RDF sources, connected over HTTP by href, rel, and img links]
• Standards
• URIs as identifiers
• Typed links
• Web as a heterogeneous distributed DB
Dataspace Support Platforms
• Data management for small-scale, loosely connected, heterogeneous sources
• Services hide the complexity of data management
[Franklin 2005]
Two directions, similar goals
[Figure: Web documents (CSV, HTML, CSS, PNG, PDF) and RDF sources]
• Web of Data: web-scale heterogeneous distributed database
• Dataspace: data management for small-scale, loosely connected, heterogeneous sources
Proposed Solution
[Figure: Web documents (CSV, HTML, CSS, PNG, PDF) and RDF sources, labelled Web of Data and Dataspace]
• Linked Data for enabling support platforms for Web dataspaces
Web Dataspaces and support platforms
[Figure: Web dataspace diagram with the following labels]
• Standards: RDF, SPARQL, OWL, RDFa, RIF, SKOS, GRDDL
• Sources and access: RDF, CSV, HTML, PDF, HTTP, REST API
• Services: search, catalogs, query, indexes, discovery, enhancement
• Characteristics: no guarantees, no central control, dynamic, incomplete knowledge, heterogeneous administration
DSSP -> Linked Data
• Participants/Relationships -> Resources/Links
• XML for interchanging data -> RDF
• Standardised access method, common query language -> HTTP/SPARQL
• Global keys -> URIs
• Discovery -> crawling/reasoning
• Integration of other dataspaces -> entity recognition, ontologies
Open Challenges
• Graph-based data model
  - Must scale to the size of the Web
  - Efficient processing methods (index, query)
• Search and query
  - Structured queries with keyword search
  - Ranking (different levels, typed links, trust, etc.)
  - Guarantees: full guarantees are not possible; an assessment of possible guarantees is needed
Query Processing
• Catalogs for query planning/processing
  - Key component of a DSSP
  - Linked Data: vocabularies and metadata descriptions as catalogs
  - Complete Web catalogs not feasible: scale and dynamics
  - Indexing also affected by dynamics
• Distributed query processing approaches
  - Work for a small number of large repositories
  - Web of Data: large number of small repositories
Query Processing
• Alternative approach: "live" querying
  - Link traversal query approaches
  - Exploit Linked Data principles (dereferenceable URIs)
  - Guarantee "live" results
  - Query time in the range of seconds
• Our vision: hybrid query processing
  - Combine offline (static) and online (dynamic) processing
  - Trade-off between performance, completeness, and freshness
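A minimal sketch of the link traversal idea may help here. It is not the authors' engine: the `WEB` dictionary, the example URIs, and the functions `dereference` and `link_traversal` are all hypothetical, and the dictionary lookup merely stands in for an HTTP GET on a dereferenceable URI.

```python
from collections import deque

# Hypothetical in-memory stand-in for the Web: dereferencing a URI returns
# the RDF triples published at that URI. A real engine would issue HTTP GETs.
WEB = {
    "http://example.org/alice": [
        ("http://example.org/alice", "foaf:name", "Alice"),
        ("http://example.org/alice", "foaf:knows", "http://example.org/bob"),
    ],
    "http://example.org/bob": [
        ("http://example.org/bob", "foaf:name", "Bob"),
        ("http://example.org/bob", "foaf:knows", "http://example.org/carol"),
    ],
    "http://example.org/carol": [
        ("http://example.org/carol", "foaf:name", "Carol"),
    ],
}


def dereference(uri):
    """Return the triples served at `uri` (empty if the URI does not resolve)."""
    return WEB.get(uri, [])


def link_traversal(seed_uri, predicate, max_documents=10):
    """Collect (subject, object) pairs for `predicate`, following URIs found in the data."""
    queue = deque([seed_uri])
    visited = set()
    matches = []
    while queue and len(visited) < max_documents:
        uri = queue.popleft()
        if uri in visited:
            continue
        visited.add(uri)
        for s, p, o in dereference(uri):
            if p == predicate:
                matches.append((s, o))
            # Any object that looks like a URI is treated as a link to follow.
            if isinstance(o, str) and o.startswith("http://"):
                queue.append(o)
    return matches


names = link_traversal("http://example.org/alice", "foaf:name")
```

Because every answer comes from a document fetched at query time, results are "live", but each followed link costs a round trip, which is consistent with query times in the range of seconds. The `max_documents` bound is an assumption added to keep the traversal finite.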
Index Consistency Study
• Two Linked Data Web indexes (SPARQL endpoints)
  - Sindice (RDF, RDFa, Microformats; ~20 billion triples)
  - Openlink (LOD cache; ~20 billion triples)
• 16,616 distinct entity queries
  - Sampled from the BTC 2011 dataset
• Number of entities found and execution time

                   Web        Sindice    Openlink
Entities found     16616      5007       13096
Avg. query time    3261 ms    136 ms     86 ms
Index Consistency Study
Web Recall: % of Web results found in the endpoints
• Openlink: consistent information for 50% of the entities
• Sindice: consistent information for 30% of the entities
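The Web recall measure can be sketched in a few lines. This is an illustrative reading of the definition on the slide, not the study's actual code: the function name `web_recall` and the sample triples are hypothetical, and results are compared as exact triples.

```python
def web_recall(web_results, endpoint_results):
    """Fraction of the triples retrieved live from the Web that the endpoint also returns.

    Returns None when the live lookup produced no results, since recall
    is undefined in that case.
    """
    web = set(web_results)
    if not web:
        return None
    return len(web & set(endpoint_results)) / len(web)


# Hypothetical results for one entity query.
live = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "ex:currentLocation", "Galway"),
    ("ex:alice", "foaf:homepage", "http://example.org/alice"),
]
indexed = [
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "ex:currentLocation", "Vienna"),  # stale value, not counted
]

recall = web_recall(live, indexed)
```

In this example the endpoint returns two of the four live triples, so the recall is 0.5; the stale location triple does not count because it no longer appears on the Web.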
Hybrid Query Model
[Figure: hybrid query engine. A SPARQL query enters the query planner, which sends (sub)queries to a live query interface over the Linked Data Web and to an index interface over repositories, then returns the query results.]
• Live query interface: guarantees fresh results
• Index interface: provides fast query times
• Query planning guided by knowledge of dynamics
Query planning
• Knowledge of dynamics
  - Mining and statistical approaches
• Query planner
  - Incorporate dynamics as a cost factor
  - Latency and availability of sources?
  - Selectivity based on statistics or rules?
• Query execution
  - Split query into static and dynamic parts
  - Only update potentially outdated results
  - Consider user requirements (freshness vs. speed)
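One way to picture the static/dynamic split is sketched below. This is an assumption-laden illustration rather than the planner proposed in the talk: the `TriplePattern` class, the `CHANGE_PROBABILITY` table, its values, and the `freshness_threshold` parameter are all hypothetical stand-ins for mined knowledge of dynamics and user requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriplePattern:
    subject: str
    predicate: str
    obj: str


# Hypothetical knowledge of dynamics: estimated probability that data for a
# predicate has changed since it was indexed (e.g. mined from past crawls).
CHANGE_PROBABILITY = {
    "foaf:name": 0.01,
    "ex:currentLocation": 0.9,
    "ex:stockPrice": 0.99,
}


def split_query(patterns, freshness_threshold=0.5, default_probability=1.0):
    """Split triple patterns into a static part (index) and a dynamic part (live).

    A lower threshold favours freshness, a higher one favours speed.
    Predicates without an estimate default to `default_probability`.
    """
    static_part, dynamic_part = [], []
    for pattern in patterns:
        probability = CHANGE_PROBABILITY.get(pattern.predicate, default_probability)
        if probability > freshness_threshold:
            dynamic_part.append(pattern)
        else:
            static_part.append(pattern)
    return static_part, dynamic_part


query = [
    TriplePattern("?person", "foaf:name", "?name"),
    TriplePattern("?person", "ex:currentLocation", "?place"),
]
static_part, dynamic_part = split_query(query)
```

With the default threshold, the name pattern is answered from the index and the location pattern is fetched live. Raising `freshness_threshold` moves more patterns to the index, trading freshness for speed.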
Conclusion
• Enabling DSSP for Web dataspaces via Linked Data
  - Common data representation
  - Standard access methods, global keys
  - Still open challenges (e.g. search and query)
• Study shows that repositories lack completeness and freshness
• Hybrid query processing
  - Combine offline (static) and online (dynamic) processing
  - Trade-off between performance, completeness, and freshness