Digital Enterprise Research Institute
www.deri.ie
Part II: Linked Stream Data Processing
Building a Processing Engine
Danh Le-Phuoc, Josiane X. Parreira, and Manfred Hauswirth
DERI - National University of Ireland, Galway
Reasoning Web Summer School 2012
© Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
Enabling networked knowledge

Outline
• Part I: Basic Concepts & Modeling (Josi)
  – Linked Stream Data
  – Data models
  – Query Languages and Operators
  – Choices/Challenges when designing a Linked Stream Data processor
• Part II: Building a Linked Stream Processing Engine (Danh)
  – Analysis of available Linked Stream Processing Engines
    – Design choices, implementation
    – Performance comparison
    – Open Challenges

“Why should I care?”
• As an application developer
  – What can I do with it?
  – How does it work?
  – How do I choose one?
• As a processing engine developer
  – How do I build one?
  – How do I build or identify a better one?
• As a researcher
  – What has been done? What is left?
  – Are there any interesting research problems?
  – Where is there room for improvement?

Why is a continuous query processing engine needed?
Separation of concerns:
• Focus on the application logic
• Let the experts deal with data operations on streams
• Minimize the learning effort
  – Learn a simple API for using the engine
  – Learn a simple query language

How is a stream-based application built with a stream processing engine?
① Initialize the engine (fewer than 5 lines of code)
② Write the queries and register them with the engine (1 line per query)
③ Write code to wire the output streams to the application logic (this depends on the application logic, but each wiring snippet is ≈5 lines of code)
④ Connect the input streams to the engine (1-3 lines per stream)
Just 10-20 lines of code!
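The four steps can be sketched as follows. This is only an illustration: `ToyEngine` and its methods are hypothetical stand-ins written for this example, not the API of CQELS, C-SPARQL or any other engine, and a "query" is reduced to a filter predicate.

```python
from typing import Callable, Dict, List


class ToyEngine:
    """Hypothetical stand-in for a stream processing engine (not a real API)."""

    def __init__(self) -> None:
        self._queries: Dict[str, Callable[[dict], bool]] = {}
        self._listeners: Dict[str, List[Callable[[dict], None]]] = {}

    def register_query(self, name: str, predicate: Callable[[dict], bool]) -> None:
        # In this toy, a continuous query is just a filter over stream elements.
        self._queries[name] = predicate
        self._listeners[name] = []

    def on_result(self, name: str, callback: Callable[[dict], None]) -> None:
        # Wire the output stream of a query to application code.
        self._listeners[name].append(callback)

    def push(self, element: dict) -> None:
        # Push-based input: every new element is evaluated against all queries.
        for name, predicate in self._queries.items():
            if predicate(element):
                for callback in self._listeners[name]:
                    callback(element)


# Step 1: initialize the engine.
engine = ToyEngine()
# Step 2: register a query (one line per query).
engine.register_query("hot", lambda e: e["temp"] > 30)
# Step 3: wire the output stream to the application logic.
alerts: List[dict] = []
engine.on_result("hot", alerts.append)
# Step 4: connect an input stream to the engine.
for reading in [{"sensor": "s1", "temp": 25}, {"sensor": "s2", "temp": 35}]:
    engine.push(reading)

print(alerts)
```

The application code stays at roughly the size claimed on the slide because buffering, windowing and operator execution live inside the engine.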

What needs to be done to build a processing engine?
• Data model: relational, object-oriented, etc.
• Query model:
  – Logical operators: sliding windows, relational algebra
  – Query language: CQL, C-SPARQL, CQELS, etc.
• Build a processing engine
  – Handling input streams
  – Implement the execution engine
  – Schedule the executions
  – Optimization

Building blocks of a query processing engine
[Figure: a Query passes through the Optimizer to the Executor. Execution relies on operator implementations and on access methods over databases and data streams. The example plan joins (⋈) the windowed stream patterns [[P1, t]]S1 and [[P2, t]]S2 with the graph pattern [[P3, t]]G and produces the output stream Sout.]

Algorithms/technologies/solutions for stream processing
• Handling live and push-based data stream sources
  – Time management
  – Load shedding for bursty streams
• Operator implementation for the execution engine
  – Data structures and physical storage
  – Handling new and expired stream elements
  – Incremental execution
  – Memory overflow
• Optimization
• Scheduling

Handling input streams
• Unordered stream elements
• Load shedding for bursty rates
[Figure: sources φ1 … φn emit stream elements <v, τ> that travel over the network and may arrive out of order. An input manager buffers the incoming tuples; on a heartbeat τ, tuples with timestamps before τ are passed in strictly temporal order to the DSMS query processor, while tuples with timestamps after τ stay buffered. The query processor runs the query plans for CQ1 … CQm and produces the answers.]
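A minimal sketch of heartbeat-based time management, written for illustration only. The class and method names are hypothetical, and the release condition (timestamp ≤ τ) is an assumption about what a heartbeat τ guarantees, namely that no element with a timestamp up to τ will arrive later.

```python
import heapq
from typing import Any, List, Tuple


class InputManager:
    """Buffers out-of-order elements and releases them in timestamp order."""

    def __init__(self) -> None:
        # Heap entries are (timestamp, arrival sequence number, value).
        self._buffer: List[Tuple[int, int, Any]] = []
        self._seq = 0  # tie-breaker, so values themselves are never compared

    def arrive(self, value: Any, timestamp: int) -> None:
        heapq.heappush(self._buffer, (timestamp, self._seq, value))
        self._seq += 1

    def heartbeat(self, tau: int) -> List[Tuple[Any, int]]:
        """Release every buffered element with timestamp <= tau, oldest first."""
        released = []
        while self._buffer and self._buffer[0][0] <= tau:
            timestamp, _, value = heapq.heappop(self._buffer)
            released.append((value, timestamp))
        return released


manager = InputManager()
manager.arrive("a", 3)   # arrives first but is not the oldest
manager.arrive("b", 1)
manager.arrive("c", 5)

first = manager.heartbeat(3)    # "c" stays buffered: its timestamp is > 3
second = manager.heartbeat(10)
print(first, second)
```

Load shedding could be added at `arrive` (for example, dropping elements once the buffer exceeds a size limit), at the cost of approximate answers.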

Execution engine & operator implementation
• Data structures and physical storage for buffers with high update rates
• Handling new and expired data stream elements
• Operators and incremental execution
  – Stateless
  – Stateful
    – Duplicate elimination
    – Window join
    – Negation
    – Aggregation
• Memory overflow
• Dynamic optimization of the continuous execution
• Scheduling execution under fluctuating execution settings

Incremental execution of windowing operators
[Figure: panels illustrating incremental execution for stateless operators, window join, duplicate elimination, aggregation and negation.]
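As an illustration of incremental execution for one stateful operator, the sketch below maintains a SUM over a time-based sliding window. It is a simplified example under stated assumptions: the window is (t - range, t], timestamps arrive in non-decreasing order, and the class name is hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingSum:
    """SUM over a sliding window, updated per element instead of recomputed."""

    def __init__(self, window_range: int) -> None:
        self.window_range = window_range
        self.window: Deque[Tuple[int, float]] = deque()  # (timestamp, value)
        self.total = 0.0

    def insert(self, timestamp: int, value: float) -> float:
        # Expired elements: subtract their contribution and drop them.
        while self.window and self.window[0][0] <= timestamp - self.window_range:
            _, expired_value = self.window.popleft()
            self.total -= expired_value
        # New element: add its contribution.
        self.window.append((timestamp, value))
        self.total += value
        return self.total


agg = SlidingSum(window_range=3)
results = [agg.insert(t, v) for t, v in [(1, 10), (2, 5), (4, 1), (7, 2)]]
print(results)
```

Each arrival costs time proportional to the number of expired elements rather than to the window size, which is the point of handling new and expired elements incrementally.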

What have “I” learned?
• 12-15 years of techniques/algorithms/solutions for general stream processing (DSMS)
• Only a few prototypes and commercial products
  – STREAM, Borealis/Aurora, etc.
  – StreamBase, IBM InfoSphere Streams, etc.
• Don’t take it for granted!
• DSMSs are not as mature as DBMSs (>40 years)

Processing Linked Stream Data in a nutshell
[Figure: a continuous query is evaluated over static RDF datasets and stream data in RDF.]

Black-box approach
[Figure: a SPARQL-like query goes to a query rewriter. An orchestrator delegates execution to an underlying engine with its own optimizer, executor, operator implementations and access methods, and data transformation converts between RDF and the data model of that engine. The figure marks the added layers as overhead.]

C-SPARQL
[Figure: C-SPARQL architecture. The query rewriter translates a C-SPARQL query into a SPARQL query for the SPARQL engine and into an Esper EPL query for Esper. The orchestrator coordinates the two engines, with data transformation (RDF to Java objects) between them.]

C-SPARQL execution process
[Figure 4: Query evaluation process. Labels: Stream Manager (STREAM/CQL) over the RDF data streams, Stream Transcoder, Sesame/SPARQL over the static RDF knowledge base, WHERE bindings, variable bindings, aggregation, result projection, and the query forms QUERY, ASK, CONSTRUCT and DESCRIBE with RDF triples as output.]
CQL fragments shown on the slide (truncated in the source):
  market.trdf (broker: integer, tx: …
  CREATE VIEW streaming AS SELECT * FROM …
  CREATE VIEW arregation1 AS SELECT broker, SUM(amount) AS to… FROM filtered GROUP BY broker
The accompanying excerpt notes that C-SPARQL aggregation semantics differ from those of SQL and CQL, and that the C-SPARQL behaviour can be emulated in CQL by computing the aggregation in a separate view and then adding a column with the aggregated value.

C-SPARQL query rewriting

[…] is extended with the additional operators given in Table 1. We also propose to represent the SPARQL filter clause as a node on its own instead of as an attribute of the graph pattern operator node as suggested in [20]. This modification of SQGM is necessary because filter clauses can also be used in C-SPARQL within the AGGREGATE clause, which is independent of the WHERE clause.

Table 1: Additional SQGM operator types
Operator    | Meaning and properties
Stream      | Accesses an RDF stream identified by its IRI.
Window      | Breaks down an RDF stream using a window ω and returns an RDF graph.
Aggregation | Based on an input set of variable bindings, computes the aggregate values given by A(v, f, p, G).
Filter      | Based on an input set of variable bindings, filters out the bindings that do not match the filtering condition R.

To illustrate the construction of O-Graphs, we use an example query computing the daily sum of all transactions of at least 10 Euros done by Swiss brokers. In C-SPARQL:

  REGISTER QUERY TotalAmountPerDayAndBroker AS
  PREFIX b: <…>
  PREFIX x: <…>
  SELECT DISTINCT ?broker ?total
  FROM <…>
  FROM STREAM <…> [RANGE 24h TUMBLING]
  WHERE {
    ?broker b:is_from ?country .
    ?broker x:does ?tx .
    ?tx x:with ?amount .
    FILTER (?country = 'CH' && ?amount >= 10)
  }
  AGGREGATE { (?total, SUM(?amount), ?broker) }

The O-Graph corresponding to the example query, from the root to the leaves, with the output variables of each node (the stream is accessed through a logical window of 24 hours):

  [[P9]]  := …([[P8]]D)  (DISTINCT)                          ?broker ?total
  [[P8]]  := …{?broker, ?total}([[P7]]D)  (projection)       ?broker ?total
  [[P7]]D := [[P6 AGG A(v, f, p, G)]]D                       ?broker ?amount ?total
             A(v, f, p, G) := A(?total, sum, {?amount}, {?broker})
  [[P6]]D := [[P5 FILTER R]]D                                ?broker ?country ?tx ?amount
             R := (?country = 'CH' ∧ ?amount ≥ 10)
  [[P5]]D := [[P1 AND P4]]D                                  ?broker ?country ?tx ?amount
  [[P4]]D := [[P2 AND P3]]D                                  ?broker ?tx ?amount
  [[P1]]D := [[?broker b:is_from ?country]]D                 ?broker ?country
  [[P2]]D := [[?broker x:does ?tx]]D                         ?broker ?tx
  [[P3]]D := [[?tx x:with ?amount]]D                         ?tx ?amount

A logical window is defined as:
  \omega_l(R, t_i, t_f) = \{ (\langle s, p, o \rangle, \tau) \in R \mid t_i < \tau \le t_f \}
Let c(R, t_i, t_f) be a function which counts the items in R which have a timestamp in the range (t_i, t_f]:
  c(R, t_i, t_f) = | \{ (\langle s, p, o \rangle, \tau) \in R \mid t_i < \tau \le t_f \} |
A physical window is defined as:
  \omega_p(R, n) = \{ (\langle s, p, o \rangle, \tau) \in \omega_l(R, t_i, t_f) \mid c(R, t_i, t_f) = n \}

A window ω can be sliding, with range ρ and step σ. For logical windows, ρ and σ take the form of a time interval. Logical windows (a) contain the most recent triples in a time interval of length ρ; and (b) are evaluated with frequency 1/σ. For physical windows, ρ and σ are integers. Physical windows (a) contain the last ρ triples; and (b) are evaluated whenever σ new triples arrive in the stream. A window is said to be tumbling with range ρ if it is sliding with range […]
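The window definitions can be made concrete with a small sketch. It is only an illustration: the function names are mine, timestamps are plain integers, the physical window is read as "the last n elements by timestamp", and the final call assumes the usual reading of a tumbling window as a sliding window whose step equals its range.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]
Element = Tuple[Triple, int]  # (<s, p, o>, tau)


def logical_window(R: List[Element], t_i: int, t_f: int) -> List[Element]:
    """Elements of R whose timestamp tau satisfies t_i < tau <= t_f."""
    return [(triple, tau) for (triple, tau) in R if t_i < tau <= t_f]


def count(R: List[Element], t_i: int, t_f: int) -> int:
    """c(R, t_i, t_f): number of elements with a timestamp in (t_i, t_f]."""
    return len(logical_window(R, t_i, t_f))


def physical_window(R: List[Element], n: int) -> List[Element]:
    """The last n elements of R by timestamp (ties keep arrival order)."""
    ordered = sorted(R, key=lambda element: element[1])
    return ordered[-n:] if n > 0 else []


def sliding_logical(R: List[Element], rho: int, sigma: int, t_end: int):
    """Evaluate a logical window of range rho every sigma time units."""
    return [(t, logical_window(R, t - rho, t))
            for t in range(sigma, t_end + 1, sigma)]


def timestamps(window: List[Element]) -> List[int]:
    return [tau for (_, tau) in window]


# Hypothetical stream: one transaction triple at each of these timestamps.
R = [(("b1", "x:does", f"tx{i}"), tau) for i, tau in enumerate([1, 2, 3, 5, 8])]

print(timestamps(logical_window(R, 2, 5)))
print(count(R, 2, 5))
print(timestamps(physical_window(R, 2)))
# Step equal to range: consecutive windows do not overlap (tumbling).
print([(t, timestamps(w)) for t, w in sliding_logical(R, rho=4, sigma=4, t_end=8)])
```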

SPARQLstream
[Figure: SPARQLstream architecture. The query rewriter translates a SPARQLstream query into the SNEE query language for the SNEE engine. The orchestrator coordinates execution, and data transformation converts the relational results to RDF.]

SPARQLstream: Ontology-based mapping

An example query of SPARQLstream

  PREFIX fire: <…>
  PREFIX rdf: <…>
  SELECT RSTREAM ?WindSpeedAvg
  FROM STREAM <…> [FROM NOW - 10 MINUTES TO NOW STEP 1 MINUTE]
  FROM STREAM <…> [FROM NOW - 3 HOURS TO NOW - 2 HOURS STEP 1 MINUTE]
  WHERE {
    {
      SELECT AVG(?speed) AS ?WindSpeedAvg
      WHERE {
        GRAPH <…> {
          ?WindSpeed a fire:WindSpeedMeasurement ;
                     fire:hasSpeed ?speed ;
        }
      } GROUP BY ?WindSpeed
    }
    {
      SELECT AVG(?archivedSpeed) AS ?WindSpeedHistoryAvg
      WHERE {
        GRAPH <…> {
          ?ArchWindSpeed a fire:WindSpeedMeasurement ;
                         fire:hasSpeed ?archivedSpeed ;
        }
      } GROUP BY ?ArchWindSpeed
    }
    FILTER (?WindSpeedAvg > ?WindSpeedHistoryAvg)
  }

Listing 1. An example SPARQLstream query which every minute computes the average wind speed measurement for each sensor over the last 10 minutes if it is higher than the average of the last 2 to 3 hours.

An example of a mapping rule in S2O

[…] The IRI of the stream is specified in the S2O mapping using the virtualStream element. It can be specified at the conceptmap-def level or at the attributemap-def level.

Relational stream to be mapped:
  streamschema-desc name MeteoSensors
    has-stream SensorWind streamType pushed
      documentation "Wind measurements"
      keycol-desc measurementId columnType integer
      timestamp-desc measureTime columnType datetime
      nonkeycol-desc measureSpeed columnType float
      nonkeycol-desc measureDirection columnType float
  ...
Map the columns with ontological properties:
  conceptmap-def Wind
    virtualStream <…>
    uri-as concat(SensorWind.measurementID)
    applies-if
    described-by
      attributemap-def hasSpeed
        virtualStream <http://semsorgrid4env.eu/SensorReadings.srdf>
        operation constant
          has-column SensorWind.measureSpeed

Listing 2. An example S2O declaration of a data stream schema and mapping from a stream schema to an ontology concept.

EP-SPARQL
[Figure: EP-SPARQL architecture. The query rewriter translates an EP-SPARQL query into Prolog for the Prolog engine. The orchestrator coordinates execution, and data transformation converts RDF to Prolog facts.]

EP-SPARQL
• Execution mechanism: Prolog-based event-driven backward chaining (EDBC) rules
• Representation
  – RDF triple (s,p,o) → predicate triple(s,p,o)
  – Time-stamped RDF triple (s,p,o,t1,t2) → predicate triple(s,p,o,T1,T2)
• Operator rewriting
  – Operators (SeqJoin, Filters, etc.) are rewritten as Prolog rules
  – Two types of EDBC rules
    – Goal-insertion rules: create intermediate goals for incoming events
    – Checking rules: check whether intermediate goals are triggered
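The two rule types can be illustrated with a rough Python analogue of a sequence pattern "first SEQ second". This is not the Prolog code that EP-SPARQL/ETALIS generates: the class, its fields and the rule that goals are never consumed are all simplifying assumptions made for this sketch.

```python
from typing import List, Tuple

Event = Tuple[str, str, str, int, int]  # time-stamped triple (s, p, o, t1, t2)


def to_fact(event: Event) -> str:
    """Render a time-stamped triple in the triple(s,p,o,T1,T2) form."""
    s, p, o, t1, t2 = event
    return f"triple({s},{p},{o},{t1},{t2})"


class SeqDetector:
    """Detects an event with `first_pred` followed by one with `second_pred`."""

    def __init__(self, first_pred: str, second_pred: str) -> None:
        self.first_pred = first_pred
        self.second_pred = second_pred
        self.goals: List[Event] = []   # intermediate goals waiting for a match
        self.matches: List[Tuple[Event, Event, int, int]] = []

    def on_event(self, event: Event) -> None:
        s, p, o, t1, t2 = event
        # Checking rule: does the incoming event trigger a pending goal?
        if p == self.second_pred:
            for goal in self.goals:
                if goal[4] < t1:  # the first event must end before this one starts
                    self.matches.append((goal, event, goal[3], t2))
        # Goal-insertion rule: remember this event as an intermediate goal.
        if p == self.first_pred:
            self.goals.append(event)


detector = SeqDetector("enters", "leaves")
detector.on_event(("bob", "leaves", "roomB", 0, 0))    # no goal yet: ignored
detector.on_event(("alice", "enters", "roomA", 1, 1))  # inserts a goal
detector.on_event(("alice", "leaves", "roomA", 4, 4))  # triggers the goal

print(to_fact(("alice", "enters", "roomA", 1, 1)))
print(detector.matches)
```

The complex event spans the interval from the start of the first event to the end of the second, here (1, 4).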

Whitebox approach: Streaming SPARQL and CQELS
[Figure: the building blocks of a query processing engine (query, optimizer, executor, operator implementations, access methods) implemented natively for RDF. Operator implementations: triple-based physical operators. Access methods over RDF streams and RDF datasets: data structures and physical storage for triple-based data elements.]

Streaming SPARQL
[Figure: the building blocks of a query processing engine (query, optimizer, executor, operator implementations, access methods) as realized in Streaming SPARQL. Operator implementations: extension of SPARQL physical operators for windowing graph patterns. Access methods over RDF streams and RDF datasets: extends SweepArea for triple-based stream elements.]

Examples of executing physical operators of the Streaming SPARQL engine

Example queries:
  SELECT ?w ?x ?y ?z
  FROM STREAM <…> WINDOW RANGE 1000 SLIDE
  WHERE {?w my:name ?x} UNION {?y my:power ?z}

  PREFIX wtur: <…>
  SELECT ?x ?y ?z
  FROM STREAM <…> WINDOW RANGE 30 MINUTE SLIDE
  WHERE {?x wtur:StrCnt ?y .
         OPTIONAL {?x wtur:StopCnt ?z . {WINDOW ELEMS 1500}}}

[Fig. 1: Query plan for Query 1. Fig. 2: Query plan for Query 2.]

As you can see, the triple patterns ?w my:name ?x and ?y my:power ?z are each transformed to a triple pattern matching operator. By this, the RDF statements are filtered from the source and transformed into SPARQL solutions. Afterwards, union and project operators can process the triple pattern matching results. Window definitions in the FROM parts of a query are placed directly before the corresponding triple pattern matching operator, as demonstrated in Listing 1.3 and Figure 2.
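The operator pipeline described above (triple pattern matching, then union, then project) can be sketched with Python generators. This is an illustration only, not the Streaming SPARQL implementation: windows are omitted, the stream is a finite list, the sample triples are made up, and the union simply concatenates its inputs instead of interleaving them.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Solution = Dict[str, Optional[str]]


def match(pattern: Triple, triple: Triple) -> Optional[Solution]:
    """Return the variable bindings if `triple` matches `pattern`, else None."""
    binding: Solution = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if binding.get(term, value) != value:
                return None  # same variable bound to two different values
            binding[term] = value
        elif term != value:
            return None      # constant in the pattern does not match
    return binding


def triple_pattern_matching(pattern: Triple, stream: Iterable[Triple]) -> Iterator[Solution]:
    """Filter RDF statements from the source and turn them into solutions."""
    for triple in stream:
        solution = match(pattern, triple)
        if solution is not None:
            yield solution


def union(left: Iterable[Solution], right: Iterable[Solution]) -> Iterator[Solution]:
    yield from left
    yield from right


def project(solutions: Iterable[Solution], variables: List[str]) -> Iterator[Solution]:
    for solution in solutions:
        yield {v: solution.get(v) for v in variables}


stream = [
    ("t1", "my:name", "Turbine 1"),
    ("t1", "my:power", "42"),
    ("t2", "my:type", "wind"),
]
left = triple_pattern_matching(("?w", "my:name", "?x"), stream)
right = triple_pattern_matching(("?y", "my:power", "?z"), stream)
results = list(project(union(left, right), ["?w", "?x", "?y", "?z"]))
print(results)
```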

CQELS architecture for adaptive and native processing
[Figure: CQELS architecture. Inputs: RDF streams, CQELS queries, and RDF stores/SPARQL endpoints. Components: Input Manager, Encoder, Decoder, Dictionary, Window Buffer Manager, Cache Manager, Cache Fetcher, Adaptive Optimizer and Dynamic Executor. Outputs: RDF streams/SPARQL result streams. The executor panel shows alternative join (⋈) orders over the windows [now] and [range 3s] and the graph G.]

Adaptive execution of CQELS

Example query (localisation scenario):
  CONSTRUCT {?person1 lv:reachable ?person2}
  FROM NAMED <…>
  WHERE {
    STREAM <…> [NOW] {?person1 lv:detectedAt ?loc1}
    STREAM <…> [RANGE 3s] {?person2 lv:detectedAt ?loc2}
    GRAPH <…> {?loc1 lv:connected ?loc2}
  }

[Fig. 3: Left-deep data flows for the query in the localisation scenario: (a) from window [now], (b) from window [range 3s]. Each flow joins (⋈) the new element with the graph G and with the other window.]

Further queries visible on the slide (fragments; IRIs are elided in the source):

Query Q2:
  SELECT ?coAuthName
  FROM NAMED <…>
  WHERE {
    STREAM <…> [TRIPLES 1] {?auth lv:detectedAt ?loc}
    STREAM <…> [RANGE 5s] {?coAuth lv:detectedAt ?loc}
    { ?paper dc:creator ?auth. ?paper dc:creator ?coAuth.
      ?auth foaf:name "$Name$". ?coAuth foaf:name ?coAuthorName }
    FILTER (?auth != ?coAuth)
  }

Query Q3:
  SELECT ?editorName
  WHERE {
    STREAM <…> [TRIPLES 1] {?auth lv:detectedAt ?loc1}
    STREAM <…> [RANGE 15s] {?editor lv:detectedAt ?loc2}
    GRAPH <…> {?loc1 lv:connected ?loc2}
    ?paper dc:creator ?auth. ?paper dcterms:partOf ?proceeding.
    ?proceeding swrc:editor ?editor. ?editor foaf:name ?editorName.
    …
  }
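To make the idea of alternative left-deep data flows concrete, the sketch below joins a new detection from the [NOW] window with the static graph and the [RANGE 3s] window buffer in two possible orders, and picks one at run time. The routing heuristic (probe the smaller input first) and all names and sample data are assumptions for illustration; this is not CQELS's actual routing policy.

```python
from typing import List, Set, Tuple

Detection = Tuple[str, str]   # (person, location)
Edge = Tuple[str, str]        # (loc1, loc2) meaning loc1 lv:connected loc2


def graph_first(new: Detection, window: List[Detection], graph: Set[Edge]):
    """Flow: (new ⋈ G) ⋈ window."""
    person1, loc1 = new
    out = []
    for (a, b) in graph:                     # join with the static graph
        if a == loc1:
            for (person2, loc2) in window:   # then probe the window buffer
                if loc2 == b:
                    out.append((person1, person2))
    return out


def window_first(new: Detection, window: List[Detection], graph: Set[Edge]):
    """Flow: (new ⋈ window) ⋈ G."""
    person1, loc1 = new
    return [(person1, person2)
            for (person2, loc2) in window    # pair with buffered detections
            if (loc1, loc2) in graph]        # then check connectivity in G


def route(new: Detection, window: List[Detection], graph: Set[Edge]):
    """Pick a join order from current sizes (a toy cost heuristic)."""
    if len(graph) <= len(window):
        return graph_first(new, window, graph)
    return window_first(new, window, graph)


graph = {("roomA", "roomB"), ("roomB", "roomA")}
window = [("bob", "roomB"), ("carol", "roomC"), ("dave", "roomB")]
new = ("alice", "roomA")

reachable = route(new, window, graph)
print(sorted(reachable))
```

Both orders return the same pairs; what changes is the amount of intermediate work, which is why the choice can be revisited as window sizes and data distributions fluctuate.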

System comparisons

Table 3: System comparison by features.
System           | Input             | Query language | Extras
Streaming SPARQL | RDF stream        |                |
C-SPARQL         | RDF stream & RDF  | TF             |
EP-SPARQL        | RDF stream & RDF  | EVENT, TF      | Event operators
SPARQLstream     | Relational stream | NEST           | Ontology-based mapping
CQELS            | RDF stream & RDF  | VoS, NEST      | Disk spilling
TF: built-in time functions; EVENT: event pattern; NEST: nested patterns; VoS: variables on streams.

Table 4: Comparison by execution mechanism.
System           | Architecture | Re-execution | Scheduling              | Optimisation
Streaming SPARQL | whitebox     | periodical   | Logical plan            | Algebraic & Static
C-SPARQL         | blackbox     | periodical   | Logical plan            | Algebraic & Static
EP-SPARQL        | blackbox     | eager        | Logic program           | Externalised
SPARQLstream     | blackbox     | periodical   | External call           | Externalised
CQELS            | whitebox     | eager        | Adaptive physical plans | Physical & Adaptive

[…] data and computing environment. Considering triples as the first-class data elements, the CQELS engine employs both efficient data structures for sliding windows and triple storages to provide high-throughput native access methods on RDF datasets and RDF data streams. By adapting the implementations of the physical operators introduced in Section 2.4, CQELS is equipped with an adaptive caching mechanism [14] and indexing schema [63,125,50,130] to accelerate the processing in physical operators and access methods. Similar to other systems, the CQELS engine extended SPARQL 1.1 for […]

Experiment setup for performance comparisons
• Conference scenario: combine linked streams from RFID tags (physical relationships) with DBLP data (social relationships)
• Setup
  – Systems: CQELS vs. ETALIS and C-SPARQL
  – Datasets
    – Replayed RFID data from Open Beacon deployments
    – Simulated DBLP data generated by SP2Bench
  – Queries: 5 query templates with different complexities
    – Q1: selection
    – Q2: stream joins
    – Q3, Q4: stream and non-stream joins
    – Q5: aggregation
  – Experiments
    – Single query: generate 10 query instances of each template by varying the constants
    – Vary the size of the DBLP data (10^4-10^7 triples)
    – Multiple queries: register 2^M instances at the same time (0 ≤ M ≤ 10)

Performance comparison – Query execution time
• CQELS performs faster by orders of magnitude for most queries

Average query execution time (ms):
System   | Query 1 | Query 2 | Query 3 | Query 4 | Query 5
CQELS    | 0.47    | 3.90    | 0.51    | 0.53    | 21.83
C-SPARQL | 332.46  | 99.84   | 331.68  | 395.18  | 322.64
ETALIS   | 0.06    | 27.47   | 79.95   | 469.23  | 160.83

• Simple selection (Q1): ETALIS performs best
• Stream join (Q2): CQELS is about 25 times faster than C-SPARQL and about 7 times faster than ETALIS
• Stream and non-stream joins (Q3, Q4): CQELS is >600 times faster than C-SPARQL and roughly 150-850 times faster than ETALIS
• Aggregation (Q5): CQELS is about 15 times faster than C-SPARQL and about 7 times faster than ETALIS
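The speed-up factors can be recomputed directly from the execution times in the table. The short script below does only that arithmetic; the ratios do not depend on the time unit.

```python
# Average query execution times for Query 1 to Query 5, as listed in the table.
times = {
    "CQELS":    [0.47, 3.90, 0.51, 0.53, 21.83],
    "C-SPARQL": [332.46, 99.84, 331.68, 395.18, 322.64],
    "ETALIS":   [0.06, 27.47, 79.95, 469.23, 160.83],
}


def speedup(other: str, query: int) -> float:
    """How many times faster CQELS is than `other` on the given query (1-5)."""
    return times[other][query - 1] / times["CQELS"][query - 1]


for query in range(1, 6):
    print(
        f"Q{query}: vs C-SPARQL {speedup('C-SPARQL', query):8.1f}x, "
        f"vs ETALIS {speedup('ETALIS', query):8.1f}x"
    )
```

A factor below 1 (ETALIS on Q1) means the other system is faster than CQELS on that query.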

Performance comparison – Scalability (non-stream data size)
• Non-stream data size: execution time is logarithmic in the size of the static intermediate results
[Figure: three panels (Q1, Q3, Q5) plotting average query execution time (ms, log scale) against the number of triples from DBLP (10k, 100k, 1M, 2M, 10M) for CQELS, C-SPARQL and ETALIS.]

Performance comparison – Scalability (number of queries)
• No multiple query optimization
[Figure: three panels (Q1, Q3, Q4) plotting average query execution time (ms, log scale) against the number of queries (1, 10, 100, 1000) for CQELS, C-SPARQL and ETALIS.]

Performance comparison – Maximum input throughput
• C-SPARQL gives better throughput than CQELS?

Performance comparison – Memory consumption

Are they ready for production?
• Not quite! (They are just 4-6 years old.)
• Why?
  – Functionality
  – Performance
  – Scalability
• But interesting to test/compete/extend/study
  – Vast amount of interesting heterogeneous data streams and open Linked Data sources
  – Plethora of use cases/applications
  – New interesting research problems

Open challenges
• Serialization of RDF-based stream elements
• Optimization & Scheduling
• How to measure and compare performance
• Early-stage
  – Expressiveness
  – Reasoning
• Lack of functionalities
  – Disk-based stream processing
  – Distributed processing for large-scale data

How to serialize the RDF stream elements: graph-based stream layout
[Figure: an RDF graph in two parts. Sensor metadata: the nodes :weatherStation, :aTemperature, :aHumidity and :dublinAirport, linked by ssn:observes, ssn:observedBy and ssn:isPropertyOf. Stream data snapshot at 2011-07-08T21:32:52: the nodes :latestWeather, :readings, :tempValue ("18"^^xsd:float, unit "Celsius") and :humidValue ("60"^^xsd:float, unit "%"), linked by ssm:observedProperty, ssn:featureOfInterest, ssn:observationResult, ssn:hasValue, ssm:value and ssm:unit.]

How to serialize the RDF stream elements
Extension of N-Triples: line-based, plain text format

[…] timestamp is assigned to all the triples of a graph. In both cases timestamps are system-generated in monotonic non-decreasing order. Results of two evaluations of the previous query are presented in the table below.

triple                                | timestamp
c:Distr1 t:has-entering-cars "100"    | t400
c:Distr2 t:has-entering-cars "75"     | t400
c:Distr1 t:has-entering-cars "130"    | t401
c:Distr2 t:has-entering-cars "95"     | t401
c:Distr3 t:has-entering-cars "65"     | t401

The first evaluation occurs at t400. Suppose that only data from two sources (i.e., c:Distr1 and c:Distr2) are present in the window. Then, the evaluation generates two triples with the same timestamp (i.e., t400). The second evaluation occurs at t401. Suppose that part of the data elaborated by the previous query are still in the window and that new data related to c:Distr3 entered in the window. Then, the evaluation produces 3 triples; all of […]

• Quad-form representation
• Lack of a standard way to serialize RDF streams
• N-Triples-like representation is inefficient
  – 100-500 bytes to represent 1 integer reading
  – 100k triples/sec → 10-50 MB/sec → 80-400 Mbps bandwidth
• Is a text-line format necessary? Binary format?
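The bandwidth figures follow from simple arithmetic, and a small experiment shows why a binary layout is tempting. The sketch below is illustrative only: the example IRIs, the dictionary-encoded integer identifiers and the 20-byte record layout are assumptions for this example, not a proposed or standard format.

```python
import struct


def bandwidth(bytes_per_triple: int, triples_per_sec: int = 100_000):
    """Return (MB/sec, Mbps) for a stream at the given rate and element size."""
    mb_per_sec = triples_per_sec * bytes_per_triple / 1_000_000
    return mb_per_sec, mb_per_sec * 8


print(bandwidth(100))  # lower bound from the slide: 100 bytes per reading
print(bandwidth(500))  # upper bound from the slide: 500 bytes per reading

# A quad-style text line for one temperature reading (hypothetical IRIs).
text_line = (
    '<http://example.org/sensor/42> <http://example.org/ns#hasValue> '
    '"18"^^<http://www.w3.org/2001/XMLSchema#float> '
    '"2011-07-08T21:32:52"^^<http://www.w3.org/2001/XMLSchema#dateTime> .\n'
).encode("utf-8")

# The same reading with dictionary-encoded subject and predicate identifiers:
# two unsigned 32-bit ids, a 32-bit float value and a 64-bit epoch timestamp.
binary_record = struct.pack("<IIfq", 42, 7, 18.0, 1310160772)

print(len(text_line), "bytes as text vs", len(binary_record), "bytes as binary")
```

The binary form needs a shared dictionary between producer and consumer, which is the price paid for the smaller elements.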

Optimization & Scheduling
• Most optimization happens at the logical plan level → inefficient and restricted in the highly dynamic settings of stream processing.
• Few systems optimize at the physical level, and only with naïve/simplistic algorithms/strategies.
• None supports multiple query optimization.
☞ More studies are needed on optimization and scheduling for graph-based query patterns

How to compare performance
• There are only a few stream benchmarking systems
  – Linear Road benchmark (VLDB 2004)
  – LSBench (to appear at ISWC 2012)
  – SRBench (to appear at ISWC 2012)
• How to define the measure to compare?
  – Execution time/response time?
  – Throughput?
• Many factors cause differences in outputs
  – Differences in semantics
  – The execution mechanisms
  – Execution environments and settings

Early-stage work
• Expressiveness
  – SPARQL extensions based on relational algebra are not expressive enough for stream/event processing applications
  – More expressive continuous query languages?
    – Recursive expressions
    – Rule-based expressions
    – Support for uncertainty in pattern matching
• Stream reasoning based on the RDF data model
  – Emerging topic with some early work
    – Barbieri et al. Incremental Reasoning on Streams and Rich Background Knowledge (ESWC 2010).
    – Komazec et al. Sparkwave: continuous schema-enhanced pattern matching over RDF data streams (DEBS 2012).
  – Complexity vs. low latency → quantitative metrics are needed to judge the advantages of each stream reasoner

Lack of functionalities
• Disk-based stream processing
  – Big windows
  – Big linked data sets
• Distributed stream processing
  – Some general distributed stream processing platforms/systems
    – Borealis, StreamBase, IBM InfoSphere Streams, etc.
    – S4, Storm, Kafka, etc.
  – Use the black-box approach to delegate processing → how to deal with the overhead and the restrictions on optimization?
  – Use the whitebox approach → which physical processing can be reused from such platforms/systems? Can it be better?

Summary
• What is Linked Stream Data?
• Data models for Linked Stream Data
• Query operators and query languages
• How to build a Linked Stream Processing Engine
• Comparisons and analysis of state-of-the-art systems
• Open challenges
  – Serialization of RDF-based stream elements
  – Optimization & Scheduling
  – How to measure and compare performance
  – Early-stage & lack of functionalities