Part II: Linked Stream Data Processing - Building a Processing Engine

5 downloads 4516 Views 4MB Size Report
Handling input streams. □ Implement the execution engine. □ Schedule the executions. □ Optimization. What need to be done to build a processing engine?
Digital Enterprise Research Institute

www.deri.ie

Part II: Linked Stream Data Processing Building a Processing Engine Danh Le-Phuoc, Josiane X. Parreira, and Manfred Hauswirth DERI - National University of Ireland, Galway Reasoning Web Summer School 2012

© Copyright 2011 Digital Enterprise Research Institute. All rights reserved.

Enabling networked knowledge

Outline Digital Enterprise Research Institute

n 

n 

Part I: Basic Concepts & Modeling (Josi) ¨ 

Linked Stream Data

¨ 

Data models

¨ 

Query Languages and Operators

¨ 

Choices/Challenges when designing a Linked Stream Data processor

Part II: Building a Linked Stream Processing Engine (Danh) ¨ 

Analysis of available Linked Stream Processing Engines –  Design choices, implementation –  Performance comparison –  Open Challenges

www.deri.ie

“Why should I care?” Digital Enterprise Research Institute

n 

n 

n 

As an application developer ¨ 

What can I do with it?

¨ 

How does it work?

¨ 

How to choose one?

As a processing engine developer ¨ 

How to build one?

¨ 

How to build/identify a better one?

As a researcher ¨ 

What have been done? What left?

¨ 

Is there any interesting research problem?

¨ 

How to find room to improvement?

www.deri.ie

Why a continuous query processing engine is needed?

Digital Enterprise Research Institute

www.deri.ie

Separation of concerns n  n  n 

Focus of application logic Let the experts deal with data operations on streams Minimize the learning efforts ¨ 

Learn simple APIs using the engine

¨ 

Learn a simple query language

How a stream-based application is built with a stream processing engine? Digital Enterprise Research Institute

www.deri.ie

① 

Initialize the engine (less than 5 lines of code)

② 

Write and register the queries to the engine ( 1 line for 1 query)

③ 

Write codes for wiring output streams to the application logic (depends on the application logic but the each wiring code snippet ≈5 lines of code)

④ 

Connect input streams to engine ( 1-3 lines for each stream)

Just 10-20 lines of code!!!

What need to be done to build a processing engine? Digital Enterprise Research Institute

n  n 

n 

Data model : relational, object-oriented, etc Query model : ¨ 

Logical operators : sliding windows, relational algebras

¨ 

Query language: CQL, C-SPARQL,CQELS,etc

Build a processing engine ¨ 

Handling input streams

¨ 

Implement the execution engine

¨ 

Schedule the executions

¨ 

Optimization

www.deri.ie

Building blocks of a query processing engine Digital Enterprise Research Institute

www.deri.ie

Query

Optimizer

Operator implementations Sout

C

.. .

./124

3

./12

[[P3 , t]]G

!

!

[[P1 , t]]S1

[[P2 , t]]S2

4

2

w no

ge

2

1

ra n

Executor

S

.. .

Execution

Access methods

database datastream

Algorithms/technologies/solutions for stream processing Digital Enterprise Research Institute

n 

n 

n  n 

Handling live and push-based data stream sources ¨ 

Time management

¨ 

Load shedding for bursty streams

Operator implementation for execution engine ¨ 

Data structure and physical storage

¨ 

Handling the new stream elements/expired ones

¨ 

Incremental execution

¨ 

Memory overflow

Optimization Scheduling

www.deri.ie

Handling input streams

Answers Digital Enterprise www.deri.ie long a tu- Research Institute efore it can DSMS Query Query le, consider Processor Plans wo separate CQ1 CQ2 CQm . Further, nerate a tuTuples previously < τ strictly temporal order le the temf tuples on Heartbeat τ Buffered Input “paused”. tuples > τ Manager emperature Stream arrival o temperahe temperis informaL1 Network L n Load shedding for bursty rate the tuples < v, τ > < v, τ > y processor Unordered stream elements < v, τ > < v, τ > hether they Sn ples on the S

ed from the olating the alled indefiperty as the

1

Stream emission Source φ1

Source φn

Execution engine & Operator implementation Digital Enterprise Research Institute

n  n  n 

Data structure and physical storages for high-updaterate processing buffers Handling the new data stream elements/expired ones Operators && Incremental execution ¨ 

Stateless

¨ 

Stateful –  –  –  – 

n  n  n 

www.deri.ie

Duplicate elimination Window Join Negation Aggregation

Memory overflow Dynamic Optimization of the continuous execution Schedule execution for fluctuate execution settings

eam Management Incremental execution of windowing operators Digital Enterprise Research Institute

Stateless

Window Join

www.deri.ie

Duplicate elimination

Aggregate Negation

What have “I” learned? Digital Enterprise Research Institute

www.deri.ie

n 

12-15 years of techniques/algorithms/solutions for general stream processing (DSMS)

n 

Only few prototypes and commercial products ¨ 

STREAM ,Borealis/Aurora, etc

¨ 

StreamBase, IBM InfoSphere streams, etc

n 

Don’t take for granted!!!

n 

DSMS is not mature as DBMS(>40 years)

Processing Linked Stream Data In A Nutshell Digital Enterprise Research Institute

www.deri.ie

Continuous Query

Static RDF datasets

Stream Data in RDF

Black-box approach Digital Enterprise Research Institute

www.deri.ie

Query

Query rewriter

Optimizer

Sout

C

Overhead

.. .

./124

3

!

!

[[P1 , t]]S1

[[P2 , t]]S2

4

2

w no

ge

2

1

[[P3 , t]]G

ra n

Executor

./12

S

.. .

Orchestrator

Execution

Access methods

Data transformation

SPARQL-like

Operator implementations

C-SPARQL Digital Enterprise Research Institute

www.deri.ie

CSPARQL to SPARQL

Query Rewriter

SPARQL Engine Data transformation

Data transformation

ESPER

Query Rewriter

CSPARQL to EPSER EPL

C-SPARQL

Orchestrator RDF to Java objects

C-SPARQL execution process Digital Enterprise Research Institute

www.deri.ie STREAM/CQL

market.trdf (broker: integer, tx:

Result Projection

RDF Triples

CONSTRUCT DESCRIBE Aggregation

Variable Bindings

QUERY ASK

CREATE VIEW streaming AS SELECT * FROM

WHERE Bindings

Static RDF Data

RDF Data Streams

Stream Transcoder Sesame/SPARQL

RDF Knowledge Base Static

Then the following CQL statement ing part of the example C-SPAR the nodes in the lower right bran that the stream operator and wind O-Graph map to CQL features ra this translation is straightforward.

Stream Manager

Data Streams Streaming

Figure 4: Query evaluation process

The last step of the query evalu of the aggregations specified in th C-SPARQL, the semantics of agg in SQL and, incidentally, also CQ AGGREGATE clauses cannot be direct aggregation function together wi The main di↵erence between C-S in C-SPARQL, aggregation does n of the result set, whereas the SQ tion has this characteristic.Howev can easily be emulated in CQL by tion in a separate view and then us column” with the aggregated value lation given above. The following t evaluate the aggregation operator Graph according to these semanti

Data transformation

CREATE VIEW arregation1 AS SELECT broker, SUM(amount) AS to FROM filtered GROUP BY broker

s extended with the additional operators given in Table 1. We also propose to represent the SPARQL filter clause as a node on its own instead of as an attribute of the graph pattern operator node as suggested in [20]. This modificaDigital Enterprise Research Institute ion of SQGM is necessary because filter clauses can also be used in C-SPARQL within the AGGREGATE clause, which is Operator Meaning and Properties ndependent of the WHERE clause. Stream Accesses an RDF Stream identified by its To illustrate the construction of O-Graphs, we use an exIRI. ample query computing the daily sumStream of allusing transactions of Window Breaks down an RDF a windowby ! and returns an RDFIn graph. at least 10 Euros done Swiss brokers. C-SPARQL:

C-SPARQL query rewriting www.deri.ie

?broker

?total

[[P9]] := ([[P8]]D) ?broker [[P8]] :=

?total {?broker, ?total}([[P7]]D)

Aggregation Based on an input set of variable bind?broker ?amount ?total REGISTER QUERY TotalAmountPerDayAndBroker AS given ings, computes the aggregate values [[P ]] := [[P AGG A(v, f, p, G)]]D 7 D 6 by A(v, f, p, G). A(v, f, p, G) := A(?total, sum, {?amount}, {?broker}) PREFIX Filter b: Based on an input set of variable bindings, PREFIX x: filters out the bindings that do not match ?broker ?country ?tx ?amount the filtering condition R. [[P6]]D := [[P5 FILTER R]]D tx SELECT DISTINCT ?broker ?total R := (?country = ‘CH’ ?speed 10) FROM Table 1: Additional SQGM operator types FROM STREAM ?broker ?country ?tx ?amount [RANGE 24h TUMBLING] [[P5]]D := [[P1 AND P4]]D WHERE { A?broker b:is_from ?country . logical window is defined as: x:does ?tx . ?broker ?tx ?amount !?broker l (R, ti , tf ) = {(hs, p, oi, ⌧ ) 2 R | ti < ⌧  tf } ?tx x:with ?amount . [[P4]]D := [[P2 AND P3]]D Let c(R, ti , tf ) be a function which counts the items in R FILTER (?country = ’CH’ && ?amount >= 10) which have timestamp in the range (ti , tf ]. } ?broker ?country ?tx ?amount c(R, ti , tf ) = | {(hs, p, oi, ⌧ ) 2 R | ti < ⌧  tf } | AGGREGATE { (?total, SUM(?amount), ?broker) } [[P1]]D := [[?broker a:from ?country]]D [[P3]]D := [[?tx x:with ?amount]]D A physical window is defined as: !p (R, n) = {(hs, p, oi, ⌧ ) 2 !l (R, ti , tf ) | c(R, ti , tf ) = n} The O-Graph corresponding therange example query. For is shown ?broker ?tx A window ! can be sliding,to with ⇢ and step [[P2]]D := [[?broker x:does ?tx]]D logical windows, ⇢ and take the form of a time interval. Logical windows (a) contain the most recent triples in a time interval of length ⇢; and (b) are evaluated with frequency logical(24 hours) 1/ . For physical windows, ⇢ and are integers. Physical windows (a) contain the last ⇢ triples; and (b) are evaluated whenever new triples arrive in the stream. A window is said to be as tumbling with range ⇢ if it is sliding with range

SPARQLStream Digital Enterprise Research Institute

www.deri.ie

SPARQLstream to SNEE language

Query Rewriter

SNEE Engine Relation to RDF

Data transformation

SPARQLstream

Orchestrator

SPARQLstream: Ontology-based mapping Digital Enterprise Research Institute

www.deri.ie

An example query of SPARQLstream Digital Enterprise Research Institute

PREFIX f i r e : PREFIX r d f : SELECT RSTREAM ? WindSpeedAvg FROM STREAM [FROM NOW 10 MINUTES TO NOW STEP 1 MINUTE ] FROM STREAM [FROM NOW HOURS TO NOW 2 HOURS STEP 1 MINUTE ] WHERE { { SELECT AVG( ? s p e e d ) AS ? WindSpeedAvg WHERE { GRAPH { ? WindSpeed a f i r e : WindSpeedMeasurement ; f i r e : hasSpeed ? speed ; } } GROUP BY ? WindSpeed } { SELECT AVG( ? a r c h i v e d S p e e d ) AS ? W i n d S p e e d H i s t o r y A v g WHERE { GRAPH { ? ArchWindSpeed a f i r e : WindSpeedMeasurement ; f i r e : hasSpeed ? archivedSpeed ; } } GROUP BY ? ArchWindSpeed } FILTER ( ? WindSpeedAvg > ? W i n d S p e e d H i s t o r y A v g ) }

www.deri.ie

3

everyListing minute 1. computes the sparql average windquery speed measurement each sensor An example which every minutefor computes the Stream over average the lastwind 10 minutes if it is higher thansensor the average the 2 to if3ithours. speed measurement for each over theof last 10last minutes is

An example of mapping rule in S O

2 element. It and its iri is specified in the s2 o mapping using the virtualStream Digital Enterprise Research can be specified at Institute the conceptmap-def level or at the attributemap-def level. streamschema d e s c name M e t e o S e n s o r s has s t r e a m S e n s o r W i n d streamType p u s h e d d o c u m e n t a t i o n ”Wind m e a s u r e m e n t s ” keycol desc measurementId columnType i n t e g e r Relational stream timestamp d e s c measureTime columnType d a t e t i m e nonkeycol desc measureSpeed columnType f l o a t nonkeycol desc m e a s u r e D i r e c t i o n columnType f l o a t ... conceptmap d e f Wind v i r t u a l S t r e a m u r i as c o n c a t ( S e n s o r W i n d . measurementID ) applies i f d e s c r i b e d by attributemap def hasSpeed v i r t u a l S t r e a m h t t p : / / s e m s o r g r i d 4 e n v . eu / S e n s o r R e a d i n g s . s r d f > operation constant has column S e n s o r W i n d . m e a s u r e S p e e d

www.deri.ie

to be mapped

Map the columns with ontological properties

Listing 2. An example s2 o declaration of a data stream schema and mapping S2O declaration of a data stream schema and mapping from a stream schema to an ontology concept.

from a stream schema to an ontology concept.

EP-SPARQL Digital Enterprise Research Institute

www.deri.ie

EP-SPARQL to Prolog

Query Rewriter

Prolog Engine RDF to prolog facts

Data transformation

EP-SPARQL

EP-SPARQL

Orchestrator

EP-SPARQL Digital Enterprise Research Institute

www.deri.ie

n 

Execution mechanism : Prolog-based event-driven backward chaining (EDBC) rules

n 

Representation ¨  ¨ 

n 

RDF triple (s,p,o) èpredicate triple(s,p,o) Time-stamped RDF triple (s,p,o,t1,t2) è predicate triple(s,p,o,T1,T2)

Operators rewriting ¨  ¨ 

Operators (SeqJoin, Filters, etc) are rewritten in Prolog rules Two types of EDBC rules –  Goal-insertion rules : to create intermediate goals of incoming events –  Checking-rule: check if intermediate goals are triggered

Whitebox approach: Streaming SPARQL and CQELS Digital Enterprise Research Institute

www.deri.ie

Query

Triple-based physical operators

Optimizer

Operator implementations Sout

C

.. .

./124

3

./12

[[P3 , t]]G

!

!

[[P1 , t]]S1

[[P2 , t]]S2

4

2

w no

ge

2

1

ra n

Executor

S

.. .

Execution

Access methods

RDF stream RDF datasets

Data structures and physical storage for triplebased data elements

Streaming SPARQL Digital Enterprise Research Institute

www.deri.ie

Query

Extension of SPARQL physical operators for windowing graph patterns

Optimizer

Operator implementations Sout

C

.. .

./124

3

./12

[[P3 , t]]G

!

!

[[P1 , t]]S1

[[P2 , t]]S2

4

2

w no

ge

2

1

ra n

Executor

S

.. .

Execution

Access methods

RDF stream RDF datasets

Extends SweepArea for triple-based stream elements

HERE

?w my:name ? x

UNION

Fig. plan for Query?1z ? y1. Query my:power

Examples of executing physical operators of As you can1.2. see, Query the triple1patterns Streaming SPARQL engine Listing Digital Enterprise Research Institute

and both transformed to one triple pattern matching operator.www.deri.ie By this, th are filtered from the source and transformed into SPARQL solution union and project operators can process the triple pattern matching r Window definitions in the FROM parts of a query are placed d corresponding triple pattern matching operator as demonstrated in 1.3 and Figure 2.

SELECT ?w ?x ?y ?z FROM STREAM WINDOW RANGE 1000 SLIDE WHERE {?w my:name ?x} UNION {?y my:power ?z}

SELECT ?w ? x ? y ? z s r c . n e t g r a p h . r d f WINDOW RANGE FROM STREAM h t t p : UNION ? y my:power ? z WHERE ?w my:name ? x Listing 1.3. Query 2

Fig. 1. Query plan for Query 1 PREFIX wtur: SELECT ?x ?y ?z FROM STREAM WINDOW RANGE 30 MINUTE SLIDE WHERE {?x wtur:StrCnt ?y . OPTIONAL {?x wtur:StopCnt ?z . {WINDOW ELEMS 1500}}

Fig. 2. Query plan for Query 2

As you can see, the triple patterns and th transformed to one triple pattern matching operator. By this, the RDF stat e filtered from the source and transformed into SPARQL solutions. Afterwa ion and project operators can process the triple pattern matching results. Window definitions in the FROM parts of a query are placed directly bef

CQELS architecture for adaptive and native processing Digital Enterprise Research Institute

www.deri.ie

RDF!streams/SPARQL6Result!streams!

Decoder

Dictionary

C ./

./

./

G

[range 3s] [now ]

C

[now ]

G

[range 3s]

./ ./

./ [range 3s]

G

[now ] [range 3s]

Window' 'Buffer'Manager'

C ./

[now ]

G

Cache'Manager'

Encoder

Input Manager

RDF!streams!

Cache Fetcher

RDF stores/SPARQL Endpoints!

CQELS!queries!

C ./

Adaptive Optimizer

Dynamic Executor

Adaptive execution of CQELS 10

D. Le-Phuoc, M. Dao-Tran, J. X. Parreira and M. Hauswirth

Digital Enterprise Research Institute

www.deri.ie

CONSTRUCT {?person1 lv:reachable ?person2} FROM NAMED WHERE { STREAM [NOW] {?person1 lv:detectedAt ?loc1} STREAM [RANGE 3s] {?person2 lv:detectedAt ?loc2} GRAPH {?loc1 lv:connected ?loc2} }

D. Le-Phuoc, M. Dao-Tran, J. X. Parreira and Q2 M. Hauswirth Query SELECT ?coAuthName FROM NAMED C WHERE { C STREAM [TRIPLES 1] ./ STREAM ./ [RANGE 5s] { ?paper dc:creator ?auth. ?paper dc:creator ?auth foaf:name ‘‘$Name$’’. ?coAuth foaf:name FILTER (?auth != ?coAuth) }

./

./ [range 3s]

G

SELECT ?editorName

[now 3s] WHERE] [range {

[now ]

G

Query Q3

C

C

{?auth lv:detectedAt ?loc} ./ ./ {?coAuth lv:detectedAt ?loc} ?coAuth. ?coAuthorName}

./

G

[range 3s] [now ]

./

[range 3s]

[now ]

G

STREAM [TRIPLES 1] {?auth lv:detectedAt ?loc1} (a) From window now (b) From window range ?loc2} 3s STREAM [RANGE 15s] {?editor lv:detectedAt GRAPH {?loc1 lv:connected ?loc2} ?paper dc:creator ?auth. ?paper dcterms:partOf ?proceeding. ?proceeding swrc:editor ?editor. ?editor foaf:name ?editorName.

Fig. 3: Left-deep data flows for the query in the localisation scenario.

32

System Comparisons

D. Le-Phuoc, J. X. Parreira and M. Hauswirth

Digital Enterprise Research Institute

www.deri.ie

Input Streaming SPARQL

Query language

Extras

RDF stream

C-SPARQL

RDF Stream & RDF

TF

EP-SPARQL

RDF Stream & RDF

EVENT,TF

Event operators

Relational stream

NEST

Ontology-based mapping

RDF Stream & RDF

VoS,NEST

Disk spilling

SPARQLstream CQELS

Data 33 ID TF: built-in time functions EVENT: event pattern NEST:Linked VoS:Processing Variables on streams nestedStream patterns

Table 3: System comparison by features. Architecture

Re-execution

Scheduling

Optimisation

Streaming SPARQL whitebox periodical Logical plan & Static data and computing environment. As considering triples as theAlgebraic first-class data C-SPARQL blackbox periodical Logical plan & Static elements, CQELS engine employs both efficient data structuresAlgebraic for sliding winEP-SPARQL blackbox eager Logic program Externalised dows and triple storages to provide high-throughput native access methods on SPARQLstream periodical External call Externalised RDF dataset and RDFblackbox data streams. By adapting the implementations of the CQELS whitebox eager Adaptive Physical & Adaptive physical operators introduced in Section 2.4, CQELS is equipped with an adapphysical plans tive caching mechanism [14] and indexing schema [63,125,50,130] to accelerate the processing in Table physical operators and methods. 4: Comparisons by access execution mechanism. Similar to other systems, CQELS engine extended SPARQL 1.1 for contin-

Experiment setup for performance comparisons Digital Enterprise Research Institute

www.deri.ie

n 

Conference scenario : combine linked stream from RFID tags (physical relationships) with DBLP data (social relationships)

n 

Setup ¨ 

Systems : CQELS vs ETALIS and C-SPARQL

¨ 

Datasets

– 

Replayed RFID data from Open Beacon deployments

–  Simulated DBLP by SP2Bench ¨ 

Queries : 5 query templates with different complexities –  Q1: selection, –  Q2: stream joins, Q3,Q4 : Stream and non-stream joins –  Q5: aggregation

¨ 

Experiments –  Single query : generate 10 query instances of each template by varying the constants –  Vary size of the DBLP (104-107triples) –  Multiple queries : register 2M instances at the same time(0≤M≤10)

Performance comparison- Query execution time Digital Enterprise Research Institute

n 

www.deri.ie

CQELS perform fasters by orders of magnitudes

Query 1

Query 2

Query 3

Query 4

Query5

CQELS

0.47

3.90

0.51

0.53

21.83

C-SPARQL

332.46

99.84

331.68

395.18

322.64

ETALIS

0.06

27.47

79.95

469.23

160.83

Aggregation: 15 times faster than C-SPARQL 8 times faster than ETALIS

Simple selection : ETALIS perform best Stream join : 25 times faster than C-SPARQL 8 times faster than ETALIS

Stream and non-stream joins : >600 times faster than C-SPARQL 150-850 times faster than ETALIS

Performance comparison– Scalability (non-stream data size) Digital Enterprise Research Institute

www.deri.ie

avg. query exc. time (ms) - log scale

Non-stream data size : logarithmic to size of static intermediate results 10000

CQELS C-SPARQL ETALIS

1000 100

Q1

10 1 0.1 0.01 10k

100k

1M

2M

10M

10000

CQELS C-SPARQL ETALIS

1000

100

Q3 10

1

0.1 10k

100k

1M

2M

Number of triples from DBLP

10M

avg. query exc. time (ms) - log scale

avg. query exc. time (ms) - log scale

Number of triples from DBLP 10000

CQELS C-SPARQL ETALIS

1000

Q5 100

10 10k

100k

1M

2M

Number of triples from DBLP

10M

Performance comparison- Scalability (number of queries) Digital Enterprise Research Institute

www.deri.ie avg. query exc. time (ms) - log scale

1e+06

CQELS 100000 C-SPARQL ETALIS 10000 1000 100 10 1

Q1

0.1 0.01 1

10

100

1000

1e+06

CQELS C-SPARQL 100000 ETALIS 10000 1000 100 10 1

Q3

0.1 1

10

100

1000

avg. query exc time (ms) - log scale

avg. query exc. time (ms) - log scale

Number of queries 1e+06 100000

CQELS C-SPARQL ETALIS

10000 1000 100 10 1

Q4

0.1 1

Number of queries

No multiple query optimization

10

100

Number of queries

1000

Performance comparison-Maximum input throughput Digital Enterprise Research Institute

www.deri.ie





C-SPARQL gives better throughput than CQELS?



Performance comparison – Memory consumption Digital Enterprise Research Institute

www.deri.ie





Are they ready for production? Digital Enterprise Research Institute

n  n 

n 

www.deri.ie

Not quite !!!(just 4-6 years) Why? ¨ 

Functionality

¨ 

Performance

¨ 

Scalability

But interesting to test/compete/extend/study ¨ 

Vast amount of interesting of heterogeneous data streams and open Linked Data sources

¨ 

Plethora of use cases/applications

¨ 

New interesting research problems

Open challenges Digital Enterprise Research Institute

n 

Serialization of RDF-based stream elements

n 

Optimization & Scheduling

n 

How to measure and compare performances

n 

Early-stage

n 

¨ 

Expressiveness

¨ 

Reasoning

Lack of functionalities ¨ 

Disk-based stream processing

¨ 

Distributed processing for large-scale data

www.deri.ie

How to serialize the RDF stream elementsGraph-based stream layout Digital Enterprise Research Institute

www.deri.ie

Sensor metadata

:weatherStation

ssn:observes

ssn:observes

:aTemperature

ssn:observedBy :aHumidity

ssn:isPropertyOf

ssn:isPropertyOf

:dublinAirport

ssm:observedPropery

ssm:observedPropery

ssn:featureOfInterest

:latestWeather :tempValue

ssn:hasValue ssm:value

“18”^xsd:flo at

:humidValue

ssn:observationResult

ssm:unit

“Celcius”

:readings

ssn:hasValue ssm:value

“60”^xsd:flo at

ssm:unit

“%”

Stream data snapshot at 2011-07-08T21:32:52

timestamp is assigned to all the triples of a graph. In both cases timestamps system-generated in monotonic nonHow to serialize theare RDF stream elements decreasing order. Results of two evaluations of the previous Digital Enterprise Research Institute www.deri.ie query are presented in the table below. Extension of NTRIPLE: line-based, plain text format

triple c:Distr1 t:has-entering-cars "100" c:Distr2 t:has-entering-cars "75" c:Distr1 t:has-entering-cars "130" c:Distr2 t:has-entering-cars "95" c:Distr3 t:has-entering-cars "65"

Timestamp t400 t400 t401 t401 t401

The first evaluation occurs at t400 . Suppose that only data Quad-form representation §  Lack of standard wayc:Distr1 to serializeand RDF c:Distr2 Stream from two sources (i.e., ) are present in the window. Then, the evaluation generates two triples with §  N-Triple-like representation is inefficient the same timestamp (i.e., t400 ). §  100-500 bytes to represent 1 integer reading The second evaluation occurs at t401 . Suppose that part of the data elaborated by the previous query are still in the §  100k triples/secè10-50MB/secè 80-400Mbps bandwidth window and that new data related to uc:Distr3 entered in §  Is it necessary in text-line format? Binary format? the window. Then, the evaluation produces 3 triples; all of

Optimization & Scheduling Digital Enterprise Research Institute

n 

At logical plan level èinefficient and restricted on highly dynamic settings of the stream processing.

n 

Few at physical level but only with naïve/simplistic algorithms/strategies.

n 

None support multiple query optimization

☞  Needs

www.deri.ie

more studies on optimization and scheduling graph-based query patterns

How to compare performance Digital Enterprise Research Institute

n 

n 

n 

www.deri.ie

There are only few stream benchmarking systems ¨ 

Linear Road benchmark (VLDB 2004)

¨ 

LSBench (to appear at ISWC 2012)

¨ 

SRBench (to appear at ISWC 2012)

How to define the measurement to compare ¨ 

Execution time/Response time?

¨ 

Throughput?

Too many elements to cause differences in outputs ¨ 

Difference in semantics

¨ 

The execution mechanisms

¨ 

Execution environments and settings

Early-stage work Digital Enterprise Research Institute

n 

Expressiveness ¨ 

SPARQL extensions based on relational algebra is not expressive enough for stream/event processing applications

¨ 

Higher expressive continuous query language? –  Recursive expression –  Rule-based expression –  Support uncertainty in matching pattern

n 

www.deri.ie

Stream reasoning based on RDF data model ¨ 

Emerging topic with some early work –  Barbieri et al. Incremental Reasoning on Streams and Rich Background Knowledge (ESWC’2010). –  Komazec et al. Sparkwave: continuous schema-enhanced pattern matching over RDF data streams (DEBS’2012)

¨ 

Complexity vs low latency è need quantitative metrics to judge the advantages of each stream reasoner

Lack of functionalities Digital Enterprise Research Institute

n 

n 

www.deri.ie

Disk-based stream processing ¨ 

Big windows

¨ 

Big linked data sets

Distributed stream processing ¨ 

Some general distributed stream processing platform/ systems –  Borealis, StreamBase, IBM Stream Spheres, etc –  S4, Storm, Kafka, etc

¨ 

Use black-box approach delegate processing è How to deal with the over head and restriction of optimization?

¨ 

Use whitebox approach è which physical processing can be reused from such platform/system? Can it be better?

Summary Digital Enterprise Research Institute

www.deri.ie

n 

What is Linked Stream Data?

n 

Data models for Linked Stream Data

n 

Query operators and query languages

n 

How to build a Linked Stream Processing Engine

n 

Comparisons and analysis of State-Of-The-Art systems

n 

Open challenges ¨  ¨  ¨  ¨ 

Serialization of RDF-based stream elements Optimization & Scheduling How to measure and compare performances Early-stage & Lack of functionalities