Linked Stream Data Processing Part I: Basic Concepts & Modeling

Digital Enterprise Research Institute

www.deri.ie

Danh Le-Phuoc, Josiane X. Parreira, and Manfred Hauswirth
DERI - National University of Ireland, Galway
Reasoning Web Summer School 2012

© Copyright 2011 Digital Enterprise Research Institute. All rights reserved.

Enabling networked knowledge

Outline

- Part I: Basic Concepts & Modeling (Josi)
  - Linked Stream Data
  - Data models
  - Query Languages and Operators
  - Choices/Challenges when designing a Linked Stream Data processor
- Part II: Building a Linked Stream Processing Engine (Danh)
  - Analysis of available Linked Stream Processing Engines
    – Design choices, implementation
    – Performance comparison
    – Open Challenges

Streams everywhere

Application Domains

- Smart Cities
- Enterprise Environments
- Telehealth

Sorry, I can't understand you…

- Heterogeneous data representations
- Lack of semantics
- A priori knowledge of data sources needed
- Disconnected

Integration Problem!

Semantic Web, Linked Data

- Semantic Web
  - Collaborative movement to promote common data formats on the World Wide Web
  - Inclusion of semantic content in web pages
  - From unstructured and semi-structured documents to a "Web of data"
- Linked Data
  - Best practices to represent, publish and link data on the Semantic Web
  - Linked Data Cloud: collection of datasets that have been published in Linked Data format

LINKED STREAM DATA

Linked Stream Data

- Semantically enriched stream data
- Linked Stream Data examples
  - W3C Semantic Sensor Network Incubator Group
  - RDF wrappers for Twitter, Facebook, etc.
- Data integration: connects dynamic and static data
- Linked Data + DSMS
  - Stream data representation/processing differs from standard RDF/SPARQL
    – Temporal aspect, continuous query processing
  - DSMSs use a relational storage model
    – Efficient RDF processing requires heavy replication

Running Example – Conference scenario

- Tracking system (e.g. RFID tags): stream data
- Attendees information (e.g. DBLP records, FOAF)
- Building information (e.g. layout, connections, room names)
- Different sources (no common schema)
- Linked Data used as unified model

Running Example

(Q1) Inform a participant about the name and description of the location he is currently in

    PREFIX lv: http://deri.org/floorplan/
    PREFIX foaf: http://xmlns.com/foaf/0.1/
    SELECT ?locName ?locDesc
    FROM NAMED
    WHERE {
      STREAM [NOW] {?person lv:detectedat ?loc}
      GRAPH {?loc lv:name ?locName. ?loc lv:desc ?locDesc}
      ?person foaf:name "$Name$".
    }

Linked Stream Data

- Linked Data principles applied to stream data
- Extensions to deal with the temporal aspects
  - Data modeling
  - Query languages
  - Query operators
  - System design and architectures

DATA MODELS, QUERY LANGUAGES AND OPERATORS

Linked Stream Data model

- Extends the definition of RDF nodes and RDF triples
  - RDF node: I, B, and L are pairwise disjoint infinite sets of Internationalized Resource Identifiers (IRIs), blank nodes and literals
  - RDF triple: (s, p, o) ∈ IB × I × IBL, where IL = I ∪ L, IB = I ∪ B and IBL = I ∪ B ∪ L
- Stream element: RDF triple with temporal annotations
  - Interval-based (e.g. ⟨:John :at :office, [7,9]⟩) – Streaming SPARQL
  - Point-based (e.g. ⟨:John :at :office, 7⟩, ⟨:John :at :office, 8⟩, ⟨:John :at :office, 9⟩) – EP-SPARQL, C-SPARQL, SPARQLStream, CQELS
  - Point-based is (possibly) redundant, but instantaneous (more practical)
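The relation between the two annotation styles can be illustrated with a minimal sketch. It assumes discrete integer timestamps, as in the slide example; the function name `expand_interval` is hypothetical and not taken from any of the listed engines.

```python
from typing import Iterator, Tuple

Triple = Tuple[str, str, str]


def expand_interval(triple: Triple, start: int, end: int) -> Iterator[Tuple[Triple, int]]:
    """Yield one point-based element per discrete timestamp in [start, end]."""
    if end < start:
        raise ValueError("interval end must not precede its start")
    for t in range(start, end + 1):
        yield (triple, t)


# The interval-based element <:John :at :office, [7,9]> from the slide
points = list(expand_interval((":John", ":at", ":office"), 7, 9))
for triple, t in points:
    print(triple, t)
```

The expansion shows why the point-based form may be redundant: one interval element becomes three point elements, each of which can be emitted as soon as its timestamp is reached.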

Linked Stream Data model

- RDF Stream: bag of elements ⟨(s,p,o) : [t]⟩
  - (s,p,o): RDF triple
  - t: timestamp
  - Stream elements from stream S with timestamp ≤ t: S^≤t = {⟨(s,p,o) : [t']⟩ ∈ S | t' ≤ t}
- Non-stream data (RDF datasets) also need to follow the Linked Stream Data model to allow integration ⇒ instantaneous RDF dataset G(t)
- G(t): set of RDF triples valid at time t, called instantaneous RDF dataset
- RDF dataset: sequence G = [G(t)], t ∈ N, ordered by t
  - Static RDF dataset (Gs): G(t) = G(t+1) for all t ≥ 0
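As a rough illustration of these definitions, the sketch below represents an RDF stream as a bag (a `Counter` of timestamped triples) and a static dataset as a function of t. The helper names `prefix` and `static_dataset` and the sample triples are assumptions made for this example.

```python
from collections import Counter
from typing import Callable, FrozenSet, Iterable, Tuple

Triple = Tuple[str, str, str]
Element = Tuple[Triple, int]  # (triple, timestamp)


def prefix(stream: Counter, t: int) -> Counter:
    """S^<=t: the sub-bag of stream elements whose timestamp is at most t."""
    return Counter({elem: n for elem, n in stream.items() if elem[1] <= t})


def static_dataset(triples: Iterable[Triple]) -> Callable[[int], FrozenSet[Triple]]:
    """Return G as a function of t such that G(t) = G(t+1) for all t >= 0."""
    frozen = frozenset(triples)
    return lambda t: frozen


S = Counter([
    ((":John", ":at", ":office"), 7),
    ((":John", ":at", ":office"), 7),  # a bag may contain duplicates
    ((":John", ":at", ":lobby"), 9),
])
G = static_dataset([(":office", ":name", "Office 101")])

print(sum(prefix(S, 8).values()))  # number of elements with timestamp <= 8
print(G(0) == G(1))
```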

Query Operators

- Pattern matching operator (extended from basic SPARQL)
  - The primitive operation on RDF streams and instantaneous RDF datasets is pattern matching, which is extended from the triple pattern of SPARQL semantics [90]
  - Mappings are defined as partial functions μ : V ↦ IBL, where V is an infinite set of variables disjoint from IBL, and dom(μ) is the subset of V where μ is defined
- Compatible mappings

    μ1 ≅ μ2 ⟺ ∀x ∈ dom(μ1) ∩ dom(μ2) ⇒ μ1(x) = μ2(x)

For a given triple pattern τ, the triple obtained by replacing variables in τ according to μ is denoted as μ(τ). Let Ω1 and Ω2 be two mapping sets. The join, union, difference and left outer-join operators over Ω1 and Ω2 are defined as follows:

Query Operators

- Join, union, difference and left outer-join are defined over mappings (Ω1 and Ω2 are mapping sets)

    Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) ∪ (Ω1 \ Ω2)    (12)

Query Operators

Three primitive operators on RDF datasets and RDF streams, namely the triple matching pattern operator, the window matching operator and the sequential operator, can be defined from the above operators.

- Triple matching operator: similar to SPARQL, the triple matching pattern operator on an instantaneous RDF dataset at timestamp t is defined as

    [[P, t]]_G = {μ | dom(μ) = var(P) ∧ μ(P) ∈ G(t)}    (13)

  - Triple pattern P ∈ (I∪V) × (I∪V) × (IL∪V)
  - μ(P): triple obtained by replacing variables within P according to μ
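The mapping operators and the triple matching operator can be sketched in a few lines. This is only an illustration: mappings are plain dictionaries, mapping sets are lists, variables are strings starting with `?`, and the join and difference follow the usual SPARQL algebra definitions, which is an assumption since only the left outer-join equation (12) is legible here. All function names and sample triples are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Mapping = Dict[str, str]  # variable name -> RDF term
Triple = Tuple[str, str, str]


def match(pattern: Triple, graph: Sequence[Triple]) -> List[Mapping]:
    """Triple matching over one instantaneous dataset G(t), as in equation (13)."""
    results = []
    for triple in graph:
        mu: Mapping = {}
        for p, v in zip(pattern, triple):
            if p.startswith("?"):
                if mu.setdefault(p, v) != v:  # same variable bound to two terms
                    break
            elif p != v:  # constant in the pattern does not match
                break
        else:
            results.append(mu)
    return results


def compatible(m1: Mapping, m2: Mapping) -> bool:
    """Two mappings are compatible if they agree on every shared variable."""
    return all(m1[x] == m2[x] for x in m1.keys() & m2.keys())


def join(o1: List[Mapping], o2: List[Mapping]) -> List[Mapping]:
    return [{**m1, **m2} for m1 in o1 for m2 in o2 if compatible(m1, m2)]


def union(o1: List[Mapping], o2: List[Mapping]) -> List[Mapping]:
    return o1 + [m for m in o2 if m not in o1]


def difference(o1: List[Mapping], o2: List[Mapping]) -> List[Mapping]:
    return [m1 for m1 in o1 if not any(compatible(m1, m2) for m2 in o2)]


def left_outer_join(o1: List[Mapping], o2: List[Mapping]) -> List[Mapping]:
    """Equation (12): join, plus the left mappings without a compatible partner."""
    return union(join(o1, o2), difference(o1, o2))


G_t = [
    (":john", "lv:detectedat", ":room1"),
    (":mary", "lv:detectedat", ":room2"),
    (":room1", "lv:name", "Room 1"),
]
detected = match(("?person", "lv:detectedat", "?loc"), G_t)
named = match(("?loc", "lv:name", "?locName"), G_t)
print(join(detected, named))
print(left_outer_join(detected, named))
```

In this sample data the join keeps only :john, whose location has a name, while the left outer-join also keeps the mapping for :mary without a `?locName` binding.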

- Window matching operator: a window matching operator [[P, t]]^ω_S over an RDF stream S is defined by extending the operator above as follows:

    [[P, t]]^ω_S = {μ | dom(μ) = var(P) ∧ ⟨μ(P) : [t']⟩ ∈ S ∧ t' ∈ ω(t)}    (14)

  - ω(t): N → 2^N is a function mapping a timestamp to a (possibly infinite) set of timestamps (N: set of natural numbers)
  - ω(t) will depend on the type of the window (e.g. time-based, tuple-based), which gives the flexibility to choose between different window types
  - E.g. a time-based sliding window of size T keeps the timestamps t' with max(0, t − T) ≤ t' ≤ t, and a window that extracts only events happening at the current time corresponds to ω_NOW(t) = {t}

Query Operators

- Sequential operator: a triple-based event matching pattern like the sequential operator SEQ of EP-SPARQL, denoted as ⇒_t, can be defined by using the above operator notations as follows:

    [[P1 ⇒_t P2]]^ω_S = {μ1 ∪ μ2 | μ1 ∈ [[P1, t]]^ω_S ∧ μ2 ∈ [[P2, t]]^ω_S ∧ μ1 ≅ μ2
                          ∧ ⟨μ1(P1) : [t'1]⟩ ∈ S ∧ ⟨μ2(P2) : [t'2]⟩ ∈ S ∧ t'1 ≤ t'2}    (15)

Other temporal relations introduced in [126, 3, 7, 6] can be formalized similarly to the sequential operator.
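The window matching and sequential operators can be sketched as follows. This is a simplified illustration, not the implementation of any engine: the stream is a finite list of (triple, timestamp) pairs, `omega_range` is a hypothetical name for a time-based window of a given size, and `seq` compares the timestamps of the matched in-window elements, which is a simplification of equation (15).

```python
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple

Triple = Tuple[str, str, str]
Element = Tuple[Triple, int]  # (triple, timestamp)
Mapping = Dict[str, str]
Window = Callable[[int], Set[int]]


def omega_now(t: int) -> Set[int]:
    """Window that keeps only the current timestamp."""
    return {t}


def omega_range(size: int) -> Window:
    """Time-based window keeping timestamps t' with max(0, t - size) <= t' <= t."""
    return lambda t: set(range(max(0, t - size), t + 1))


def bind(pattern: Triple, triple: Triple) -> Optional[Mapping]:
    """Return the mapping that turns the pattern into the triple, if any."""
    mu: Mapping = {}
    for p, v in zip(pattern, triple):
        if p.startswith("?"):
            if mu.setdefault(p, v) != v:
                return None
        elif p != v:
            return None
    return mu


def window_match(pattern: Triple, t: int, stream: Sequence[Element],
                 omega: Window) -> List[Tuple[Mapping, int]]:
    """Equation (14), additionally returning the timestamp of each match."""
    window = omega(t)
    matches = []
    for triple, ts in stream:
        if ts in window:
            mu = bind(pattern, triple)
            if mu is not None:
                matches.append((mu, ts))
    return matches


def seq(p1: Triple, p2: Triple, t: int, stream: Sequence[Element],
        omega: Window) -> List[Mapping]:
    """Compatible pairs where the match of p1 is not later than the match of p2."""
    left = window_match(p1, t, stream, omega)
    right = window_match(p2, t, stream, omega)
    return [
        {**m1, **m2}
        for m1, t1 in left
        for m2, t2 in right
        if t1 <= t2 and all(m1[x] == m2[x] for x in m1.keys() & m2.keys())
    ]


stream = [
    ((":john", "lv:detectedat", ":room1"), 5),
    ((":mary", "lv:detectedat", ":room1"), 7),
    ((":bob", "lv:detectedat", ":room2"), 8),
]
pattern = ("?p", "lv:detectedat", "?loc")
print(window_match(pattern, 8, stream, omega_now))
print(len(window_match(pattern, 8, stream, omega_range(3))))
pairs = seq(("?p1", "lv:detectedat", "?loc"), ("?p2", "lv:detectedat", "?loc"),
            8, stream, omega_range(3))
print(pairs)
```

Because the comparison is non-strict, an element also pairs with itself; a stricter ordering would drop those self-pairs.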

- AND, UNION, OPT, FILTER, AGG can be derived from operators introduced so far

3.3 Query languages

To define a descriptive query language, the basic query patterns first need to be introduced to express the primitive operators, i.e. the triple matching, window matching and sequential operators. The composition of basic query patterns can then be expressed by the AND, OPT, UNION and FILTER patterns of SPARQL. These patterns correspond to the operators in Equations (9)–(12). In [17], an aggregation pattern is denoted as A(va, fa, pa, Ga), where va is

Query Languages

- Extensions of the SPARQL grammar for continuous queries
  - A few different languages have been proposed
  - Clauses to handle streams and to add window operators
- Streaming SPARQL: DatastreamClause, Window
- C-SPARQL: FromStrClause, Window
- CQELS: StreamGraphPattern (IRIs for streams)

Query Example: 1 stream

(Q1) Inform a participant about the name and description of the location he just entered

- C-SPARQL

    SELECT ?locName ?locDesc
    FROM STREAM [NOW]
    FROM NAMED
    WHERE {
      ?person lv:detectedat ?loc.
      ?loc lv:name ?locName.
      ?loc lv:desc ?locDesc.
      ?person foaf:name "$Name$".
    }

- CQELS

    SELECT ?locName ?locDesc
    FROM NAMED
    WHERE {
      STREAM [NOW] {?person lv:detectedat ?loc}
      GRAPH {?loc lv:name ?locName. ?loc lv:desc ?locDesc}
      ?person foaf:name "$Name$".
    }

Query Example: 2+ windows on streams

(Q2) Notify two people when they can reach each other from two different and directly connected (from now on called nearby) locations.

- Streaming SPARQL and C-SPARQL don't allow multiple windows on one stream in their grammar
  - C-SPARQL solution: create two virtual streams
- CQELS

    CONSTRUCT {?person1 lv:reachable ?person2}
    FROM NAMED
    WHERE {
      STREAM [NOW] {?person1 lv:detectedat ?loc1}
      STREAM [RANGE 3s] {?person2 lv:detectedat ?loc2}
      GRAPH {?loc1 lv:connected ?loc2}
    }

Query Example: Stream as var

- Different streams can provide the same pattern
- Q3: Name of location of people nearby the DERI building
- CQELS (queries all streams that provide "nearby" info)

    SELECT ?name ?locName
    FROM NAMED
    WHERE {
      STREAM ?streamURI [NOW] {?person lv:detectedat ?loc}
      GRAPH {
        ?streamURI lv:nearby :DERI_Building.
        ?loc lv:name ?locName.
        ?person foaf:name ?name.
      }
    }

Query Example: Timestamps

- EP-SPARQL and C-SPARQL allow functions to deal with timestamps
  - Timestamp can be retrieved and bound to a variable
  - Timestamp of a bound stream element can be retrieved
- Q4: Return pairs of people that were detected in a location at consecutive times (in the last 30 min)
- EP-SPARQL

    CONSTRUCT {?person2 lv:comesAfter ?person1}
    { SELECT ?person1 ?person2
      WHERE {
        {?person1 lv:detectedat ?loc} SEQ {?person2 lv:detectedat ?loc}
      }
      FILTER (getDURATION() [...] =85

- Dstream

    SELECT Dstream(tagid) FROM RFIDstream [60 seconds]

- Rstream

    SELECT Rstream(*) FROM RFIDstream [NOW] WHERE signalstrength>=85

Time Management

- Timestamps are necessary to order stream elements
- Application timestamp (source) vs. system timestamp (DSMS)
- Input manager: buffers to order tuples, ensures they are processed in order
  - Heartbeat (timestamp)
  - Punctuation (pattern)
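A heartbeat-driven input manager could look roughly like the sketch below. It assumes that a heartbeat with timestamp h guarantees that no element with timestamp ≤ h will arrive afterwards; the class name `InputManager` and its methods are hypothetical.

```python
import heapq
from typing import Any, List, Tuple


class InputManager:
    """Buffers out-of-order elements and releases them in timestamp order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._seq = 0  # tie-breaker that keeps arrival order for equal timestamps

    def push(self, ts: int, item: Any) -> None:
        heapq.heappush(self._heap, (ts, self._seq, item))
        self._seq += 1

    def heartbeat(self, h: int) -> List[Tuple[int, Any]]:
        """Release, in order, every buffered element with timestamp <= h."""
        released = []
        while self._heap and self._heap[0][0] <= h:
            ts, _, item = heapq.heappop(self._heap)
            released.append((ts, item))
        return released


manager = InputManager()
for ts, item in [(3, "c"), (1, "a"), (2, "b"), (5, "e")]:
    manager.push(ts, item)
first_batch = manager.heartbeat(3)
second_batch = manager.heartbeat(10)
print(first_batch, second_batch)
```

Elements newer than the heartbeat stay buffered, so downstream operators only ever see a timestamp-ordered sequence.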


Query Evaluation

- Eager re-evaluation vs. periodic re-evaluation
  - Eager: too expensive if update rate is high
  - Periodic: might cause stale results
- Query evaluation needs to handle two types of events
  - Arrival of new stream elements
  - Expiration of old stream elements
  - Actions upon events vary across operators, e.g. an arrival might generate a new result (join) or trigger the removal of an existing result (negation)

Query Evaluation

- Arrivals are triggered by the stream source
- Expiration needs to be handled by the query processor
  - Timestamp
  - Negative tuple: for a window of length wl, a tuple inserted at time t will generate a negative tuple at time t+wl
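The negative tuple approach can be illustrated with a small sketch. It assumes integer timestamps and a positive window length, and it processes an expiration before an arrival that carries the same timestamp; the function names are hypothetical.

```python
from typing import Any, List, Tuple

Event = Tuple[int, int, Any]  # (time, sign, item); sign is +1 (arrival) or -1 (expiration)


def with_negative_tuples(arrivals: List[Tuple[int, Any]], wl: int) -> List[Event]:
    """Pair every arrival at time t with a negative tuple at time t + wl."""
    if wl <= 0:
        raise ValueError("window length must be positive")
    events: List[Event] = []
    for t, item in arrivals:
        events.append((t, +1, item))
        events.append((t + wl, -1, item))
    # Order by time; at equal times expirations (-1) come before arrivals (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    return events


def window_contents(events: List[Event], now: int) -> List[Any]:
    """Replay all events up to and including `now` to get the window contents."""
    contents: List[Any] = []
    for time, sign, item in events:
        if time > now:
            break
        if sign > 0:
            contents.append(item)
        else:
            contents.remove(item)
    return contents


events = with_negative_tuples([(1, "a"), (2, "b"), (4, "c")], wl=3)
print(events)
print(window_contents(events, 4))
```

With this design, operators do not need to inspect timestamps to detect expirations: a negative tuple flows through the plan like any other element and undoes the effect of its positive counterpart.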



Adding and evicting stream elements

[Figure: stream elements at timestamps 1 to 8 being added to and evicted from the windows W1 = [TRIPLES 2], W2 = [RANGE 5] and W3 = [RANGE 5], with the joins W1 ⋈ W2 and W2 ⋈ W3]

Query Evaluation

- Stateless operators: processed "on the fly" (directly on the stream)
  - E.g. selection, union
- Stateful operators: need to maintain processing states (probed at re-evaluation)
  - E.g. window join, aggregation, duplicate elimination, non-monotonic operators

[Fig. 2: Operator implementations: selection (a), window join (b), duplicate elimination (c), aggregation (d), and negation (e). Panel shown: selection (a)]

Query Evaluation

- Window join: a new arrival in one input triggers probing on the other input

[Figure: window join (Fig. 2b)]
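The probing behaviour of a window join can be sketched as below. The sketch assumes two inputs with time-based windows of equal length, arrivals in non-decreasing timestamp order, and an equality join on a single key; `WindowJoin` and its fields are hypothetical names.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class WindowJoin:
    """Symmetric join of two time-based windows of the same length."""

    def __init__(self, window_length: int) -> None:
        self.window_length = window_length
        # One window per input, holding (timestamp, key, value) in arrival order.
        self.windows: Tuple[Deque, Deque] = (deque(), deque())

    def arrive(self, side: int, ts: int, key: Any, value: Any) -> List[Tuple[Any, Any]]:
        """Insert an element into input `side` (0 or 1) and probe the other input."""
        # Evict elements inserted at t with t + window_length <= ts.
        for window in self.windows:
            while window and window[0][0] <= ts - self.window_length:
                window.popleft()
        matches = [v for (_, k, v) in self.windows[1 - side] if k == key]
        self.windows[side].append((ts, key, value))
        if side == 0:
            return [(value, v) for v in matches]
        return [(v, value) for v in matches]


wj = WindowJoin(window_length=3)
r1 = wj.arrive(0, 1, ":room1", ":john")
r2 = wj.arrive(1, 2, ":room1", "Room 1")
r3 = wj.arrive(0, 4, ":room1", ":mary")
r4 = wj.arrive(1, 9, ":room1", "Main hall")
print(r1, r2, r3, r4)
```

A linear scan of the opposite window keeps the example short; an index on the join key would avoid it at the cost of maintaining the index on every arrival and expiration.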

Query Evaluation

- Aggregation
  - Expirations must be dealt with immediately
  - Time and space requirements depend on the aggregation function
- Distributive aggregates
  - Computed incrementally, constant time/space requirements
  - E.g. COUNT, SUM, MAX, MIN
- Algebraic aggregates
  - Computed using values from distributive aggregates; constant time/space requirements
  - E.g. AVG (SUM/COUNT)

[Figure: aggregation (Fig. 2d)]
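The incremental computation of COUNT and SUM, and of AVG derived from them, can be sketched as follows. The class name `SlidingAverage` is hypothetical. MAX and MIN are left out of the sketch because an expiration cannot be undone from the running value alone.

```python
from typing import Optional


class SlidingAverage:
    """Maintains COUNT and SUM incrementally and derives AVG from them."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def arrive(self, value: float) -> None:
        self.count += 1
        self.total += value

    def expire(self, value: float) -> None:
        if self.count == 0:
            raise ValueError("nothing to expire")
        self.count -= 1
        self.total -= value

    @property
    def avg(self) -> Optional[float]:
        # AVG is algebraic: computed from the two distributive aggregates.
        return self.total / self.count if self.count else None


agg = SlidingAverage()
for v in (10, 20, 30):
    agg.arrive(v)
avg_before = agg.avg
agg.expire(10)  # the oldest element leaves the window
avg_after = agg.avg
print(avg_before, avg_after)
```

Each arrival or expiration costs constant time, and the state is two numbers regardless of the window size.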

Query Evaluation

- Holistic aggregates: space consumption linear in the input size
  - E.g. TOP-k, COUNT DISTINCT
- Duplicate elimination
  - Distinct values are kept
  - Expirations are handled eagerly

[Figure: duplicate elimination (Fig. 2c)]
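One possible way to keep the distinct values of a window under arrivals and expirations is to count occurrences per value, as in the sketch below. This is an assumption about a workable design, not the method shown on the slide; `DistinctWindow` is a hypothetical name.

```python
from collections import Counter
from typing import Any, Set


class DistinctWindow:
    """Keeps the distinct values of a window by counting occurrences."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def arrive(self, value: Any) -> bool:
        """Return True if the value is new, i.e. it must be added to the output."""
        self._counts[value] += 1
        return self._counts[value] == 1

    def expire(self, value: Any) -> bool:
        """Return True if the last copy expired, i.e. it must leave the output."""
        if self._counts[value] == 0:
            raise KeyError(value)
        self._counts[value] -= 1
        if self._counts[value] == 0:
            del self._counts[value]
            return True
        return False

    def distinct(self) -> Set[Any]:
        return set(self._counts)


dw = DistinctWindow()
flags = [dw.arrive("a"), dw.arrive("a"), dw.arrive("b")]
first_expiry = dw.expire("a")
second_expiry = dw.expire("a")
print(flags, first_expiry, second_expiry, dw.distinct())
```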

Query Evaluation

- Non-monotonic operators
  - Previous results are removed when they no longer satisfy the query
  - E.g. negation
  - Negative tuples can be used

[Figure: negation (Fig. 2e)]

Memory Overflow

- Some join operators already handle memory overflow by sending input partitions to disk
- Use of secondary storage requires indexes
  - Expensive under high update rates
- Alternative: partition the data to make updates "local"
  - Sort tuples chronologically
  - Inserts in newer partition only
  - Deletes in older partition only
  - Problem: search is not efficient; assumes insertion/expiration order is the same
    – Sub-indexes
    – Doubly partitioned indexes

Query Optimization

- Re-arrange query operators for more efficient execution
  - Traditional selectivity estimates can't be applied
  - Alternative: join reordering based on update rates

[Figure: alternative join (⋈) orderings of the inputs C, G, [now] and [range 3s]]

Query Optimization

- Adaptivity is key!
  - Processor must be able to reorder query operators on the fly
  - Changes in:
    – operator costs (processing time),
    – update rate,
    – input selectivity

[Figure: Adaptive Cost-based Optimization Algorithm: maintain a subset of possible execution plans, re-compute query plan costs, choose best query plan]

"Notify two people who are co-authors of a paper if they are in the same location (within the last 30 seconds)"

Query Optimization

- Operator routing (instead of a fixed query plan tree)
  - Eddies: estimate which operators are faster/more selective
  - Overhead: migration of internal state of query plan
- Continuous query: multi-query optimization possible
  - Better memory usage
  - Trade-offs exist (e.g. join -> selection vs. selection -> join)

[Figure: selections (σ) placed before or after the join of streams SA and SB]

[...] be bound to a variable va are computed as fa(pa, μ[Ga]), where fa(pa, μ[Ga]) is the evaluation of the function fa ∈ ([...] AVG, MAX, MIN) with parameters pa over the groups of values in μ[Ga]. The set of groups of values in μ[Ga] is made of all the distinct tu[...] the subset of the mapping μ[Ga] without duplicate rows.

From the above query patterns, let P1, P2 and P be basic query patterns or composite ones. A declarative query can be composed recursively using the following rules:

1. [[P1 AND P2]] = [[P1]] ⋈ [[P2]]
2. [[P1 OPT P2]] = [[P1]] ⟕ [[P2]]
3. [[P1 UNION P2]] = [[P1]] ∪ [[P2]]

Scheduling

- Data is first pushed into queues, then consumed by operators
- Scheduler determines which data in which queue to process next
  - Different scheduling strategies (e.g. round robin, arrival time, time slice)
  - Choice depends on factors such as stream arrival patterns and max/avg output latency
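As an illustration of one of the strategies named above, the sketch below drains a set of input queues in round-robin order. The function name `round_robin`, the `quantum` parameter and the sample queues are assumptions made for this example.

```python
from collections import deque
from typing import Any, Dict, List, Tuple


def round_robin(queues: Dict[str, List[Any]], quantum: int = 1) -> List[Tuple[str, Any]]:
    """Return the processing order when each queue gets `quantum` items per turn."""
    if quantum < 1:
        raise ValueError("quantum must be at least 1")
    pending = {name: deque(items) for name, items in queues.items()}
    order: List[Tuple[str, Any]] = []
    while any(pending.values()):
        for name, queue in pending.items():
            for _ in range(min(quantum, len(queue))):
                order.append((name, queue.popleft()))
    return order


schedule = round_robin({"A": [1, 2, 3], "B": [10]})
print(schedule)
```

Round robin treats all queues equally; a strategy based on arrival time would instead always pick the queue whose head element has the oldest timestamp, which tends to lower output latency when arrival rates differ.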