Digital Enterprise Research Institute
www.deri.ie
Linked Stream Data Processing
Part I: Basic Concepts & Modeling
Danh Le-Phuoc, Josiane X. Parreira, and Manfred Hauswirth
DERI - National University of Ireland, Galway
Reasoning Web Summer School 2012
© Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
Enabling networked knowledge
Outline
- Part I: Basic Concepts & Modeling (Josi)
  - Linked Stream Data
  - Data models
  - Query Languages and Operators
  - Choices/Challenges when designing a Linked Stream Data processor
- Part II: Building a Linked Stream Processing Engine (Danh)
  - Analysis of available Linked Stream Processing Engines
    - Design choices, implementation
    - Performance comparison
    - Open Challenges
Streams everywhere
Application Domains
- Smart Cities
- Enterprise Environments
- Telehealth
Sorry, I can't understand you…
- Heterogeneous data representations
- Lack of semantics
- A priori knowledge of data sources needed
- Disconnected
Integration Problem!
Semantic Web, Linked Data
- Semantic Web
  - Collaborative movement to promote common data formats on the World Wide Web
  - Inclusion of semantic content in web pages
  - From unstructured and semi-structured documents to a "Web of data"
- Linked Data
  - Best practices to represent, publish, and link data on the Semantic Web
  - Linked Data Cloud: collection of datasets that have been published in Linked Data format
LINKED STREAM DATA
Linked Stream Data
- Semantically enriched stream data
- Linked Stream Data examples
  - W3C Semantic Sensor Network Incubator Group
  - RDF wrappers for Twitter, Facebook, etc.
- Data integration: connects dynamic and static data
- Linked Data + DSMS
  - Stream data representation/processing differs from standard RDF/SPARQL
    - Temporal aspect, continuous query processing
  - DSMSs use a relational storage model
    - Efficient RDF processing requires heavy replication
Running example
Running Example – Conference scenario
- Tracking system (e.g. RFID tags): stream data
- Attendees information (e.g. DBLP records, FOAF)
- Building information (e.g. layout, connections, room names)
- Different sources (no common schema)
- Linked data used as unified model
Running Example
(Q1) Inform a participant about the name and description of the location he is currently in
    PREFIX lv: <http://deri.org/floorplan/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?locName ?locDesc
    FROM NAMED
    WHERE {
      STREAM [NOW] {?person lv:detectedat ?loc}
      GRAPH {?loc lv:name ?locName. ?loc lv:desc ?locDesc}
      ?person foaf:name "$Name$".
    }
Linked Stream Data
- Linked Data principles applied to stream data
- Extensions to deal with the temporal aspects
  - Data modeling
  - Query languages
  - Query operators
  - System design and architectures
DATA MODELS, QUERY LANGUAGES AND OPERATORS
Linked Stream Data model
- Extends the definition of RDF nodes and RDF triples
  - RDF node: I, B, and L are pairwise disjoint infinite sets of Internationalized Resource Identifiers (IRIs), blank nodes and literals
  - RDF triple: (s, p, o) ∈ IB × I × IBL, where IL = I ∪ L, IB = I ∪ B and IBL = I ∪ B ∪ L
- Stream element: RDF triple with temporal annotations
  - Interval-based (e.g. ⟨:John :at :office, [7,9]⟩): Streaming SPARQL
  - Point-based (e.g. ⟨:John :at :office, 7⟩, ⟨:John :at :office, 8⟩, ⟨:John :at :office, 9⟩): EP-SPARQL, C-SPARQL, SPARQLStream, CQELS
  - Point-based is (possibly) redundant, but instantaneous (more practical)
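The relationship between the two annotation styles can be sketched in Python. This is only an illustration, assuming integer timestamps and closed intervals; the function name `to_point_based` and the tuple layout are hypothetical and not taken from any of the engines listed.

```python
from typing import Iterator, Tuple

Triple = Tuple[str, str, str]


def to_point_based(triple: Triple, start: int, end: int) -> Iterator[Tuple[Triple, int]]:
    """Expand one interval-based element <triple, [start, end]> into
    point-based elements <triple, t>, one per timestamp in the closed interval."""
    if start > end:
        raise ValueError("interval start must not exceed its end")
    for t in range(start, end + 1):
        yield (triple, t)


# The slide's example: <:John :at :office, [7,9]>
john = (":John", ":at", ":office")
points = list(to_point_based(john, 7, 9))
```

The expansion makes the redundancy visible: one interval-based element becomes three point-based ones, each of which can be emitted as soon as its instant is reached.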
Linked Stream Data model
- RDF Stream: bag of elements ⟨(s,p,o) : [t]⟩
  - (s,p,o): RDF triple
  - t: timestamp
  - S≤t: stream elements from stream S with timestamp ≤ t
    S≤t = {⟨(s,p,o) : [t']⟩ ∈ S | t' ≤ t}
- Non-stream data (RDF datasets) also need to follow the Linked Stream Data model to allow integration → instantaneous RDF dataset G(t)
- G(t): set of RDF triples valid at time t, called instantaneous RDF dataset
- RDF dataset: sequence G = [G(t)], t ∈ N, ordered by t
  - Static RDF dataset (Gs): G(t) = G(t+1) for all t ≥ 0
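A small Python sketch of these definitions, assuming a stream is held as an in-memory list of (triple, timestamp) pairs and a dataset is modelled as a function from a timestamp to a set of triples. The names `stream_up_to` and `static_dataset` and the sample data are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]
StreamElement = Tuple[Triple, int]  # <(s, p, o) : [t]>


def stream_up_to(stream: List[StreamElement], t: int) -> List[StreamElement]:
    """S<=t: the bag of elements of `stream` whose timestamp is at most t."""
    return [(triple, ts) for (triple, ts) in stream if ts <= t]


def static_dataset(triples: Set[Triple]) -> Callable[[int], Set[Triple]]:
    """A static RDF dataset Gs: G(t) is the same set of triples for every t."""
    return lambda t: triples


rfid = [((":John", ":at", ":office"), 7),
        ((":John", ":at", ":lab"), 8),
        ((":Mary", ":at", ":lab"), 10)]
floorplan = static_dataset({(":office", ":connected", ":lab")})
```

Treating static data as a constant sequence G(t) is what lets stream and non-stream sources be queried with the same operators.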
Query Operators
- Pattern matching operator (extended from SPARQL): the primitive operation on RDF streams and instantaneous RDF datasets, extended from the triple patterns of SPARQL semantics [90]
  - Outputs are mappings, which are defined as partial functions $\mu : V \mapsto IBL$, where $V$ is an infinite set of variables disjoint from $IBL$, and $\mathrm{dom}(\mu)$ is the subset of $V$ where $\mu$ is defined
  - For a given triple pattern $\tau$, the triple obtained by replacing variables according to $\mu$ is denoted as $\mu(\tau)$
- Compatible mappings: two mappings $\mu_1$ and $\mu_2$ are compatible, denoted as $\mu_1 \doteq \mu_2$, if
  $\mu_1 \doteq \mu_2 \iff \forall x \in \mathrm{dom}(\mu_1) \cap \mathrm{dom}(\mu_2) \Rightarrow \mu_1(x) = \mu_2(x)$
Query Operators
- Join, union, difference and left outer-join are defined over mappings ($\Omega_1$ and $\Omega_2$ are mapping sets)
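The four operators over mapping sets can be sketched in Python as follows. The sketch assumes the standard SPARQL algebra definitions (join merges compatible mappings, difference keeps mappings with no compatible partner, left outer-join is the union of the two), represents a mapping as a `dict` from variable names to terms, and uses lists as bags. All function names and sample values are illustrative.

```python
from typing import Dict, List

Mapping = Dict[str, str]  # variable name -> RDF term


def compatible(m1: Mapping, m2: Mapping) -> bool:
    """Two mappings are compatible if they agree on every shared variable."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def join(o1: List[Mapping], o2: List[Mapping]) -> List[Mapping]:
    """Merge every compatible pair of mappings."""
    return [{**m1, **m2} for m1 in o1 for m2 in o2 if compatible(m1, m2)]


def union(o1: List[Mapping], o2: List[Mapping]) -> List[Mapping]:
    """Bag union: all mappings of both inputs."""
    return o1 + o2


def difference(o1: List[Mapping], o2: List[Mapping]) -> List[Mapping]:
    """Mappings of o1 that are compatible with no mapping of o2."""
    return [m1 for m1 in o1 if not any(compatible(m1, m2) for m2 in o2)]


def left_outer_join(o1: List[Mapping], o2: List[Mapping]) -> List[Mapping]:
    """Joined mappings plus the o1 mappings that found no partner."""
    return union(join(o1, o2), difference(o1, o2))


omega1 = [{"?person": ":John", "?loc": ":office"},
          {"?person": ":Mary", "?loc": ":lab"}]
omega2 = [{"?loc": ":office", "?locName": "Office 101"}]
joined = join(omega1, omega2)
outer = left_outer_join(omega1, omega2)
```

In `outer`, the mapping for `:Mary` survives without a `?locName` binding, which is the behaviour an optional pattern relies on.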
Query Operators
- Left outer-join of two mapping sets $\Omega_1$ and $\Omega_2$:
  $\Omega_1 ⟕ \Omega_2 = (\Omega_1 \bowtie \Omega_2) \cup (\Omega_1 \setminus \Omega_2)$   (12)
- Three primitive operators on RDF datasets and RDF streams, namely the triple matching pattern operator, the window matching operator and the sequential operator, can be defined from the above operators
- Triple matching operator: similar to SPARQL, the triple matching pattern operator on an instantaneous RDF dataset at timestamp $t$ is defined as
  $[[P, t]]_G = \{\mu \mid \mathrm{dom}(\mu) = \mathrm{var}(P) \wedge \mu(P) \in G(t)\}$   (13)
  - Triple pattern $P \in (I \cup V) \times (I \cup V) \times (IL \cup V)$
  - $\mu(P)$: triple obtained by replacing variables within $P$ according to $\mu$
- Window operator: a window matching operator $[[P, t]]^{\omega}_S$ over an RDF stream $S$ is defined by extending the operator above as follows:
  $[[P, t]]^{\omega}_S = \{\mu \mid \mathrm{dom}(\mu) = \mathrm{var}(P) \wedge \langle \mu(P) : [t'] \rangle \in S \wedge t' \in \omega(t)\}$   (14)
  - $\omega(t) : N \to 2^N$: function mapping a timestamp to a (possibly infinite) set of timestamps ($N$: set of natural numbers); this gives the flexibility to choose between different windows
  - $\omega(t)$ depends on the type of the window (e.g. time-based, tuple-based)
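The triple matching and window matching operators can be illustrated with a short Python sketch. It assumes variables are strings starting with `?`, timestamps are integers, and two example window functions: `omega_now`, which keeps only the current instant, and a time-based `omega_range` of a given size. These names, the `lv:` terms and the sample stream are assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Mapping = Dict[str, str]            # variable -> RDF term
StreamElement = Tuple[Triple, int]  # <(s, p, o) : [t']>


def match(pattern: Triple, triple: Triple) -> Optional[Mapping]:
    """Return mu with dom(mu) = var(pattern) and mu(pattern) == triple, else None."""
    mu: Mapping = {}
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):               # variables are written ?name
            if mu.setdefault(p_term, t_term) != t_term:
                return None                      # one variable, two different terms
        elif p_term != t_term:
            return None                          # constant does not match
    return mu


def triple_match(pattern: Triple, dataset: Set[Triple]) -> List[Mapping]:
    """Triple matching, with `dataset` playing the role of G(t)."""
    return [mu for mu in (match(pattern, tr) for tr in dataset) if mu is not None]


def omega_now(t: int) -> Set[int]:
    """Window that keeps only the current timestamp."""
    return {t}


def omega_range(size: int) -> Callable[[int], Set[int]]:
    """Time-based window covering the timestamps from max(0, t - size) to t."""
    return lambda t: set(range(max(0, t - size), t + 1))


def window_match(pattern: Triple, t: int, stream: List[StreamElement],
                 omega: Callable[[int], Set[int]]) -> List[Mapping]:
    """Window matching: match only stream elements whose timestamp is in omega(t)."""
    valid = omega(t)
    matches = []
    for triple, ts in stream:
        mu = match(pattern, triple) if ts in valid else None
        if mu is not None:
            matches.append(mu)
    return matches


rfid = [((":John", "lv:detectedat", ":office"), 7),
        ((":Mary", "lv:detectedat", ":lab"), 9),
        ((":John", "lv:detectedat", ":lab"), 10)]
pattern = ("?person", "lv:detectedat", "?loc")
now_result = window_match(pattern, 10, rfid, omega_now)
range_result = window_match(pattern, 10, rfid, omega_range(1))
```

Swapping the `omega` argument is all it takes to change the window type, which mirrors how the definition keeps the window function abstract.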
Query Operators
- Window examples: a time-based window of size $T$ bounds its timestamps from below by $\max(0, t - T)$, and a window that extracts only events happening at the current time corresponds to $\omega_{NOW}(t) = \{t\}$
- Sequential operator: a triple-based event matching pattern like the sequential operator SEQ of EP-SPARQL, denoted as $\Rightarrow_t$, can be defined by using the above operator notations as follows:
  $[[P_1 \Rightarrow_t P_2]]^{\omega}_S = \{\mu_1 \cup \mu_2 \mid \mu_1 \in [[P_1, t]]^{\omega}_S \wedge \mu_2 \in [[P_2, t]]^{\omega}_S \wedge \mu_1 \doteq \mu_2 \wedge \langle \mu_1(P_1) : [t'_1] \rangle \in S \wedge \langle \mu_2(P_2) : [t'_2] \rangle \in S \wedge t'_1 \le t'_2\}$   (15)
- Other temporal relations introduced in [126, 3, 7, 6] can be formalized similarly to the sequential operator
- AND, UNION, OPT, FILTER, AGG can be derived from the operators introduced so far
- Query languages: to define a declarative query language, the basic query patterns first need to be introduced to express the primitive operators, i.e. triple matching, window matching, and sequential operators. The composition of basic query patterns can then be expressed by the AND, OPT, UNION and FILTER patterns of SPARQL. These patterns correspond to the operators in Equations (9) to (12). In [17], an aggregation pattern is denoted as $A(v_a, f_a, p_a, G_a)$, where $v_a$ is …
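A hedged Python sketch of the sequential operator follows. It assumes the temporal condition is "first timestamp less than or equal to second" (the comparison symbol is not legible in the source), integer timestamps, and variables written as `?name`. The helper names and the sample stream are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Mapping = Dict[str, str]
StreamElement = Tuple[Triple, int]


def match(pattern: Triple, triple: Triple) -> Optional[Mapping]:
    """Mapping that turns `pattern` into `triple`, or None if there is none."""
    mu: Mapping = {}
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            if mu.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return mu


def compatible(m1: Mapping, m2: Mapping) -> bool:
    """Mappings agree on every variable they share."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def seq(p1: Triple, p2: Triple, t: int, stream: List[StreamElement],
        omega: Callable[[int], Set[int]]) -> List[Mapping]:
    """Pairs of compatible matches inside the window where the match of p1
    is timestamped no later than the match of p2 (assumed: t1 <= t2)."""
    valid = omega(t)
    in_window = [(triple, ts) for triple, ts in stream if ts in valid]
    result = []
    for triple1, t1 in in_window:
        mu1 = match(p1, triple1)
        if mu1 is None:
            continue
        for triple2, t2 in in_window:
            mu2 = match(p2, triple2)
            if mu2 is not None and t1 <= t2 and compatible(mu1, mu2):
                result.append({**mu1, **mu2})
    return result


stream = [((":John", "lv:detectedat", ":lab"), 5),
          ((":Mary", "lv:detectedat", ":lab"), 8),
          ((":Bob", "lv:detectedat", ":office"), 9)]
p1 = ("?p1", "lv:detectedat", "?loc")
p2 = ("?p2", "lv:detectedat", "?loc")
pairs = seq(p1, p2, 10, stream, lambda t: set(range(0, t + 1)))
```

With a non-strict comparison an element can pair with itself; a strict comparison would be a one-character change if that reading of the definition is preferred.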
Query Languages
- Extensions of SPARQL grammar for continuous queries
  - A few different languages have been proposed
  - Clauses to handle streams and to add window operators
- Streaming SPARQL: DatastreamClause, Window
- C-SPARQL: FromStrClause, Window
- CQELS: StreamGraphPattern (IRIs for streams)
Query Example: 1 stream
(Q1) Inform a participant about the name and description of the location he just entered
- C-SPARQL
    SELECT ?locName ?locDesc
    FROM STREAM [NOW]
    FROM NAMED
    WHERE {
      ?person lv:detectedat ?loc.
      ?loc lv:name ?locName.
      ?loc lv:desc ?locDesc.
      ?person foaf:name "$Name$".
    }
- CQELS
    SELECT ?locName ?locDesc
    FROM NAMED
    WHERE {
      STREAM [NOW] {?person lv:detectedat ?loc}
      GRAPH {?loc lv:name ?locName. ?loc lv:desc ?locDesc}
      ?person foaf:name "$Name$".
    }
Query Example: 2+ windows on streams
(Q2) Notify two people when they can reach each other from two different and directly connected (from now on called nearby) locations.
- Streaming SPARQL and C-SPARQL don't allow multiple windows on one stream in their grammar
  - C-SPARQL solution: create two virtual streams
- CQELS
    CONSTRUCT {?person1 lv:reachable ?person2}
    FROM NAMED
    WHERE {
      STREAM [NOW] {?person1 lv:detectedat ?loc1}
      STREAM [RANGE 3s] {?person2 lv:detectedat ?loc2}
      GRAPH {?loc1 lv:connected ?loc2}
    }
Query Example: Stream as var
- Different streams can provide the same pattern
(Q3) Name of location of people nearby the DERI building
- CQELS (queries all streams that provide "nearby" info)
    SELECT ?name ?locName
    FROM NAMED
    WHERE {
      STREAM ?streamURI [NOW] {?person lv:detectedat ?loc}
      GRAPH {
        ?streamURI lv:nearby :DERI_Building.
        ?loc lv:name ?locName.
        ?person foaf:name ?name.
      }
    }
Query Example: Timestamps
- EP-SPARQL and C-SPARQL allow functions to deal with timestamps
  - Timestamp can be retrieved and bound to a variable
  - Timestamp of a bound stream element can be retrieved
(Q4) Return pairs of people that were detected in a location at consecutive times (in the last 30 min)
- EP-SPARQL
    CONSTRUCT {?person2 lv:comesAfter ?person1} {
      SELECT ?person1 ?person2
      WHERE {
        {?person1 lv:detectedat ?loc}
        SEQ
        {?person2 lv:detectedat ?loc}
      }
      FILTER (getDURATION() … =85
- Dstream
    SELECT Dstream(tagid) FROM RFIDstream [60 seconds]
- Rstream
    SELECT Rstream(*) FROM RFIDstream [NOW] WHERE signalstrength>=85
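The `Dstream` and `Rstream` keywords in these queries look like the relation-to-stream operators of CQL. Assuming that reading, the Python sketch below shows what each operator emits at one time step, given the window contents at the previous and the current instant. `istream` is included for completeness although it does not appear in the surviving text; all names and sample rows are illustrative.

```python
from typing import Set, Tuple

Row = Tuple[str, ...]


def rstream(current: Set[Row]) -> Set[Row]:
    """Rstream: every row in the relation at the current instant."""
    return set(current)


def istream(previous: Set[Row], current: Set[Row]) -> Set[Row]:
    """Istream (assumed CQL semantics): rows that were just inserted."""
    return current - previous


def dstream(previous: Set[Row], current: Set[Row]) -> Set[Row]:
    """Dstream: rows that were just deleted, e.g. tags that left the window."""
    return previous - current


before = {("tag1",), ("tag2",)}
after = {("tag2",), ("tag3",)}
inserted = istream(before, after)
deleted = dstream(before, after)
```

Under this reading, the `Dstream(tagid)` query over a 60-second window reports a tag once it has dropped out of the window.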
Time Management
- Timestamps are necessary to order stream elements
- Application timestamp (source) vs. system timestamp (DSMS)
- Input manager: buffers to order tuples, ensuring they are processed in order
  - Heartbeat (timestamp)
  - Punctuation (pattern)
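One way an input manager could use heartbeats is sketched below in Python. It assumes a heartbeat with timestamp h is a promise that no later arrival carries a timestamp of h or less, so everything buffered up to h can be released in timestamp order. The class name `InputManager` and its methods are hypothetical.

```python
import heapq
from typing import Any, List, Tuple


class InputManager:
    """Buffers out-of-order tuples and releases them in timestamp order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._seq = 0  # tie-breaker: keeps arrival order among equal timestamps

    def arrive(self, timestamp: int, payload: Any) -> None:
        """Buffer a tuple; nothing is released until a heartbeat allows it."""
        heapq.heappush(self._heap, (timestamp, self._seq, payload))
        self._seq += 1

    def heartbeat(self, timestamp: int) -> List[Tuple[int, Any]]:
        """Release, in order, all buffered tuples with timestamp <= heartbeat."""
        released = []
        while self._heap and self._heap[0][0] <= timestamp:
            ts, _, payload = heapq.heappop(self._heap)
            released.append((ts, payload))
        return released


manager = InputManager()
manager.arrive(5, "a")
manager.arrive(3, "b")   # arrives late but carries an earlier timestamp
manager.arrive(7, "c")
first = manager.heartbeat(5)
second = manager.heartbeat(10)
```

The trade-off is latency: tuples wait in the buffer until the next heartbeat, so heartbeat frequency bounds how stale the output can be.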
Query Evaluation
- Eager re-evaluation vs. periodic re-evaluation
  - Eager: too expensive if update rate is high
  - Periodic: might cause stale results
- Query evaluation needs to handle two types of events
  - Arrival of new stream elements
  - Expiration of old stream elements
  - Actions upon events vary across operators, e.g. an arrival might generate a new result (join) or trigger the removal of an existing result (negation)
Query Evaluation
- Arrivals are triggered by the stream source
- Expiration needs to be handled by the query processor
  - Timestamp
  - Negative tuple: for a window of length wl, a tuple inserted at time t will generate a negative tuple at time t+wl
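The negative-tuple approach can be sketched in Python as follows: each insertion at time t schedules a matching negative tuple for time t + wl, and the processor emits the negative tuples that have become due as time advances. The class `NegativeTupleWindow` and the `("-", item)` event format are assumptions for illustration.

```python
import heapq
from typing import Any, List, Tuple


class NegativeTupleWindow:
    """Time-based window of length wl that signals expirations as negative tuples."""

    def __init__(self, window_length: int) -> None:
        self.wl = window_length
        self._pending: List[Tuple[int, int, Any]] = []  # (due time, seq, item)
        self._seq = 0

    def insert(self, t: int, item: Any) -> Tuple[str, Any]:
        """Accept a tuple at time t and schedule its negative tuple for t + wl."""
        heapq.heappush(self._pending, (t + self.wl, self._seq, item))
        self._seq += 1
        return ("+", item)

    def advance(self, now: int) -> List[Tuple[str, Any]]:
        """Emit the negative tuples that are due at or before `now`."""
        expired = []
        while self._pending and self._pending[0][0] <= now:
            _, _, item = heapq.heappop(self._pending)
            expired.append(("-", item))
        return expired


window = NegativeTupleWindow(window_length=5)
window.insert(1, "a")
window.insert(3, "b")
at_5 = window.advance(5)
at_6 = window.advance(6)
at_8 = window.advance(8)
```

Downstream operators can then treat expirations like any other input event, at the cost of roughly doubling the number of tuples flowing through the plan.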
Adding and evicting stream elements
[Figure: timeline (timestamps 1 to 8) of elements added to and evicted from the windows W1 = [TRIPLES 2], W2 = [RANGE 5] and W3 = [RANGE 5], combined as W1 ⋈ W2 and W2 ⋈ W3]
Query Evaluation
- Stateless operators: processed "on the fly" (directly on stream)
  - E.g. selection, union
- Stateful operators: need to maintain processing states (probed at re-evaluation)
  - E.g. window join, aggregation, duplicate elimination, non-monotonic operators
[Figure: Selection. Fig. 2: Operator implementations: selection (a), window join (b), duplicate elimination (c), aggregation (d), and negation (e)]
Query Evaluation
- Window join: a new arrival in one input triggers probing on the other input
[Figure: Window join, Fig. 2 (b)]
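The probing behaviour of a window join can be sketched in Python. This is a simplified illustration, assuming two time-based windows, an equality join on a single key, and arrivals delivered in non-decreasing timestamp order; the class `WindowJoin` and its interface are hypothetical.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Tuple


class WindowJoin:
    """Joins two streams on a key; each side keeps a time-based window as state."""

    def __init__(self, range_left: int, range_right: int) -> None:
        self.ranges: Dict[str, int] = {"L": range_left, "R": range_right}
        # Per side: (timestamp, key, value) in arrival order.
        self.state: Dict[str, Deque[Tuple[int, Any, Any]]] = {"L": deque(), "R": deque()}

    def _expire(self, side: str, now: int) -> None:
        """Drop elements that fell out of the window of `side`."""
        window = self.state[side]
        while window and window[0][0] < now - self.ranges[side]:
            window.popleft()

    def arrive(self, side: str, t: int, key: Any, value: Any) -> List[Tuple[Any, Any]]:
        """Insert into `side`, probe the other side, return (left, right) results."""
        other = "R" if side == "L" else "L"
        self._expire("L", t)
        self._expire("R", t)
        matches = [v for (_, k, v) in self.state[other] if k == key]
        self.state[side].append((t, key, value))
        if side == "L":
            return [(value, m) for m in matches]
        return [(m, value) for m in matches]


join = WindowJoin(range_left=3, range_right=3)
r1 = join.arrive("L", 1, ":lab", ":John")
r2 = join.arrive("R", 2, ":lab", ":Mary")
r3 = join.arrive("R", 6, ":lab", ":Bob")   # :John has expired by time 6
```

The probe here is a linear scan; a hash index on the join key would be the usual refinement when windows are large.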
Query Evaluation
- Aggregation
  - Expirations must be dealt with immediately
  - Time and space requirements depend on the aggregation function
- Distributive aggregates
  - Computed incrementally, constant time/space requirements
  - E.g. COUNT, SUM, MAX, MIN
- Algebraic aggregates
  - Computed using values from distributive aggregates; constant time/space requirements
  - E.g. AVG (SUM/COUNT)
[Figure: Aggregation, Fig. 2 (d)]
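As an illustration of an algebraic aggregate built from distributive ones, the Python sketch below maintains AVG over a window from a running SUM and COUNT, updating both on every arrival and expiration. The class `WindowedAverage` is hypothetical, and the sample values are made up.

```python
from typing import Optional


class WindowedAverage:
    """AVG maintained incrementally as SUM / COUNT in constant space."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def insert(self, value: float) -> None:
        """A new element enters the window."""
        self.count += 1
        self.total += value

    def expire(self, value: float) -> None:
        """An old element leaves the window; undo its contribution."""
        self.count -= 1
        self.total -= value

    def average(self) -> Optional[float]:
        """Current average, or None when the window is empty."""
        return self.total / self.count if self.count else None


avg = WindowedAverage()
avg.insert(80)
avg.insert(90)
both = avg.average()
avg.expire(80)
one = avg.average()
avg.expire(90)
empty = avg.average()
```

MAX and MIN are left out of the sketch because an expiration cannot be undone by a simple arithmetic update: when the current maximum expires, some further state is needed to find its successor.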
Query Evaluation
- Holistic aggregates: space consumption linear to input sizes
  - E.g. TOP-k, COUNT DISTINCT
- Duplicate elimination
  - Distinct values are kept
  - Expirations are handled eagerly
[Figure: Duplicate elimination, Fig. 2 (c)]
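Duplicate elimination with eager expiration handling might look like the following Python sketch. It assumes the operator keeps one reference count per distinct value, so that a value is retracted only when its last copy expires; the class name and the boolean return convention are hypothetical.

```python
from collections import Counter
from typing import Hashable, Set


class DuplicateElimination:
    """Keeps the distinct values currently in the window, with reference counts."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def insert(self, value: Hashable) -> bool:
        """Return True if `value` is new and should be emitted downstream."""
        self._counts[value] += 1
        return self._counts[value] == 1

    def expire(self, value: Hashable) -> bool:
        """Return True if the last copy expired and `value` must be retracted."""
        if value not in self._counts:
            raise KeyError(value)
        self._counts[value] -= 1
        if self._counts[value] == 0:
            del self._counts[value]
            return True
        return False

    def distinct(self) -> Set[Hashable]:
        """The distinct values currently held."""
        return set(self._counts)


dedup = DuplicateElimination()
emitted = [dedup.insert(":lab"), dedup.insert(":lab"), dedup.insert(":office")]
still_there = dedup.expire(":lab")
gone = dedup.expire(":lab")
```

State grows with the number of distinct values in the window, which is why this operator sits with the stateful ones.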
Query Evaluation
- Non-monotonic operators
  - Previous results are removed when they no longer satisfy the query
  - E.g. negation
  - Negative tuples can be used
[Figure: Negation, Fig. 2 (e)]
Memory Overflow
- Some join operators already handle memory overflow by sending input partitions to disk
- Use of secondary storage requires indexes
  - Expensive under high update rates
- Alternative: partition the data to make updates "local"
  - Sort tuples chronologically
  - Inserts in newer partition only
  - Deletes in older partition only
  - Problem: search is not efficient; assumes insertion/expiration order is the same
    - Sub-indexes
    - Doubly partitioned indexes
Query Optimization
- Re-arrange query operators for more efficient execution
  - Traditional selectivity estimates can't be applied
  - Alternative: join reordering based on update rates
[Figure: alternative join (⋈) orderings over the inputs C, G, [now] and [range 3s]]
Query Optimization
- Adaptivity is key!
  - Processor must be able to reorder query operators on the fly
  - Changes in:
    - operator costs (processing time),
    - update rate,
    - input selectivity
[Figure: Adaptive Cost-based Optimization Algorithm, illustrated with the query "Notify two people who are co-authors of a paper if they are in the same location (within the last 30 seconds)"]
Query Optimization
- Operator routing (instead of a fixed query plan tree)
  - Eddies: estimate which operators are faster/more selective
  - Overhead: migration of internal state of query plan
- Continuous query: multi-query optimization possible
  - Better memory usage
  - Trade-offs exist (e.g. join -> selection vs. selection -> join)
- Aggregation: values bound to a variable $v_a$ are computed as $f_a(p_a, \mu[G_a])$, where $f_a(p_a, \mu[G_a])$ is the evaluation of the function $f_a$ (AVG, MAX, MIN, …) with parameters $p_a$ over the groups of values in $\mu[G_a]$. The set of groups of values in $\mu[G_a]$ is made of all the distinct tuples … the subset of the mapping $\mu[G_a]$ without duplicate rows.
- From the above query patterns, let $P_1$, $P_2$ and $P$ be basic or composite ones. A declarative query can be composed recursively using the following rules:
  1. $[[P_1 \text{ AND } P_2]] = [[P_1]] \bowtie [[P_2]]$
  2. $[[P_1 \text{ OPT } P_2]] = [[P_1]] ⟕ [[P_2]]$
  3. $[[P_1 \text{ UNION } P_2]] = [[P_1]] \cup [[P_2]]$
[Figure: selections (σ) placed over the input streams SA and SB]
Scheduling
- Data is first pushed into queues, then consumed by operators
- Scheduler determines which data in which queue to process next
  - Different scheduling strategies (e.g. round robin, arrival time, time slice)
  - Choice depends on factors such as stream arrival patterns and max/avg output latency
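A round-robin scheduler over input queues can be sketched in Python as below. It assumes one in-memory queue per stream and a `time_slice` expressed as a number of items per visit; the function name, the queue names and that interpretation of a time slice are assumptions for illustration.

```python
from collections import deque
from typing import Any, Deque, Dict, Iterator, Tuple


def round_robin(queues: Dict[str, Deque[Any]],
                time_slice: int = 1) -> Iterator[Tuple[str, Any]]:
    """Visit non-empty queues in turn, taking up to `time_slice` items per visit."""
    pending = deque(name for name, queue in queues.items() if queue)
    while pending:
        name = pending.popleft()
        queue = queues[name]
        for _ in range(min(time_slice, len(queue))):
            yield name, queue.popleft()
        if queue:                      # still has data: go to the back of the line
            pending.append(name)


queues = {"rfid": deque([1, 2, 3]), "twitter": deque(["a"])}
order = list(round_robin(queues))
```

Round robin treats all queues alike; a strategy ordered by arrival time would instead pick the queue whose head element is oldest, which tends to favour output latency over fairness.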