information-based data stream summary


Fabrice Clérot, Pascal Gouzien Orange Labs

France Telecom Group

agenda

– data-stream summary
– information-based summary
– performance with a constant memory constraint
– performance with a time-varying memory constraint
– related work : reglo
– including weights in the compression rate
– conclusion
– references

Orange Labs - Research & Development - IBDSS – 17/11/2010


data stream



an infinite sequence of events

event :
– event.timestamp = t ∈ T
– event.data = X ∈ X
– data space X unspecified at this point
– denoted X(t) but not necessarily a "time series"

"minimal" assumptions
– events are observed in increasing timestamp order
– events are not lost

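The event model above can be sketched as a small data structure. This is only an illustration: the names `Event` and `checked_stream` are hypothetical, timestamps are assumed to be numbers, and "increasing" is read here as non-decreasing (ties allowed), which the slides do not specify.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, Optional


@dataclass(frozen=True)
class Event:
    timestamp: float  # t in T (assumed numeric for this sketch)
    data: Any         # X in the data space, left unspecified


def checked_stream(events: Iterable[Event]) -> Iterator[Event]:
    """Yield events unchanged, enforcing the timestamp-order assumption."""
    last: Optional[float] = None
    for event in events:
        # Assumption: equal timestamps are tolerated, going backwards is not.
        if last is not None and event.timestamp < last:
            raise ValueError("events must arrive in increasing timestamp order")
        last = event.timestamp
        yield event
```

The second assumption (events are not lost) cannot be checked from the stream alone, so it is left out of the sketch.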

generic data stream summary [collectif midas] 

a data structure designed so as to keep "as much information as possible" on the stream under
– memory usage constraints
– computation time constraints
  – for the on-line maintenance of the summary
  – for the off-line answer to queries


generic data stream summary [collectif midas]

"as much information as possible" :
– allow the computation of the (approximate) answer to any query on the past of the stream
– allow the (approximate) density estimation of T×X (would reach the target)
– allow the computation of error bounds on the answer to any query (with respect to the memory and cpu constraints)


generic data stream summary



hereafter we discuss only sampling-based summaries
– computationally fast
– memory usage constraint naturally addressed
– resample if necessary
– samples are ok for approximate density estimation
– elements kept in the sample are given a weight inversely proportional to the sampling rate they experienced
– query processing on a sample is as fast as on the original data

examples
– random sampling
– reservoir sampling [vitter]
– streamsamp [csernel et al], [gabsi et al]
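As an illustration of a sampling-based summary with weights, here is a minimal sketch of reservoir sampling in its classic "Algorithm R" form. The function name `reservoir_sample`, its parameters, and the choice to return a single weight shared by all kept elements are assumptions made for this example, not something specified in the slides.

```python
import random
from typing import Any, Iterable, List, Tuple


def reservoir_sample(stream: Iterable[Any], k: int,
                     rng: random.Random) -> Tuple[List[Any], float]:
    """Keep a uniform sample of at most k items; return (sample, weight)."""
    sample: List[Any] = []
    seen = 0
    for item in stream:
        seen += 1
        if len(sample) < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(seen)   # uniform integer in [0, seen)
            if j < k:
                sample[j] = item      # replace with probability k / seen
    # Each kept element experienced a sampling rate of len(sample) / seen,
    # so its weight is the inverse of that rate.
    weight = seen / len(sample) if sample else 0.0
    return sample, weight
```

With this weighting, `weight * len(sample)` recovers the number of events seen, which is why the volume stays constant while the weight grows with the length of the stream.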


sampling-based summaries [gabsi, phd thesis]

[figure: timeline from 0 to t (present); the query covers the period from t-τ to t]

two main characteristics :
• the volume of the sample
• the weight of the elements of the sample
as a function of
• the current time (or size of the stream seen so far) : t
• the "temporal depth" of the query : τ


sampling-based summaries [gabsi, phd thesis]

[figure: timeline from 0 to t (present); the query covers the period from t-τ to t]

                      volume (t)     sample weight (t, τ)
random sampling       O(t)           constant
reservoir sampling    constant       O(t), independent of τ
streamsamp            O(Log(t))      O(τ), independent of t

sampling-based summaries



streamsamp tries to get the best of both worlds
– slow increase in volume
– the sample weight increases with the age of the sample, measured from the present time, independently of the duration of the stream

but streamsamp has a strong deterministic bias against the past

information-based summaries aim at suppressing this bias
– sample where sampling degrades the signal as little as possible


sampling-based summaries

[figure: timeline from 0 to t (present); the query covers the period from t-τ to t]

                      volume (t)     sample weight (t, τ)
random sampling       O(t)           constant
reservoir sampling    constant       O(t), independent of τ
streamsamp            O(Log(t))      O(τ), independent of t
information based     tunable        optimal


principle in the case of a constant memory constraint 

summary is built from S windows of F elements
– volume is limited to S*F elements
– windows are samples of a time period on the stream
– windows are ordered accordingly

[figure: windows W5 to W1 and the input window W0 laid out along a time axis from 0 to t]

incoming stream events are stored in an input window

when the input window W0 is full, room is made by merging two adjacent windows
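The bookkeeping described above can be sketched as follows. This is a skeleton under stated assumptions, not the authors' implementation: the names `Window`, `WindowedSummary`, `choose_pair` and `merge` are hypothetical, the input window W0 is counted among the S windows, and both the choice of the pair to merge and the merge itself are left to caller-supplied functions.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Window:
    elements: List[Any]
    weight: float  # sample weight shared by every element of the window


class WindowedSummary:
    """At most S windows of at most F elements; windows[0] is the input window W0."""

    def __init__(self, S: int, F: int,
                 choose_pair: Callable[[List[Window]], int],
                 merge: Callable[[Window, Window, int], Window]) -> None:
        if S < 2 or F < 1:
            raise ValueError("need S >= 2 and F >= 1")
        self.S, self.F = S, F
        self.choose_pair = choose_pair  # returns j: merge windows j and j+1
        self.merge = merge              # must return a window of at most F elements
        self.windows: List[Window] = [Window([], 1.0)]

    def push(self, event: Any) -> None:
        w0 = self.windows[0]
        w0.elements.append(event)
        if len(w0.elements) < self.F:
            return
        # W0 is full. If all S slots are used, free one by merging a pair.
        if len(self.windows) == self.S:
            j = self.choose_pair(self.windows)
            merged = self.merge(self.windows[j], self.windows[j + 1], self.F)
            self.windows[j:j + 2] = [merged]
        # Start a fresh, empty input window (assumed weight 1: nothing sampled yet).
        self.windows.insert(0, Window([], 1.0))

    def volume(self) -> int:
        return sum(len(w.elements) for w in self.windows)
```

As long as `merge` returns no more than F elements, `volume()` never exceeds S*F, whatever policy `choose_pair` applies.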


merging of two adjacent windows

[figure: adjacent windows W1 and W2 are merged into a single window W*]

stratified sampling with respect to the sample weights in each window
– F*w1/(w1+w2) elements from W1
– F*w2/(w1+w2) elements from W2

these elements form a new window W* of size F

the sample weight of each sample in the resulting window is the sum of the sample weights
– w* = w1 + w2
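The merge rule above can be sketched in a few lines. The function name `merge_windows` and its signature are hypothetical, and two details are assumptions of this sketch rather than statements from the slides: the share F*w1/(w1+w2) is rounded to the nearest integer, and elements are drawn uniformly without replacement inside each window, keeping their original order.

```python
import random
from typing import Any, List, Tuple


def merge_windows(elems1: List[Any], w1: float,
                  elems2: List[Any], w2: float,
                  F: int, rng: random.Random) -> Tuple[List[Any], float]:
    """Stratified merge of two windows into one window of at most F elements."""
    # Number of elements taken from each window, proportional to its weight.
    n1 = min(round(F * w1 / (w1 + w2)), len(elems1))
    n2 = min(F - n1, len(elems2))
    # Sample positions, then sort them so the arrival order is preserved.
    keep1 = sorted(rng.sample(range(len(elems1)), n1))
    keep2 = sorted(rng.sample(range(len(elems2)), n2))
    merged = [elems1[i] for i in keep1] + [elems2[i] for i in keep2]
    # Every element of the merged window carries the summed weight.
    return merged, w1 + w2
```

When both windows are full, the total weight is preserved: F*w1 + F*w2 before the merge equals F*(w1 + w2) after it.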


merging strategy

[figure: adjacent windows W(i) and W(i+1) are fed to a classifier; its classification performance is Perf(i)]

merge indistinguishable windows!

two adjacent windows are considered as two different labels

use a classifier to learn the label from the data

the worse the classification performance, the more indistinguishable the windows are
– merge the window pair with the minimum classification performance
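The selection rule can be illustrated with a deliberately simple classifier. The slides do not say which classifier or performance measure is used; this sketch assumes one-dimensional numeric data, a nearest-centroid classifier, and accuracy measured on the same data it was fitted on. The names `pair_performance` and `pair_to_merge` are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def _credit(x: float, own: float, other: float) -> float:
    """1 if x is closer to its own centroid, 0.5 on a tie, 0 otherwise."""
    d_own, d_other = abs(x - own), abs(x - other)
    if d_own < d_other:
        return 1.0
    return 0.5 if d_own == d_other else 0.0


def pair_performance(a: Sequence[float], b: Sequence[float]) -> float:
    """Accuracy of a nearest-centroid classifier telling window a from window b."""
    ca, cb = mean(a), mean(b)
    correct = sum(_credit(x, ca, cb) for x in a)
    correct += sum(_credit(x, cb, ca) for x in b)
    return correct / (len(a) + len(b))


def pair_to_merge(windows: List[Sequence[float]]) -> int:
    """Index i of the adjacent pair (i, i+1) with the lowest performance."""
    perfs = [pair_performance(windows[i], windows[i + 1])
             for i in range(len(windows) - 1)]
    return min(range(len(perfs)), key=perfs.__getitem__)
```

Two windows drawn from the same distribution score close to 0.5 (chance level) and are merged first, while windows on either side of an abrupt change score close to 1 and are kept apart.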


merging strategy

[figure: adjacent windows W(i) and W(i+1)]

Perf(i) is unchanged if W(i) and W(i+1) are not merged

when W(0) is full
– compute Perf(0)
– find j = argmin_{i=0…S-1} Perf(i)
– merge W(j) and W(j+1) into W*
– W(j+1) ← W*
– for i