
Creating Probabilistic Databases from Imprecise Time-Series Data

Saket Sathe, Hoyoung Jeung, Karl Aberer

EPFL, Switzerland

NCCR MICS Workshop 6th September, 2011

S. Sathe, H. Jeung, K. Aberer (2011)


Outline

[Figure: probability distribution p(R) showing Alice's position in the x-y plane, centered at μ, over a floor plan with rooms 1 to 4, drawn for time = 1 and time = 2. The 3σ area is taken as a reasonable boundary, and the probability mass for room 4 is $\int_{\text{room 4} \,\cap\, 3\sigma \text{ area}} p(R)\, dR$.]

raw_values

time   x     y
1      1.1   2.3
2      1.3   2.1
:      :     :

prob_view

time   room   probability
1      1      0.5
1      2      0.1
1      3      0.3
1      4      0.1
2      1      0.2
2      2      0.4
2      3      0.1
2      4      0.3

Dynamic Density Metrics

Measure of Quality

Efficiently creating probabilistic views

Approximating Gaussian distributions using σ–cache

Parameter setting under provable guarantees

Experiments

Problem Setting

[Figure: values plotted against time, with marks at t-H-1, t-1 and t. The window $S^H_{t-1}$ ends at time t-1, and the density $p_t(R_t)$ is drawn at time t around $\hat{r}_t$ with variance $\hat{\sigma}_t^2$, next to the observed value $r_t$.]

Dynamic Density Metric

Given $S^H_{t-1}$, the dynamic density metric infers time-dependent probability distributions $p_t(R_t)$ at time $t$, where $R_t$ is a random variable associated with $r_t$.

GARCH Metric

[Figure: values plotted against time, with marks at t-H-1, t-1 and t and the window $S^H_{t-1}$. At time t the density $p_t(R_t) \sim N(\hat{r}_t, \hat{\sigma}_t^2)$ is drawn, with $p_t(R_t = \hat{r}_t)$ and $p_t(R_t = r_t)$ marked.]

$\hat{r}_t$ is modeled using an ARMA model.

$\hat{\sigma}_t^2$ is modeled using a GARCH model.

Thus $p_t(R_t)$ is $N(\hat{r}_t, \hat{\sigma}_t^2)$. We refer to this approach as ARMA-GARCH.
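The slide does not give model orders or an estimation procedure, so the following is only a minimal sketch of how such a density could be produced. It assumes an AR(1) mean model fitted by least squares (a special case of ARMA) and a GARCH(1,1) variance recursion whose coefficients `alpha0`, `alpha1` and `beta1` are hypothetical fixed values rather than estimates.

```python
from statistics import fmean

def ar1_fit(window):
    """Least-squares fit of r_t = c + phi * r_{t-1} on the window."""
    x, y = window[:-1], window[1:]
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    phi = sxy / sxx if sxx > 0 else 0.0
    return my - phi * mx, phi

def arma_garch_density(window, alpha0=0.05, alpha1=0.1, beta1=0.8):
    """Return (r_hat, sigma2_hat) describing N(r_hat, sigma2_hat) for the next step.

    alpha0, alpha1 and beta1 are hypothetical GARCH(1,1) coefficients;
    in practice they would be estimated from the data.
    """
    c, phi = ar1_fit(window)
    # One-step-ahead residuals of the AR(1) model inside the window.
    resid = [y - (c + phi * x) for x, y in zip(window[:-1], window[1:])]
    # Start the recursion from the sample residual variance.
    sigma2 = fmean([e * e for e in resid])
    # GARCH(1,1): sigma2_t = alpha0 + alpha1 * a_{t-1}^2 + beta1 * sigma2_{t-1}
    for e in resid:
        sigma2 = alpha0 + alpha1 * e * e + beta1 * sigma2
    r_hat = c + phi * window[-1]
    return r_hat, sigma2
```

Calling `arma_garch_density` on the most recent window of values yields the mean and variance of the Gaussian inferred for time t.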


Quality of Dynamic Density Metrics

Metric                        $\hat{r}_t$      $\hat{\sigma}_t^2$
ARMA-GARCH                    ARMA             GARCH
Uniform Thresholding (UT)     ARMA             u (user-specified)
Variable Thresholding (VT)    ARMA             sample variance of $S^H_{t-1}$
Kalman-GARCH                  Kalman Filter    GARCH

Problem: The true density $\hat{p}_t(R_t)$ is not observable.

Indirect Method

Suppose $p_1(R_1), \ldots, p_T(R_T)$ are the inferred densities and let $z_t = P(R_t \le r_t)$. Then $z_t$ is uniformly distributed on $(0, 1)$ when $p_t(R_t) = \hat{p}_t(R_t)$ [Diebold et al.].

$$ d\{U_Z(z), Q_Z(z)\} = \sqrt{\sum_{x=0}^{1} \left( U_Z(x) - Q_Z(x) \right)^2 } \qquad (1) $$

where $U_Z(z)$ is the ideal uniform cdf on $(0, 1)$ and $Q_Z(z)$ is the observed cdf of $z_t$. We call $d\{U_Z(z), Q_Z(z)\}$ the density distance.
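As an illustration of equation (1), the sketch below computes the density distance for Gaussian densities. It assumes that the sum is evaluated on an evenly spaced grid over [0, 1]; the grid resolution `steps` is an assumption, since the slide only states that x runs from 0 to 1. The function names are illustrative.

```python
import math

def normal_cdf(x, mean, std):
    """Cdf of N(mean, std^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def density_distance(observed, means, stds, steps=100):
    """Distance between the observed cdf of z_t and the uniform cdf.

    observed[t] is r_t; means[t] and stds[t] describe the inferred
    Gaussian p_t(R_t). `steps` (grid resolution) is an assumption.
    """
    # z_t = P(R_t <= r_t) under the inferred density.
    z = [normal_cdf(r, m, s) for r, m, s in zip(observed, means, stds)]
    total = 0.0
    for i in range(steps + 1):
        x = i / steps
        q = sum(1 for v in z if v <= x) / len(z)  # observed cdf Q_Z(x)
        total += (x - q) ** 2                     # U_Z(x) = x on (0, 1)
    return math.sqrt(total)
```

A smaller distance indicates that the inferred densities are closer to the unobservable true densities, so the value can be used to compare metrics on the same data.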


Probabilistic View Generation

    CREATE VIEW prob_view AS DENSITY r OVER t
    OMEGA delta=2, n=2
    FROM raw_values
    WHERE t >= 1 AND t ...
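A minimal sketch of what such a query might compute for one time step, given a Gaussian $N(\hat{r}_t, \hat{\sigma}_t^2)$. It assumes that `delta` is the width Δ of each value range and that the ranges are $[\hat{r}_t + \lambda\Delta, \hat{r}_t + (\lambda+1)\Delta]$ for λ = -n, ..., n-1; the exact range of λ is an assumption, and `view_rows` is an illustrative name, not part of the query language.

```python
import math

def normal_cdf(x, mean, std):
    """Cdf of N(mean, std^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def view_rows(t, r_hat, sigma_hat, delta, n):
    """Return rows (t, lam, a, b, rho) of a probabilistic view for time t.

    Assumption: lam runs over -n, ..., n-1, giving 2n ranges of width delta.
    """
    rows = []
    for lam in range(-n, n):
        a = r_hat + lam * delta
        b = r_hat + (lam + 1) * delta
        # Probability mass of the Gaussian on [a, b].
        rho = normal_cdf(b, r_hat, sigma_hat) - normal_cdf(a, r_hat, sigma_hat)
        rows.append((t, lam, a, b, rho))
    return rows

for row in view_rows(t=1, r_hat=10.0, sigma_hat=2.0, delta=2, n=2):
    print(row)
```

With these example values the four ranges cover two standard deviations on each side of the mean, so the probabilities sum to slightly less than 1.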



Constraint-Aware Caching

Given: $p_t(R_t)$ and $p_{t'}(R_{t'})$ are Gaussian with parameters $(\hat{r}_t, \hat{\sigma}_t^2)$ and $(\hat{r}_{t'}, \hat{\sigma}_{t'}^2)$.

Aim: Approximate values of $p_{t'}(R_{t'})$ by $p_t(R_t)$ when $t' > t$.

Distance constraint: guarantees that, when the cache is used, the maximum approximation error never exceeds a user-defined bound.

Memory constraint: guarantees that the cache never uses more memory than a specified limit.



[Figure: two Gaussian curves, $P_t(R_t; \hat{r}_t, \hat{\sigma}_t^2)$ and $P_{t'}(R_{t'}; \hat{r}_{t'}, \hat{\sigma}_t^2)$, with the same variance and different means. An interval of width Δ is marked under each curve: $[a, b]$ with $a = \hat{r}_t + \lambda\Delta$ and $b = \hat{r}_t + (\lambda+1)\Delta$, and $[a', b']$ with $a' = \hat{r}_{t'} + \lambda\Delta$ and $b' = \hat{r}_{t'} + (\lambda+1)\Delta$. The probability mass $\rho_\lambda$ of the interval remains unchanged.]
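The reason $\rho_\lambda$ remains unchanged can be checked numerically: after standardizing, the mean cancels, so the mass of the interval depends only on λ, Δ and the standard deviation. The sketch below is illustrative; the function names are not from the slides.

```python
import math

def std_normal_cdf(u):
    """Cdf of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def rho(lam, delta, sigma):
    """Mass of N(r_hat, sigma^2) on [r_hat + lam*delta, r_hat + (lam+1)*delta].

    r_hat cancels after standardizing, so it is not a parameter.
    """
    return std_normal_cdf((lam + 1) * delta / sigma) - std_normal_cdf(lam * delta / sigma)

def rho_direct(lam, delta, r_hat, sigma):
    """Same mass, computed from the interval end points a and b."""
    a = r_hat + lam * delta
    b = r_hat + (lam + 1) * delta
    return std_normal_cdf((b - r_hat) / sigma) - std_normal_cdf((a - r_hat) / sigma)

# Two different means, same standard deviation: the mass is identical.
print(rho_direct(1, 2.0, 5.0, 1.5), rho_direct(1, 2.0, -3.2, 1.5), rho(1, 2.0, 1.5))
```

This is why a cache indexed by standard deviation alone can serve distributions with any mean.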


Guaranteeing Distance Constraint

We use the Hellinger distance, denoted $H[\cdot, \cdot]$, as the distance measure; $0 \le H \le 1$.

Theorem (Distance Constraint): Given a user-defined distance constraint $H_0$, we guarantee that $H[p_t(R_t), p_{t'}(R_{t'})] \le H_0$ if $\hat{\sigma}_{t'} \le d_s \cdot \hat{\sigma}_t$ and $\hat{\sigma}_{t'} > \hat{\sigma}_t$, where the parameter $d_s$ is chosen as any value satisfying

$$ d_s \le \frac{1 + \sqrt{1 - \left(1 - H_0^2\right)^4}}{\left(1 - H_0^2\right)^2}. $$

We call $d_s$ the ratio threshold.

Example: Suppose $H_0 = 0.2$; then $d_s \le 1.5$. Choose, say, $d_s = 1.4$. Then if $\hat{\sigma}_{t'} / \hat{\sigma}_t \le d_s$, we have $H[p_t(R_t), p_{t'}(R_{t'})] \le 0.2$.
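The example can be reproduced with a short sketch. It assumes that the two Gaussians are compared with their means aligned (the cached intervals are defined relative to the mean), in which case the Hellinger distance has the standard closed form $H^2 = 1 - \sqrt{2\sigma_1\sigma_2 / (\sigma_1^2 + \sigma_2^2)}$. The function names are illustrative.

```python
import math

def ratio_threshold(h0):
    """Upper bound on d_s for a distance constraint h0 (0 < h0 < 1)."""
    c = (1.0 - h0 ** 2) ** 2
    return (1.0 + math.sqrt(1.0 - c ** 2)) / c

def hellinger_same_mean(sigma_a, sigma_b):
    """Hellinger distance between two Gaussians that share a mean (assumption)."""
    bc = math.sqrt(2.0 * sigma_a * sigma_b / (sigma_a ** 2 + sigma_b ** 2))
    return math.sqrt(max(0.0, 1.0 - bc))

h0 = 0.2
print(ratio_threshold(h0))            # upper bound on d_s, about 1.5
print(hellinger_same_mean(1.0, 1.4))  # distance at ratio 1.4, below h0
```

Any standard deviation ratio at or below the bound keeps the distance within $H_0$; at the bound itself the distance equals $H_0$.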


Initializing the σ–cache

Let $\max(\hat{\sigma}_t)$ and $\min(\hat{\sigma}_t)$ be the maximum and minimum standard deviations observed in a probabilistic view generation query.

Compute $Q$ such that $\max(\hat{\sigma}_t) = d_s^Q \cdot \min(\hat{\sigma}_t)$.

$\lceil Q \rceil$ gives us the maximum number of distributions that we should cache.



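[Figure: layout of the σ–cache. The cache memory holds one row of cached values per standard deviation $d_s^1 \cdot \min(\hat{\sigma}_t)$, $d_s^2 \cdot \min(\hat{\sigma}_t)$, up to $d_s^Q \cdot \min(\hat{\sigma}_t)$; the rows are labeled with n and Δ.]

Find $d_s^q \cdot \min(\hat{\sigma}_t)$ such that $d_s^q \cdot \min(\hat{\sigma}_t) \le \hat{\sigma}_{t'} < d_s^{q+1} \cdot \min(\hat{\sigma}_t)$.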
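Both steps reduce to logarithms of standard deviation ratios. The following is a minimal sketch under that reading; `cache_size` and `lookup_exponent` are illustrative names, and the correction loops are only there to guard against floating-point rounding at range boundaries.

```python
import math

def cache_size(sigma_min, sigma_max, ds):
    """ceil(Q), where sigma_max = ds**Q * sigma_min."""
    q_real = math.log(sigma_max / sigma_min) / math.log(ds)
    return math.ceil(q_real)

def lookup_exponent(sigma, sigma_min, ds):
    """q such that ds**q * sigma_min <= sigma < ds**(q+1) * sigma_min."""
    if sigma < sigma_min:
        raise ValueError("sigma must be at least sigma_min")
    q = math.floor(math.log(sigma / sigma_min) / math.log(ds))
    # Correct for rounding errors in the logarithm near range boundaries.
    while ds ** (q + 1) * sigma_min <= sigma:
        q += 1
    while ds ** q * sigma_min > sigma:
        q -= 1
    return q

ds = 1.4
print(cache_size(1.0, 5.0, ds))        # number of distributions to cache
print(lookup_exponent(2.0, 1.0, ds))   # exponent of the cached entry used for sigma = 2.0
```

The entry found by the lookup has a standard deviation within a factor of $d_s$ of the requested one, which is the condition the distance constraint theorem relies on.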


σ–cache: Features

