Creating Probabilistic Databases from Imprecise Time-Series Data

Saket Sathe, Hoyoung Jeung, Karl Aberer
EPFL, Switzerland

NCCR MICS Workshop 6th September, 2011

S. Sathe, H. Jeung, K. Aberer (2011)

EPFL, Switzerland


Outline

[Figure: probability distribution p(R) showing Alice's position in the x-y plane, centered at μ and spread over rooms 1 to 4, shown for time = 1 and time = 2. The 3σ area is used as a reasonable boundary. The probability of room 4 is the integral of p(R) dR over room4 ∩ 3σ area. A "?" marks the step from raw_values to prob_view.]

raw_values

time   x     y
1      1.1   2.3
2      1.3   2.1
:      :     :

prob_view

time   room   probability
1      1      0.5
1      2      0.1
1      3      0.3
1      4      0.1
2      1      0.2
2      2      0.4
2      3      0.1
2      4      0.3

Dynamic Density Metrics
Measure of Quality
Efficiently creating probabilistic views
Approximating Gaussian distributions using σ–cache
Parameter setting under provable guarantees
Experiments

Problem Setting

[Figure: values plotted against time, with axis ticks t-H-1, t-1 and t. Labels: $S^{H}_{t-1}$, $r_{t-1}$, $r_t$, $\hat{r}_t$, $\hat{\sigma}_t^2$, and the density $p_t(R_t)$ at time t.]

Dynamic Density Metric

Given $S^{H}_{t-1}$, the dynamic density metric infers time-dependent probability distributions $p_t(R_t)$ at time $t$, where $R_t$ is a random variable associated with $r_t$.

GARCH Metric

[Figure: values plotted against time, with axis ticks t-H-1, t-1 and t. Labels: $S^{H}_{t-1}$, $r_t$, $\hat{r}_t$, $p_t(R_t = \hat{r}_t)$, $p_t(R_t = r_t)$, and $p_t(R_t) \sim N(\hat{r}_t, \hat{\sigma}_t^2)$.]

$\hat{r}_t$ is modeled using an ARMA model.
$\hat{\sigma}_t^2$ is modeled using a GARCH model.
Thus $p_t(R_t)$ is $N(\hat{r}_t, \hat{\sigma}_t^2)$.
We refer to this approach as ARMA-GARCH.
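A minimal sketch of how one step of such a metric could produce the pair $(\hat{r}_t, \hat{\sigma}_t^2)$. It assumes the simplest orders, AR(1) for the mean and GARCH(1,1) for the variance, and it assumes the coefficients are already known; no fitting is shown. The names `Params`, `arma_garch_step` and `infer_densities` are illustrative and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Params:
    """Assumed, already-estimated coefficients (hypothetical names)."""
    phi0: float    # AR intercept
    phi1: float    # AR(1) coefficient
    alpha0: float  # GARCH constant term, > 0
    alpha1: float  # weight on the previous squared residual
    beta1: float   # weight on the previous variance


def arma_garch_step(r_prev, resid_prev, var_prev, p):
    """Return (r_hat, var_hat): mean from AR(1), variance from GARCH(1,1)."""
    r_hat = p.phi0 + p.phi1 * r_prev
    var_hat = p.alpha0 + p.alpha1 * resid_prev ** 2 + p.beta1 * var_prev
    return r_hat, var_hat


def infer_densities(values, p, var0):
    """Walk a series and emit one Gaussian (mean, variance) per time step t >= 1."""
    densities = []
    resid_prev, var_prev = 0.0, var0
    for t in range(1, len(values)):
        r_hat, var_hat = arma_garch_step(values[t - 1], resid_prev, var_prev, p)
        densities.append((r_hat, var_hat))
        resid_prev = values[t] - r_hat  # residual feeds the next variance
        var_prev = var_hat
    return densities
```

Each returned pair parameterizes one Gaussian $N(\hat{r}_t, \hat{\sigma}_t^2)$; in practice the coefficients would be estimated from the recent values rather than supplied by hand.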

Quality of Dynamic Density Metrics

Metric                       $\hat{r}_t$      $\hat{\sigma}_t^2$
ARMA-GARCH                   ARMA             GARCH
Uniform Thresholding (UT)    ARMA             u (user-specified)
Variable Thresholding (VT)   ARMA             sample variance of $S^{H}_{t-1}$
Kalman-GARCH                 Kalman Filter    GARCH

Problem: The true density $\hat{p}_t(R_t)$ is not observable.

Indirect Method

Suppose $p_1(R_1), \ldots, p_T(R_T)$ are the inferred densities and let $z_t = P(R_t \le r_t)$. Then $z_t$ is uniformly distributed on $(0, 1)$ when $p_t(R_t) = \hat{p}_t(R_t)$ [Diebold et al.].

$$d\{U_Z(z), Q_Z(z)\} = \sqrt{\sum_{x=0}^{1} \bigl(U_Z(x) - Q_Z(x)\bigr)^2}, \qquad (1)$$

where $U_Z(z)$ is the ideal uniform cdf on $(0, 1)$ and $Q_Z(z)$ is the observed cdf of $z_t$. We call $d\{U_Z(z), Q_Z(z)\}$ the density distance.
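The indirect method can be sketched in a few lines, assuming Gaussian inferred densities given as (mean, variance) pairs. The grid spacing of the sum in (1) is not stated on the slide, so the `step=0.01` default below is an assumption, as are the function names.

```python
import math


def gaussian_cdf(x, mean, var):
    """P(R <= x) for R ~ N(mean, var)."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def pit_values(observed, densities):
    """z_t = P(R_t <= r_t) for each observed r_t and inferred (mean, variance)."""
    return [gaussian_cdf(r, m, v) for r, (m, v) in zip(observed, densities)]


def density_distance(z, step=0.01):
    """Equation (1): root of the summed squared gap between the ideal
    uniform cdf and the observed cdf of z, on a grid over [0, 1].
    The grid step is an assumption, not taken from the slides."""
    n = len(z)
    steps = round(1.0 / step)
    total = 0.0
    for i in range(steps + 1):
        x = i * step
        u = x                                   # ideal uniform cdf on (0, 1)
        q = sum(1 for v in z if v <= x) / n     # observed cdf of z_t
        total += (u - q) ** 2
    return math.sqrt(total)
```

A smaller density distance means the $z_t$ values look closer to uniform, which is the indirect evidence that the inferred densities match the unobservable true ones. Because the sum is not normalized, distances are only comparable when computed on the same grid.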

Probabilistic View Generation

CREATE VIEW prob_view AS
DENSITY r OVER t
OMEGA delta=2, n=2
FROM raw_values
WHERE t >= 1 AND t [...]
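A hedged sketch of what the OMEGA clause could compute for one Gaussian $N(\hat{r}_t, \hat{\sigma}_t^2)$. The caching slide labels range boundaries as $a = \hat{r}_t + \lambda\Delta$ and $b = \hat{r}_t + (\lambda+1)\Delta$; the sketch assumes `delta` is the range width Δ, `n` is the number of ranges, and λ runs from −n/2 to n/2 − 1. That indexing is an assumption, and `omega_probabilities` is a hypothetical name.

```python
import math


def gaussian_cdf(x, mean, sigma):
    """P(R <= x) for R ~ N(mean, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def omega_probabilities(mean, sigma, delta, n):
    """Probability of each of n ranges of width delta around the mean.

    Range lam covers [mean + lam*delta, mean + (lam+1)*delta].
    The choice lam = -n/2 .. n/2 - 1 is an assumption.
    """
    probs = {}
    for lam in range(-(n // 2), n // 2):
        a = mean + lam * delta
        b = mean + (lam + 1) * delta
        probs[lam] = gaussian_cdf(b, mean, sigma) - gaussian_cdf(a, mean, sigma)
    return probs


# Example with delta=2, n=2 as in the query, and an assumed sigma of 1.0.
view_row = omega_probabilities(mean=1.1, sigma=1.0, delta=2.0, n=2)
```

With two ranges the result has one entry for λ = −1 and one for λ = 0, and the two probabilities are equal because the ranges are symmetric about the mean.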

Constraint-Aware Caching

Given: $p_t(R_t)$ and $p_{t'}(R_{t'})$ are Gaussian with $(\hat{r}_t, \hat{\sigma}_t^2)$ and $(\hat{r}_{t'}, \hat{\sigma}_{t'}^2)$.
Aim: Approximate values of $p_{t'}(R_{t'})$ by $p_t(R_t)$ when $t' > t$.

Distance constraint: guarantees that the maximum approximation error is upper bounded by the distance constraint when the cache is used.
Memory constraint: guarantees that the cache does not use more memory than that specified by the memory constraint.

[Figure: two Gaussians with the same variance, $P_t(R_t; \hat{r}_t, \hat{\sigma}_t^2)$ and $P_{t'}(R_{t'}; \hat{r}_{t'}, \hat{\sigma}_t^2)$, each with a range of width Δ marked: $a = \hat{r}_t + \lambda\Delta$, $b = \hat{r}_t + (\lambda+1)\Delta$ and $a' = \hat{r}_{t'} + \lambda\Delta$, $b' = \hat{r}_{t'} + (\lambda+1)\Delta$. $\rho_\lambda$ remains unchanged.]
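The reuse idea can be checked numerically. The sketch below assumes $\rho_\lambda$ is the probability mass of the range $[\hat{r} + \lambda\Delta, \hat{r} + (\lambda+1)\Delta]$, with λ running from −n/2 to n/2 − 1 (an assumed indexing). It reports the plain absolute difference in probabilities, which is an illustrative error measure and not the distance the slides use; all names are hypothetical.

```python
import math


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def rho(mean, sigma, lam, delta):
    """Mass of [mean + lam*delta, mean + (lam+1)*delta] under N(mean, sigma^2)."""
    a = mean + lam * delta
    b = mean + (lam + 1) * delta
    return std_normal_cdf((b - mean) / sigma) - std_normal_cdf((a - mean) / sigma)


def max_reuse_error(sigma_cached, sigma_new, delta, n):
    """Largest |rho_lam| gap when values cached for sigma_cached
    stand in for a distribution with sigma_new (means play no role)."""
    lams = range(-(n // 2), n // 2)
    return max(
        abs(rho(0.0, sigma_cached, lam, delta) - rho(0.0, sigma_new, lam, delta))
        for lam in lams
    )


# Same sigma, different means: the cached value can be reused as is.
same_sigma_gap = abs(rho(1.1, 0.5, 0, 2.0) - rho(7.3, 0.5, 0, 2.0))

# Different sigma: reuse is only approximate.
approx_gap = max_reuse_error(1.0, 1.4, delta=2.0, n=2)
```

Since the ranges are defined relative to the mean, only the standard deviation affects $\rho_\lambda$, which is why a cache keyed on σ alone is enough.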

Guaranteeing Distance Constraint

We use the Hellinger distance, denoted $H[\cdot, \cdot]$, as a distance measure; $0 \le H \le 1$.

Theorem: Distance Constraint
Given a user-defined distance constraint $H_0$, we guarantee that $H[p_t(R_t), p_{t'}(R_{t'})] \le H_0$ if $\hat{\sigma}_{t'} \le d_s \cdot \hat{\sigma}_t$ and $\hat{\sigma}_{t'} > \hat{\sigma}_t$, where the parameter $d_s$ is chosen as any value satisfying

$$d_s \le \frac{1 + \sqrt{1 - (1 - H_0^2)^4}}{(1 - H_0^2)^2}.$$

We call $d_s$ the ratio threshold.

Example
Suppose $H_0 = 0.2$; then $d_s \le 1.5$. Choose, say, $d_s = 1.4$; then if $\hat{\sigma}_{t'} / \hat{\sigma}_t \le d_s$, we have $H[p_t(R_t), p_{t'}(R_{t'})] \le 0.2$.
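The bound on the ratio threshold is easy to evaluate. The sketch below also checks it against the standard closed form for the Hellinger distance between two Gaussians with equal means, $H^2 = 1 - \sqrt{2\sigma_1\sigma_2 / (\sigma_1^2 + \sigma_2^2)}$. Treating the two distributions as mean-aligned is an assumption made here for the check; the function names are illustrative.

```python
import math


def max_ratio_threshold(h0):
    """Upper bound on d_s for a distance constraint h0, with 0 <= h0 < 1."""
    if not 0.0 <= h0 < 1.0:
        raise ValueError("h0 must satisfy 0 <= h0 < 1")
    c = (1.0 - h0 ** 2) ** 2
    return (1.0 + math.sqrt(1.0 - c ** 2)) / c


def hellinger_equal_means(sigma_a, sigma_b):
    """Hellinger distance between two Gaussians that share the same mean."""
    bc = math.sqrt(2.0 * sigma_a * sigma_b / (sigma_a ** 2 + sigma_b ** 2))
    return math.sqrt(max(0.0, 1.0 - bc))  # clamp tiny negative round-off


bound = max_ratio_threshold(0.2)                     # the slide rounds this to 1.5
dist_at_choice = hellinger_equal_means(1.0, 1.4)     # d_s = 1.4 from the example
dist_at_bound = hellinger_equal_means(1.0, bound)    # should sit at the constraint
```

Under the mean-aligned assumption, a standard deviation ratio equal to the bound gives a distance equal to $H_0$, and any smaller ratio such as 1.4 stays below it.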

Initializing the σ–cache

Let $\max(\hat{\sigma}_t)$ and $\min(\hat{\sigma}_t)$ be the maximum and minimum standard deviations observed in a probabilistic view generation query.
Compute $Q$ such that $\max(\hat{\sigma}_t) = d_s^{Q} \cdot \min(\hat{\sigma}_t)$.
$\lceil Q \rceil$ gives us the maximum number of distributions that we should cache.

[Figure: cache memory layout. Cached values are stored for the standard deviations $d_s^{1} \cdot \min(\hat{\sigma}_t)$, $d_s^{2} \cdot \min(\hat{\sigma}_t)$, up to $d_s^{Q} \cdot \min(\hat{\sigma}_t)$; other labels: n, Δ.]

Find $d_s^{q} \cdot \min(\hat{\sigma}_t)$ such that $d_s^{q} \cdot \min(\hat{\sigma}_t) \le \hat{\sigma}_{t'} < d_s^{q+1} \cdot \min(\hat{\sigma}_t)$.
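Both steps reduce to logarithms. A minimal sketch, assuming the cache is keyed by an integer exponent q; the function names and the example numbers are illustrative, and inputs that fall exactly on a boundary may be affected by floating-point rounding.

```python
import math


def cache_size(sigma_min, sigma_max, ds):
    """ceil(Q), where sigma_max = ds**Q * sigma_min."""
    q_real = math.log(sigma_max / sigma_min) / math.log(ds)
    return math.ceil(q_real)


def lookup_exponent(sigma_new, sigma_min, ds):
    """Exponent q with ds**q * sigma_min <= sigma_new < ds**(q+1) * sigma_min."""
    return math.floor(math.log(sigma_new / sigma_min) / math.log(ds))


# Assumed example: standard deviations between 1.0 and 10.0, ratio threshold 1.4.
size = cache_size(1.0, 10.0, 1.4)
q = lookup_exponent(2.5, 1.0, 1.4)
cached_sigma = 1.4 ** q * 1.0  # the cached distribution used for sigma = 2.5
```

The cached standard deviation found this way is at most a factor $d_s$ below the requested one, which is the condition the distance constraint theorem needs.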

σ–cache: Features

CREATE VIEW prob_view AS
DENSITY r OVER t
OMEGA delta=2, n=2
FROM raw_values
WHERE t >= 1 AND t [...]
