Sep 6, 2011 -
Creating Probabilistic Databases from Imprecise Time-Series Data Saket Sathe, Hoyoung Jeung, Karl Aberer EPFL, Switzerland
NCCR MICS Workshop 6th September, 2011
S. Sathe, H. Jeung, K. Aberer (2011)
EPFL, Switzerland
1 / 15
Outline

Dynamic Density Metrics
Measure of Quality
Efficiently creating probabilistic views
Approximating Gaussian distributions using σ–cache
Parameter setting under provable guarantees
Experiments

[Figure: a raw_values table is turned into a prob_view table. The probability distribution p(R) of Alice's position is drawn over rooms 1 to 4 in the x-y plane, centered at μ, with the 3σ area as a reasonable boundary. The probability of room 4 is the integral of p(R) dR over room4 ∩ 3σ area.]

raw_values
time   x     y
1      1.1   2.3
2      1.3   2.1
:      :     :

prob_view
time   room   probability
1      1      0.5
1      2      0.1
1      3      0.3
1      4      0.1
2      1      0.2
2      2      0.4
2      3      0.1
2      4      0.3
Problem Setting

[Figure: values plotted against time. The window $S^H_{t-1}$ spans the time axis from t-H-1 to t-1; at time t the figure marks the raw value $r_t$, the estimate $\hat{r}_t$, the variance $\hat{\sigma}_t^2$, and the density $p_t(R_t)$.]

Dynamic Density Metric
Given $S^H_{t-1}$, the dynamic density metric infers time-dependent probability distributions $p_t(R_t)$ at time $t$, where $R_t$ is a random variable associated with $r_t$.
GARCH Metric

[Figure: values plotted against time with the window $S^H_{t-1}$ from t-H-1 to t-1. At time t the density $p_t(R_t) \sim \mathcal{N}(\hat{r}_t, \hat{\sigma}_t^2)$ is drawn around $\hat{r}_t$, with $p_t(R_t = \hat{r}_t)$ and $p_t(R_t = r_t)$ marked.]

$\hat{r}_t$ is modeled using an ARMA model.
$\hat{\sigma}_t^2$ is modeled using a GARCH model.
Thus $p_t(R_t)$ is $\mathcal{N}(\hat{r}_t, \hat{\sigma}_t^2)$. We refer to this approach as ARMA-GARCH.
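As an illustration of the ARMA-GARCH idea, here is a minimal sketch that assumes the simplest possible orders: AR(1) for the mean and GARCH(1,1) for the variance. The function name and the coefficients `phi`, `omega`, `alpha`, and `beta` are hypothetical placeholders chosen by hand; the talk does not give model orders or parameter values, and in practice they would be estimated from the window.

```python
import math
from statistics import NormalDist


def arma_garch_density(window, phi=0.9, omega=0.01, alpha=0.1, beta=0.8):
    """Return a Gaussian N(r_hat, sigma2_hat) for the next time step.

    Assumption: AR(1) mean and GARCH(1,1) variance with fixed,
    hand-picked coefficients (not estimated from data).
    """
    if len(window) < 2:
        raise ValueError("window needs at least two values")
    if alpha + beta >= 1.0:
        raise ValueError("alpha + beta must be below 1 for a finite variance")

    # Start from the unconditional variance of a GARCH(1,1) process.
    sigma2 = omega / (1.0 - alpha - beta)

    # Walk through the window, updating the conditional variance with
    # the squared one-step AR(1) forecast error at each step.
    for prev, cur in zip(window, window[1:]):
        resid = cur - phi * prev
        sigma2 = omega + alpha * resid ** 2 + beta * sigma2

    r_hat = phi * window[-1]  # AR(1) forecast of the next value
    return NormalDist(mu=r_hat, sigma=math.sqrt(sigma2))
```

For example, `arma_garch_density([1.0, 1.1, 0.9, 1.0])` returns a `NormalDist` whose mean is the AR(1) forecast and whose variance widens after large forecast errors inside the window.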
Quality of Dynamic Density Metrics

Metric                        $\hat{r}_t$      $\hat{\sigma}_t^2$
ARMA-GARCH                    ARMA             GARCH
Uniform Thresholding (UT)     ARMA             u (user-specified)
Variable Thresholding (VT)    ARMA             sample variance of $S^H_{t-1}$
Kalman-GARCH                  Kalman Filter    GARCH

Problem: The true density $\hat{p}_t(R_t)$ is not observable.

Indirect Method
Suppose $p_1(R_1), \ldots, p_T(R_T)$ are the inferred densities and let $z_t = P(R_t \le r_t)$. Then $z_t$ is uniformly distributed on $(0, 1)$ when $p_t(R_t) = \hat{p}_t(R_t)$ [Diebold et al.].

$$ d\{U_Z(z), Q_Z(z)\} = \sqrt{\sum_{x=0}^{1} \bigl(U_Z(x) - Q_Z(x)\bigr)^2}, \qquad (1) $$

where $U_Z(z)$ is the ideal uniform cdf on $(0, 1)$ and $Q_Z(z)$ is the observed cdf of $z_t$. We call $d\{U_Z(z), Q_Z(z)\}$ the density distance.
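The density distance of equation (1) can be sketched as follows. This is an illustration under stated assumptions: the inferred densities are Gaussian (`statistics.NormalDist`), and the sum over x runs over an evenly spaced grid on [0, 1] whose resolution `grid_size` is a hypothetical parameter, since the slide does not state the step size.

```python
from statistics import NormalDist


def density_distance(densities, observations, grid_size=100):
    """Density distance between the uniform cdf and the observed cdf of z_t.

    densities    -- inferred densities p_t(R_t), one NormalDist per time step
    observations -- raw values r_t, aligned with the densities
    grid_size    -- assumed number of steps used to discretize x in [0, 1]
    """
    # Probability integral transform: z_t = P(R_t <= r_t).
    z = [p.cdf(r) for p, r in zip(densities, observations)]
    if not z:
        raise ValueError("need at least one (density, observation) pair")

    total = 0.0
    for i in range(grid_size + 1):
        x = i / grid_size
        q = sum(1 for v in z if v <= x) / len(z)  # observed cdf Q_Z(x)
        total += (x - q) ** 2                     # uniform cdf U_Z(x) = x
    return total ** 0.5
```

A smaller distance means the z_t values look more uniform, which indirectly suggests that the inferred densities are closer to the unobservable true densities.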
Probabilistic View Generation

CREATE VIEW prob_view AS
  DENSITY r OVER t
  OMEGA delta=2, n=2
  FROM raw_values
  WHERE t >= 1 AND t
Constraint-Aware Caching

Given: $p_t(R_t)$ and $p_{t'}(R_{t'})$ are Gaussian with $(\hat{r}_t, \hat{\sigma}_t^2)$ and $(\hat{r}_{t'}, \hat{\sigma}_{t'}^2)$.
Aim: Approximate values of $p_{t'}(R_{t'})$ by $p_t(R_t)$ when $t' > t$.

Distance constraint: guarantees that the maximum approximation error is upper bounded by the distance constraint when the cache is used.
Memory constraint: guarantees that the cache does not use more memory than that specified by the memory constraint.

[Figure: two Gaussian densities $P_t(R_t; \hat{r}_t, \hat{\sigma}_t^2)$ and $P_{t'}(R_{t'}; \hat{r}_{t'}, \hat{\sigma}_t^2)$ with the same variance and different means. An interval of width Δ is marked under each: $[a, b]$ with $a = \hat{r}_t + \lambda\Delta$, $b = \hat{r}_t + (\lambda+1)\Delta$, and $[a', b']$ with $a' = \hat{r}_{t'} + \lambda\Delta$, $b' = \hat{r}_{t'} + (\lambda+1)\Delta$. $\rho_\lambda$ remains unchanged.]
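The observation that $\rho_\lambda$ remains unchanged can be checked numerically. The sketch below assumes that $\rho_\lambda$ denotes the probability mass of the Gaussian on the interval from $\hat{r} + \lambda\Delta$ to $\hat{r} + (\lambda+1)\Delta$, which is how the figure labels read; the function name `rho` and the numeric values in the usage note are illustrative only.

```python
from statistics import NormalDist


def rho(r_hat, sigma, lam, delta):
    """Mass of N(r_hat, sigma^2) on [r_hat + lam*delta, r_hat + (lam+1)*delta].

    Assumption: this is what rho_lambda denotes on the slide.
    """
    dist = NormalDist(mu=r_hat, sigma=sigma)
    a = r_hat + lam * delta          # left edge of the interval
    b = r_hat + (lam + 1) * delta    # right edge of the interval
    return dist.cdf(b) - dist.cdf(a)
```

Calling `rho(1.1, 0.5, 0, 2.0)` and `rho(7.3, 0.5, 0, 2.0)` gives the same value up to floating-point error: because the interval is defined relative to the mean, the result depends only on the standard deviation, λ, and Δ. That is why cached values can be keyed on the standard deviation alone.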
Guaranteeing Distance Constraint

We use the Hellinger distance, denoted $H[\cdot, \cdot]$, as a distance measure; $0 \le H \le 1$.

Theorem: Distance Constraint
Given a user-defined distance constraint $H_0$, we guarantee that $H[p_t(R_t), p_{t'}(R_{t'})] \le H_0$ if $\hat{\sigma}_{t'} \le d_s \cdot \hat{\sigma}_t$ and $\hat{\sigma}_{t'} > \hat{\sigma}_t$, where the parameter $d_s$ is chosen as any value satisfying

$$ d_s \le \frac{1 + \sqrt{1 - (1 - H_0^2)^4}}{(1 - H_0^2)^2}. $$

We call $d_s$ the ratio threshold.

Example
Suppose $H_0 = 0.2$; then $d_s \le 1.5$. Choose, say, $d_s = 1.4$; then if $\hat{\sigma}_{t'} / \hat{\sigma}_t \le d_s$, then $H[p_t(R_t), p_{t'}(R_{t'})] \le 0.2$.
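The bound on the ratio threshold and the worked example can be reproduced with a short sketch. It assumes the two Gaussians are compared with equal means, so that only the standard deviations matter, and uses the standard closed form of the Hellinger distance for that case. Both function names are hypothetical.

```python
import math


def max_ratio_threshold(h0):
    """Largest d_s allowed by the distance-constraint bound for a given H_0."""
    if not 0.0 <= h0 < 1.0:
        raise ValueError("h0 must lie in [0, 1)")
    c = (1.0 - h0 ** 2) ** 2
    return (1.0 + math.sqrt(1.0 - c ** 2)) / c


def hellinger_same_mean(sigma_a, sigma_b):
    """Hellinger distance between two Gaussians that share the same mean.

    Assumption: equal means, so H^2 = 1 - sqrt(2*sa*sb / (sa^2 + sb^2)).
    """
    bc = math.sqrt(2.0 * sigma_a * sigma_b / (sigma_a ** 2 + sigma_b ** 2))
    return math.sqrt(max(0.0, 1.0 - bc))
```

With `h0 = 0.2`, `max_ratio_threshold` returns a value slightly above 1.5, and `hellinger_same_mean(1.0, 1.4)` stays below 0.2, in line with the example on the slide.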
Initializing the σ–cache

Let $\max(\hat{\sigma}_t)$ and $\min(\hat{\sigma}_t)$ be the maximum and minimum standard deviations observed in a probabilistic view generation query.
Compute $Q$ such that $\max(\hat{\sigma}_t) = d_s^{Q} \cdot \min(\hat{\sigma}_t)$.
$\lceil Q \rceil$ gives us the maximum number of distributions that we should cache.

[Figure: cache memory layout with rows labeled $d_s^{1} \cdot \min(\hat{\sigma}_t)$, $d_s^{2} \cdot \min(\hat{\sigma}_t)$, ..., $d_s^{Q} \cdot \min(\hat{\sigma}_t)$, each holding cached values; the figure also marks n and Δ.]

Find $d_s^{q} \cdot \min(\hat{\sigma}_t)$ such that $d_s^{q} \cdot \min(\hat{\sigma}_t) \le \hat{\sigma}_{t'} < d_s^{q+1} \cdot \min(\hat{\sigma}_t)$.
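The initialization and lookup steps can be sketched as follows. This is an illustration, not the authors' implementation: the function names are hypothetical, the cached entries are indexed from q = 0 (an assumption of this sketch), and the lookup clamps q into the valid range so that the largest standard deviation maps to the last cached entry.

```python
import math


def cache_size(sigma_min, sigma_max, d_s):
    """ceil(Q), where sigma_max = d_s**Q * sigma_min."""
    if sigma_min <= 0 or sigma_max < sigma_min or d_s <= 1:
        raise ValueError("need 0 < sigma_min <= sigma_max and d_s > 1")
    q = math.log(sigma_max / sigma_min) / math.log(d_s)
    return max(1, math.ceil(q))


def cached_sigmas(sigma_min, sigma_max, d_s):
    """Standard deviations d_s**q * sigma_min for which values are cached."""
    size = cache_size(sigma_min, sigma_max, d_s)
    return [sigma_min * d_s ** q for q in range(size)]


def lookup(sigma, sigma_min, d_s, size):
    """Index q with d_s**q * sigma_min <= sigma < d_s**(q+1) * sigma_min."""
    q = math.floor(math.log(sigma / sigma_min) / math.log(d_s))
    return min(max(q, 0), size - 1)  # clamp to the cached range
```

For instance, with `sigma_min = 1.0`, `sigma_max = 4.0`, and `d_s = 1.4`, five standard deviations are cached, and a query with standard deviation 2.0 maps to the entry for 1.96, whose ratio to 2.0 is within the ratio threshold.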
σ–cache: Features
CREATE VIEW prob_view AS DENSITY r OVER t OMEGA delta=2, n=2 FROM raw_values WHERE t >= 1 AND t