Scalable multiscale density estimation
arXiv:1410.7692v1 [stat.ME] 28 Oct 2014
Ye Wang Duke University
[email protected]
Antonio Canale Università degli Studi di Torino and Collegio Carlo Alberto
[email protected]
Abstract Although Bayesian density estimation using discrete mixtures has good performance in modest dimensions, there is a lack of statistical and computational scalability to high-dimensional multivariate cases. To combat the curse of dimensionality, it is necessary to assume the data are concentrated near a lower-dimensional subspace. However, Bayesian methods for learning this subspace along with the density of the data scale poorly computationally. To solve this problem, we propose an empirical Bayes approach, which estimates a multiscale dictionary using geometric multiresolution analysis in a first stage. We use this dictionary within a multiscale mixture model, which allows uncertainty in component allocation, mixture weights and scaling factors over a binary tree. A computational algorithm is proposed, which scales efficiently to massive dimensional problems. We provide some theoretical support for this geometric density estimation (GEODE) method, and illustrate the performance through simulated and real data examples.
1 Introduction
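Let $y_i = (y_{i1}, \dots, y_{iD})^T$, for $i = 1, \dots, n$, be a sample from an unknown distribution having support in a subset of [...]

[...] $p\{d_\infty(\Omega_{s,h}, \Omega^d_{s,h}) > \epsilon\} < \dfrac{6ba^d}{\epsilon(1-a)}$ for $d > 2\log\{b/(\epsilon(1-a))\}/\log(1/a)$, where $d_\infty(\Omega_{s,h}, \Omega^d_{s,h})$ is defined as $\|\Omega_{s,h} - \Omega^d_{s,h}\|_\infty$, $\|A\|_\infty$ denotes the maximum absolute row sum of the matrix $A$, $b = E(\sigma_s^2)$ and $a = E(1/\tau_{s,h,1})$.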
Figure 7: A 4-level binary tree decomposition of a parabola using METIS, with the black rectangles denoting the second-level cells, the red denoting the third-level cells and the green denoting the leaf cells.

Proof. With a slight abuse of notation, we write $u_{s,h,k}$ as $u$ and let $A = \prod_{m=1}^{k} \tau_{s,h,m}$. Let $\Delta^d = \Psi\Sigma_{s,h}\Psi^T - \Psi^d\Sigma^d_{s,h}(\Psi^d)^T$, $\Delta^d = \{a^d_{i,j}\}$ and $\Psi = \{\psi_{i,j}\}$. Clearly, $d_\infty(\Omega_{s,h}, \Omega^d_{s,h}) = \max_{1\le i,j\le D} |a^d_{i,j}|$, and $a^d_{i,j} = \sum_{k=d+1}^{D} \alpha_k^2 \psi_{i,k}\psi_{j,k}$. By the Cauchy-Schwarz inequality,

$$
\Big|\sum_{k=d+1}^{D} \alpha_k^2 \psi_{i,k}\psi_{j,k}\Big| \le \max_{1\le m\le D}\Big(\sum_{k=d+1}^{D} \alpha_k^2 \psi_{m,k}^2\Big).
$$

Since $\Psi$ is orthonormal, we have $\psi_{i,j}^2 \le 1$ for any $i$ and $j$. Hence

$$
d_\infty(\Omega_{s,h}, \Omega^d_{s,h}) \le \sum_{k=d+1}^{D} \alpha_k^2.
$$
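As a numerical illustration of the bound $d_\infty(\Omega_{s,h}, \Omega^d_{s,h}) \le \sum_{k=d+1}^{D} \alpha_k^2$, the following minimal sketch builds an orthonormal matrix by Gram-Schmidt and compares the largest entry of the truncation gap with the tail sum. The dimensions, the seed, the uniform draws used for $\alpha_k^2$, and the function names are all illustrative assumptions, not part of the paper.

```python
import random


def gram_schmidt(vectors):
    """Orthonormalise a list of vectors with modified Gram-Schmidt."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            proj = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - proj * qi for wi, qi in zip(w, q)]
        norm = sum(wi * wi for wi in w) ** 0.5
        basis.append([wi / norm for wi in w])
    return basis


def truncation_gap(D=6, d=3, seed=0):
    """Return (max_{i,j} |a^d_{i,j}|, sum_{k>d} alpha_k^2) for a random example."""
    rng = random.Random(seed)
    # cols[k][i] plays the role of psi_{i,k}: the k-th orthonormal column of Psi.
    cols = gram_schmidt([[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(D)])
    # Hypothetical nonnegative values standing in for alpha_k^2.
    alpha2 = [rng.uniform(0.0, 1.0) for _ in range(D)]
    # a^d_{i,j} = sum over the discarded columns k = d+1, ..., D (0-based: d, ..., D-1).
    max_entry = max(
        abs(sum(alpha2[k] * cols[k][i] * cols[k][j] for k in range(d, D)))
        for i in range(D)
        for j in range(D)
    )
    bound = sum(alpha2[d:])
    return max_entry, bound


if __name__ == "__main__":
    max_entry, bound = truncation_gap()
    print(max_entry, bound, max_entry <= bound)
```

The check only uses $|\psi_{i,k}| \le 1$, so it holds for any orthonormal $\Psi$ and any nonnegative $\alpha_k^2$, whatever their distribution.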
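For a fixed $\epsilon > 0$, by Chebyshev's inequality,

$$
\begin{aligned}
p\{d_\infty(\Omega_{s,h}, \Omega^d_{s,h}) \le \epsilon\}
&\ge p\Big(\sum_{k=d+1}^{D} \alpha_k^2 \le \epsilon\Big)
= E\Big\{p\Big(\sum_{k=d+1}^{D} \alpha_k^2 \le \epsilon \,\Big|\, \tau\Big)\Big\} \\
&= 1 - E\Big\{p\Big(\sum_{k=d+1}^{D} \alpha_k^2 > \epsilon \,\Big|\, \tau\Big)\Big\}
\ge 1 - E\Big\{\frac{E\big(\sum_{k=d+1}^{D} \alpha_k^2 \,\big|\, \tau\big)}{\epsilon}\Big\}.
\end{aligned}
$$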
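By design we have $u \sim \mathrm{Ga}_{(0,1)}(A + 1, 1)$, and $u$ and $\sigma_s^2$ are conditionally independent, hence

$$
E\Big[\Big(\frac{1}{u} - 1\Big)\sigma_s^2 \,\Big|\, \tau\Big] = E\Big[\Big(\frac{1}{u} - 1\Big) \,\Big|\, \tau\Big]\, E(\sigma_s^2).
$$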
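Then we have

$$
\begin{aligned}
E\Big[\Big(\frac{1}{u} - 1\Big) \,\Big|\, \tau\Big]
&= \frac{\int_0^1 (1/u - 1)\, \frac{u^A e^{-u}}{\Gamma(A+1)}\, du}{\int_0^1 \frac{u^A e^{-u}}{\Gamma(A+1)}\, du}
= \frac{\int_0^1 (1/u)\, u^A e^{-u}\, du}{\int_0^1 u^A e^{-u}\, du} - 1 \\
&= \frac{\int_0^1 u^{A-1} e^{-u}\, du}{\int_0^1 u^A e^{-u}\, du} - 1
= \frac{\frac{1}{A} u^A e^{-u}\big|_0^1 + \int_0^1 \frac{1}{A} u^A e^{-u}\, du}{\int_0^1 u^A e^{-u}\, du} - 1 \\
&= \frac{e^{-1}}{A \int_0^1 u^A e^{-u}\, du} - 1 + \frac{1}{A}.
\end{aligned}
$$

Let $\gamma(s, x) = \int_0^x t^{s-1} e^{-t}\, dt$ be the lower incomplete Gamma function. Note that

$$
\begin{aligned}
A\gamma(A + 1, 1)
&= \frac{A}{A+1}\, u^{A+1} e^{-u}\Big|_0^1 + \frac{A}{A+1}\, \gamma(A + 2, 1) \\
&= \frac{A}{A+1}\, e^{-1} + \frac{A}{A+1}\,\frac{1}{A+2}\, e^{-1} + \frac{A}{A+1}\,\frac{1}{A+2}\, \gamma(A + 3, 1) \\
&= \lim_{K\to\infty} \Big\{ \sum_{k=1}^{K} \frac{\Gamma(A+1)^2}{\Gamma(A)\Gamma(A+k+1)}\, e^{-1} + A\Gamma(A+1)\, F(1; A + K, 1) \Big\} \\
&= \sum_{k=1}^{\infty} \frac{\Gamma(A+1)^2}{\Gamma(A)\Gamma(A+k+1)}\, e^{-1}
= \sum_{k=1}^{\infty} \frac{A}{(A+1)(A+2)\cdots(A+k)}\, e^{-1},
\end{aligned}
$$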
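where $F(x; a, b)$ is the cdf of $\mathrm{Ga}(a, b)$ and $\lim_{a\to\infty} F(1; a, 1) = 0$. Furthermore we have

$$
\sum_{k=1}^{\infty} \frac{\Gamma(A+1)^2}{\Gamma(A)\Gamma(A+k+1)} = \sum_{k=1}^{\infty} \frac{A}{(A+1)(A+2)\cdots(A+k)} \ge \frac{1}{2},
$$

and

$$
1 - \sum_{k=1}^{\infty} \frac{\Gamma(A+1)^2}{\Gamma(A)\Gamma(A+k+1)} \le 1 - \frac{A}{A+1} \le \frac{1}{A},
$$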
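thus we have

$$
\begin{aligned}
\frac{e^{-1}}{A \int_0^1 u^A e^{-u}\, du} - 1 + \frac{1}{A}
&= \frac{1}{\sum_{k=1}^{\infty} \frac{\Gamma(A+1)^2}{\Gamma(A)\Gamma(A+k+1)}} - 1 + \frac{1}{A} \\
&= \frac{1 - \sum_{k=1}^{\infty} \frac{\Gamma(A+1)^2}{\Gamma(A)\Gamma(A+k+1)}}{\sum_{k=1}^{\infty} \frac{\Gamma(A+1)^2}{\Gamma(A)\Gamma(A+k+1)}} + \frac{1}{A} \\
&\le \frac{1/A}{1/2} + \frac{1}{A} = \frac{3}{A}.
\end{aligned}
$$

Hence $E[(1/u - 1) \mid \tau] \le 3 / \big(\prod_{m=1}^{k} \tau_{s,h,m}\big)$. Based on this inequality, we have

$$
\sum_{k=d+1}^{D} E\Big\{E\Big[\Big(\frac{1}{u_{s,h,k}} - 1\Big)\sigma_s^2 \,\Big|\, \tau\Big]\Big\}
\le \sum_{k=d+1}^{D} E\Big(\frac{3}{\prod_{m=1}^{k} \tau_{s,h,m}}\Big)\, E(\sigma_s^2)
= \sum_{k=d+1}^{D} 3ba^k
\le \frac{3ba^d}{1-a},
$$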
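where $b = E(\sigma_s^2)$ and $a = E(1/\tau_{s,h,1})$. Note that $\tau_{s,h,m} \sim \mathrm{Exp}_{[1,\infty)}(\lambda)$, thus $a < 1$. By Fubini's theorem,

$$
E\Big\{E\Big(\sum_{k=d+1}^{D} \alpha_k^2 \,\Big|\, \tau\Big)\Big\} = \sum_{k=d+1}^{D} E\Big\{E\Big[\Big(\frac{1}{u_{s,h,k}} - 1\Big)\sigma_s^2 \,\Big|\, \tau\Big]\Big\}.
$$

Now use the inequality $(1 - x/2) > \exp(-x)$ for $0 < x \le 1.5$ to get

$$
p\{d_\infty(\Omega_{s,h}, \Omega^d_{s,h}) \le \epsilon\} \ge \exp\Big\{\frac{-6ba^d}{\epsilon(1-a)}\Big\}
$$

if $d > 2\log\{b/(\epsilon(1-a))\}/\log(1/a)$. Hence,

$$
p\{d_\infty(\Omega_{s,h}, \Omega^d_{s,h}) > \epsilon\} \le 1 - \exp\Big\{\frac{-6ba^d}{\epsilon(1-a)}\Big\} \le \frac{6ba^d}{\epsilon(1-a)},
$$

since $6ba^d/\{\epsilon(1-a)\} < 1$.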
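Two steps of the proof above lend themselves to a quick numerical sanity check: the bound $E[(1/u - 1) \mid \tau] \le 3/A$ for $u \sim \mathrm{Ga}_{(0,1)}(A+1, 1)$, and the geometric-sum step leading to $6ba^d/\{\epsilon(1-a)\}$. The sketch below evaluates the expectation both through the series form $1/S - 1 + 1/A$, with $S = \sum_{k\ge 1} A/\{(A+1)\cdots(A+k)\}$, and by direct midpoint quadrature. The function names, tolerances, and the values of $A$, $a$, $b$, $d$ and $\epsilon$ are illustrative assumptions, not quantities taken from the paper.

```python
import math


def series_S(A, tol=1e-15, max_terms=10_000):
    """S = sum_{k>=1} A / ((A+1)(A+2)...(A+k)), so A*gamma(A+1, 1) = S*exp(-1)."""
    term, total = 1.0, 0.0
    for k in range(1, max_terms + 1):
        term /= (A + k)          # term = 1 / ((A+1)...(A+k))
        total += A * term
        if A * term < tol:
            break
    return total


def mean_inv_u_minus_one(A):
    """E[1/u - 1] for u ~ Gamma(A+1, 1) truncated to (0, 1), as 1/S - 1 + 1/A."""
    return 1.0 / series_S(A) - 1.0 + 1.0 / A


def mean_by_quadrature(A, n=100_000):
    """The same expectation by midpoint-rule integration (independent check)."""
    h = 1.0 / n
    num = den = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        w = u ** (A - 1) * math.exp(-u)
        num += w        # integrand u^(A-1) e^(-u)
        den += w * u    # integrand u^A e^(-u)
    return num / den - 1.0


def tail_bound(a, b, d, eps):
    """Right-hand side 6 b a^d / (eps (1 - a)) of the final probability bound."""
    return 6.0 * b * a ** d / (eps * (1.0 - a))


if __name__ == "__main__":
    for A in (1.0, 2.5, 10.0, 100.0):   # A >= 1 because each tau >= 1
        print(A, mean_inv_u_minus_one(A), mean_by_quadrature(A), 3.0 / A)
    print(tail_bound(a=0.5, b=1.0, d=20, eps=0.1))
```

Because $a < 1$, the bound returned by `tail_bound` decays geometrically in $d$, which is the sense in which truncating the dictionary at a moderate $d$ loses little.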
Theorem 4. Let

$$
f^L(y_i) = \sum_{s=1}^{L} \sum_{h=1}^{2^s} \tilde{\pi}_{s,h}\, N_D(y_i; \mu_{s,h}, \Phi_{s,h}\Sigma_{s,h}\Phi_{s,h}^T + \sigma_s^2 I)
$$

denote the approximation at scale $L$, let $P(B) = \int_B f(y_i)\, dy$ and $P^L(B) = \int_B f^L(y_i)\, dy$, for all $B \subset$