Entropy and Convergence Rates: the test-function and the information-theoretic approach Willem Kruijer (Paris Dauphine), joint work with Aad van der Vaart (VU Amsterdam)
Rennes, 29 August 2008
Notation
\[
X = (X_1, \dots, X_n) \sim q(x) = \prod_{i=1}^n q_i(x_i).
\]
The convergence rate is the fastest sequence $\varepsilon_n \to 0$ such that
\[
\Pi\Bigl(p : \tfrac{1}{n}\sum_{i=1}^n d^2(p_i, q_i) \ge \varepsilon_n^2 \,\Big|\, X_1, \dots, X_n\Bigr) \xrightarrow{\;q\;} 0.
\]
Alternatively, the rate is said to be $\varepsilon_n$ if
\[
E_{X \sim q} \int \tfrac{1}{n}\sum_{i=1}^n d^2(p_i, q_i) \, d\Pi(p \mid X) \lesssim \varepsilon_n^2.
\]
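A small specialisation, not on the original slide, that may orient the reader: in the i.i.d. case the average of the per-observation distances collapses.

% i.i.d. specialisation (my reading of the general statement above):
% with q_i = q and p_i = p, the average (1/n) \sum_i d^2(p_i, q_i) equals d^2(p, q),
% so the rate condition becomes the familiar statement
\[
\Pi\bigl(p : d(p, q) \ge \varepsilon_n \mid X_1, \dots, X_n\bigr) \xrightarrow{\;q\;} 0.
\]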
Notation (2)
- For $0 < \beta < 1$,
\[
\rho_\beta(p_1, p_2) = \int p_1^\beta p_2^{1-\beta} \, d\mu
  = E_{X \sim p_2} \Bigl(\frac{p_1}{p_2}(X)\Bigr)^\beta \le 1
\]
  denotes the Hellinger affinity between $p_1$ and $p_2$.
- The Rényi divergence is defined as $D_\beta(p_2 \,|\, p_1) = -\log \rho_\beta(p_1, p_2)$.
- $D_{1/2}(p_2 \,|\, p_1) \ge h^2(p_1, p_2)$.
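A one-line check of the last bullet, added here; it assumes the normalisation $h^2(p_1,p_2) = \tfrac12\int(\sqrt{p_1}-\sqrt{p_2})^2\,d\mu = 1 - \rho_{1/2}(p_1,p_2)$:

% using -\log x \ge 1 - x for x \in (0,1]:
\[
D_{1/2}(p_2 \,|\, p_1) = -\log \rho_{1/2}(p_1, p_2)
  \ge 1 - \rho_{1/2}(p_1, p_2) = h^2(p_1, p_2).
\]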
Obtaining rates (1)

Let the packing number $D(\varepsilon, A, d)$ of a set $A \subset \mathcal{P}$ be defined as the cardinality of the largest $\varepsilon$-separated set of points in $A$. Define
\[
KL(q, \varepsilon_n) = \Bigl\{p : E_q \log \tfrac{q}{p} \le \varepsilon_n^2,\; E_q \bigl(\log \tfrac{q}{p}\bigr)^2 \le \varepsilon_n^2\Bigr\}.
\]

Theorem (Ghosal, Ghosh and van der Vaart, 2000)
Let $\varepsilon_n \to 0$ be a sequence such that $n\varepsilon_n^2 \to \infty$. Suppose that for a semi-metric $d$ and constants $k_1$ and $k_2$,
\[
\log D(\varepsilon_n, \{p : \varepsilon_n < d(p, q) < 2\varepsilon_n\}, d) \le k_1 n\varepsilon_n^2,
\qquad
\Pi_n\bigl(KL(q, \varepsilon_n)\bigr) \ge e^{-k_2 n\varepsilon_n^2}. \tag{1}
\]
Then for $M > 0$ sufficiently large,
\[
E_{q^n}\, \Pi\bigl(p : d(p, q) \le M\varepsilon_n \mid X_1, \dots, X_n\bigr) \longrightarrow 1. \tag{2}
\]
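A standard relation between packing and covering numbers, added here because later slides switch to covering and bracketing numbers; this is a textbook fact, not from the slide:

% for any set A and semi-metric d,
\[
D(2\varepsilon, A, d) \;\le\; N(\varepsilon, A, d) \;\le\; D(\varepsilon, A, d),
\]
% so conditions stated with packing numbers D and covering numbers N are
% interchangeable up to constants in the radius.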
Obtaining rates (2)

Theorem (Zhang, 2006)
For $\beta \in (0,1)$, $\gamma \ge 1$, and $\lambda_0 = \frac{\gamma-1}{\gamma-\beta} \in (0,1)$,
\[
E_X\, \frac{1}{n} \int D^{Re}_\beta(q \,|\, p) \, d\Pi(p \mid X)
  \le \bigl((\gamma-\beta)k_1 + \gamma(1+k_2)\bigr)\,\varepsilon_n^2, \tag{3}
\]
provided that, for a partition $\mathcal{P} = \cup_j P_j$,
\[
\log \Bigl[\sum_j \Pi(P_j)^{\lambda_0} r_n(P_j)\Bigr] \le k_1 n\varepsilon_n^2, \tag{4}
\]
\[
\Pi\bigl(\{p \in \mathcal{P} : D_{KL}(q \,\|\, p) < n\varepsilon_n^2\}\bigr) \ge e^{-k_2 n\varepsilon_n^2}, \tag{5}
\]
where the upper-bracketing radius of $P_j \subset \mathcal{P}$ is defined as
\[
r_n(P_j) = \int \sup_{p \in P_j} p(x) \, d\mu(x).
\]

How to verify (4)? How different are these theorems?
Heuristics: $L_1$-brackets

- Let $\mathcal{P} \subset H^\alpha$ be a class of densities on a compact interval.
- Estimation of $q \in \mathcal{P}$: can we obtain $\varepsilon_n = n^{-\alpha/(1+2\alpha)}$?
- Choose $P_j = [l_j, u_j]$, $L_1$-brackets of size $\varepsilon_n^2$, $j = 1, \dots, N(\varepsilon_n^2)$, where $N(\varepsilon_n^2) = N_{[\,]}(\varepsilon_n^2, \mathcal{P}, L_1)$.
- Then
\[
\log \Bigl[\sum_j \Pi(P_j)^{\lambda_0} r_n(P_j)\Bigr]
  \le \max_j \log r_n(P_j) + \log N_{[\,]}(\varepsilon_n^2, \mathcal{P}, L_1).
\]
- Indications that it is not bounded by a multiple of $n\varepsilon_n^2 = n^{1/(1+2\alpha)}$:
\[
N_{[\,]}(\varepsilon_n^2, \mathcal{P}, L_1) \ge N(\varepsilon_n^2, \mathcal{P}, L_1),
\qquad
\log N(\varepsilon_n^2, \mathcal{P}, L_1)
  \le \log N(\varepsilon_n^2, \mathcal{P}, \|\cdot\|_\infty)
  \approx \Bigl(\frac{1}{\varepsilon_n^2}\Bigr)^{1/\alpha}
  \approx n^{2/(1+2\alpha)}.
\]
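To spell out the rate arithmetic behind these exponents (my computation, following the slide's notation):

% with \varepsilon_n = n^{-\alpha/(1+2\alpha)}:
\[
n\varepsilon_n^2 = n \cdot n^{-2\alpha/(1+2\alpha)} = n^{1/(1+2\alpha)},
\qquad
\Bigl(\frac{1}{\varepsilon_n^2}\Bigr)^{1/\alpha} = n^{2/(1+2\alpha)},
\]
% so the entropy term is of order n^{2/(1+2\alpha)} \gg n^{1/(1+2\alpha)} = n\varepsilon_n^2,
% and the global entropy condition (4) cannot be verified with brackets of size \varepsilon_n^2.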
Introducing the distance between the $P_j$ and $q$

For $\delta \in (0, 1]$, redefine the upper-bracketing radius:
\[
r_{\delta,n}(P_j) = \int \Bigl(\sup_{p \in P_j} p(x)\Bigr)^\delta q^{1-\delta}(x) \, d\mu(x). \tag{6}
\]
The entropy condition
\[
\log \Bigl[\sum_j \Pi(P_j)^{\lambda_0} r_n(P_j)\Bigr] \le k_1 n\varepsilon_n^2, \tag{7}
\]
can be replaced by
\[
\log \Bigl[\sum_j \Pi(P_j)^{\delta\lambda_0} r_{\delta,n}(P_j)\Bigr] \le k_1 n\varepsilon_n^2. \tag{8}
\]
Further improvements are possible, but for regression models with Gaussian or exponential errors this is good enough.
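A heuristic gloss on why (8) is "localized", my addition, assuming $P_j$ is a tight bracket around a single density $p$ and taking $\delta = 1/2$ (anticipating the regression example below):

% for a tight bracket around p,
\[
r_{1/2,n}(P_j) \approx \int \sqrt{p}\,\sqrt{q}\, d\mu
  = \rho_{1/2}(p, q) = e^{-D_{1/2}(q\,|\,p)},
\]
% so cells far from q contribute only exponentially little to the sum in (8),
% whereas r_n(P_j) in (7) is of order 1 regardless of the distance to q.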
Proof of Theorem 2

From the information inequality it follows that if $\gamma > 1$, $0 < \beta < 1$, $\lambda_0 = (\gamma-1)/(\gamma-\beta) \in (0,1)$ and $X \sim q$,
\[
E_X\, \frac{1}{n} \int D^{Re}_\beta(q \,|\, p) \, d\Pi(p \mid X)
  \le -\frac{\gamma}{n} \log \int \exp\bigl(-D_{KL}(q \,\|\, p)\bigr)\, d\Pi(p)
   + \frac{(\gamma-\beta)\lambda_0}{n}\, E_X \log \int \Bigl(\frac{p}{q}(X)\Bigr)^{1/\lambda_0} d\Pi(p). \tag{9}
\]
Localizing the entropy term

\[
\frac{\lambda_0}{n}\, E_X \log \int \Bigl(\frac{p}{q}(X)\Bigr)^{1/\lambda_0} d\Pi(p)
\le \frac{1}{\delta n} \log E_X \Bigl(\int \Bigl(\frac{p}{q}(X)\Bigr)^{1/\lambda_0} d\Pi(p)\Bigr)^{\delta\lambda_0}
\]
\[
\le \frac{1}{\delta n} \log E_X \Bigl(\sum_j \Pi(P_j) \sup_{p \in P_j}\Bigl(\frac{p}{q}(X)\Bigr)^{1/\lambda_0}\Bigr)^{\delta\lambda_0}
\le \frac{1}{\delta n} \log E_X \sum_j \Pi(P_j)^{\delta\lambda_0} \Bigl(\sup_{p \in P_j}\frac{p}{q}(X)\Bigr)^{\delta}
\]
\[
= \frac{1}{\delta n} \log \sum_j \Pi(P_j)^{\delta\lambda_0}\, r_{\delta,n}(P_j).
\]
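For the record, the elementary inequalities I read behind the steps of this display (not spelled out on the slide):

% Step 1: Jensen's inequality applied to Z = \int (p/q)^{1/\lambda_0}(X) d\Pi(p), with s = \delta\lambda_0 > 0:
\[
E_X \log Z = \frac{1}{s} E_X \log Z^{s} \le \frac{1}{s} \log E_X Z^{s}.
\]
% Step 2: bound p by \sup_{p \in P_j} p on each cell of the partition.
% Step 3: sub-additivity of x \mapsto x^{s} for 0 < s = \delta\lambda_0 \le 1:
\[
\Bigl(\sum_j a_j\Bigr)^{s} \le \sum_j a_j^{s}, \qquad a_j \ge 0.
\]
% Step 4: E_X \bigl(\sup_{p \in P_j}(p/q)(X)\bigr)^{\delta}
%         = \int (\sup_{p \in P_j} p)^{\delta}\, q^{1-\delta}\, d\mu = r_{\delta,n}(P_j), by definition (6).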
Fixed design Gaussian regression

- Given covariates $x_1, \dots, x_n$, let $q(y) = \prod_{i=1}^n \phi_\sigma(y_i - f_0(x_i))$.
- $p = p_f = (\phi_\sigma(y_1 - f(x_1)), \dots, \phi_\sigma(y_n - f(x_n)))$, with $f \in \mathcal{F}$.
- $\|f_1 - f_2\|_n^2 := \frac{1}{2\sigma^2 n} \sum_{i=1}^n (f_1(x_i) - f_2(x_i))^2$.

Theorem (Ghosal and van der Vaart, 2007)
Assume that
\[
\Pi\bigl(\{f \in \mathcal{F} : \|f - f_0\|_n < \varepsilon_n\}\bigr) \ge e^{-k_1 n\varepsilon_n^2}, \tag{10}
\]
and that there exists a nondecreasing function $N(\varepsilon)$ such that
\[
\log N\bigl(\varepsilon_n, \{f : \varepsilon_n < \|f - f_0\|_n < 2\varepsilon_n\}, \|\cdot\|_n\bigr) \le N(\varepsilon_n) \le c_2 n\varepsilon_n^2. \tag{11}
\]
Then
\[
E_Y \int \|f - f_0\|_n^2 \, d\Pi(f \mid Y) \lesssim \varepsilon_n^2.
\]
Proof (application of Theorem 2)
The Kullback–Leibler and Rényi divergences between $p_f$ and $p_{f_0} = q$ are multiples of $\|f - f_0\|_n^2$. To verify the entropy condition, choose the following subsets: $\mathcal{F}_0 = \{f \in \mathcal{F} : \|f - f_0\|_n < \varepsilon_n\}$. For $k \ge 1$, cover $\mathcal{F}_k = \{f \in \mathcal{F} : k\varepsilon_n \le \|f - f_0\|_n < 2k\varepsilon_n\}$ with $\|\cdot\|_n$-balls of radius $Lk\varepsilon_n$, where $0 < L < \tfrac{1}{2}$. Removing the overlapping parts, we obtain a partition $\mathcal{F}_0, \mathcal{F}_{k,j}$, $k \ge 1$, $j = 1, \dots, N_k$.
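A short verification of the first claim (my computation, using the norm $\|\cdot\|_n$ as reconstructed on the previous slide):

% for products of Gaussian densities with common variance \sigma^2,
% q = \prod_i \phi_\sigma(\cdot - f_0(x_i)) and p_f = \prod_i \phi_\sigma(\cdot - f(x_i)):
\[
D_{KL}(q \,\|\, p_f) = \sum_{i=1}^n \frac{(f(x_i) - f_0(x_i))^2}{2\sigma^2} = n\,\|f - f_0\|_n^2,
\]
\[
\rho_\beta(p_f, q) = \exp\Bigl(-\beta(1-\beta) \sum_{i=1}^n \frac{(f(x_i) - f_0(x_i))^2}{2\sigma^2}\Bigr),
\quad\text{so}\quad
D^{Re}_\beta(q \,|\, p_f) = \beta(1-\beta)\, n\,\|f - f_0\|_n^2.
\]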
...

If $\mathcal{F}_{k,j}$ is contained in $B(f_{k,j}, Lk\varepsilon_n, \|\cdot\|_n)$, there exists a $C > 0$ such that (taking $\delta = 1/2$)
\[
r_{\delta,n}(\mathcal{F}_{k,j})
 = \int \sqrt{\prod_{i=1}^n \phi_\sigma(y_i - f_0(x_i))}\;
   \sup_{f \in \mathcal{F}_{k,j}} \sqrt{\prod_{i=1}^n \phi_\sigma(y_i - f(x_i))}\; dy
 \le \exp\{-C n k^2 \varepsilon_n^2\},
\]
see also Le Cam (1983, 1984 ??). Then
\[
\log \Bigl[\sum_{k \ge 0} \sum_{j=1}^{N_k} \Pi(\mathcal{F}_{k,j})^{\delta\lambda_0}\, r_{\delta,n}(\mathcal{F}_{k,j})\Bigr]
 \le \log \sum_{k \ge 0} \sum_{j=1}^{N_k} r_{\delta,n}(\mathcal{F}_{k,j})
 \le \log \sum_{k \ge 0} N(k\varepsilon_n) \sup_j r_{\delta,n}(\mathcal{F}_{k,j})
\]
\[
 \le \log N(\varepsilon_n) + \log \sum_{k \ge 0} \exp\{-C n k^2 \varepsilon_n^2\} \le k_1 n\varepsilon_n^2.
\]
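A remark on why the series term is negligible (my addition): since $k^2 \ge k$ for integers $k \ge 0$ and $n\varepsilon_n^2 \to \infty$,

% geometric series bound:
\[
\sum_{k \ge 0} e^{-C n k^2 \varepsilon_n^2}
 \le \sum_{k \ge 0} e^{-C n k \varepsilon_n^2}
 = \frac{1}{1 - e^{-C n \varepsilon_n^2}} \le 2
\]
% for n large enough, so its logarithm stays bounded and the whole bound is
% dominated by \log N(\varepsilon_n) \le c_2 n\varepsilon_n^2.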
Misspecification

If $q \notin \mathcal{P}$, define $\tilde{q} = \operatorname{argmin}_{p \in \mathcal{P}} D_{KL}(q \,\|\, p)$, and
\[
r^*_{\delta,n}(A) = \int \Bigl(\frac{\sup_{p \in A} p(x)}{\tilde{q}(x)}\Bigr)^\delta q(x) \, d\mu(x).
\]

Theorem
For $\beta \in (0,1)$ and $\gamma \ge 1$, let $\lambda_0 = \frac{\gamma-1}{\gamma-\beta} \in (0,1)$. Assume that
\[
\log \Bigl[\sum_j \Pi(P_j)^{\delta\lambda_0} r^*_{\delta,n}(P_j)\Bigr] \le k_1 n\varepsilon_n^2,
\]
\[
\Pi\Bigl(\Bigl\{p \in \mathcal{P} : \int q(x) \log \frac{\tilde{q}}{p}(x) \, d\mu(x) < n\varepsilon_n^2\Bigr\}\Bigr) \ge e^{-k_2 n\varepsilon_n^2},
\]
where $\{P_j\}$ is an arbitrary countable cover of $\mathcal{P}$. Then
\[
E_X\, \frac{1}{n} \int D^{Re}_\beta(\tilde{q} \,|\, p) \, d\Pi(p \mid X) \le \bigl((\gamma-\beta)k_1 + \gamma(1+k_2)\bigr)\,\varepsilon_n^2.
\]
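A consistency check (my remark): in the well-specified case the modified radius reduces to the earlier one.

% when q \in \mathcal{P}, the KL minimiser is \tilde{q} = q and
\[
r^*_{\delta,n}(A)
 = \int \Bigl(\frac{\sup_{p \in A} p(x)}{q(x)}\Bigr)^{\delta} q(x)\, d\mu(x)
 = \int \Bigl(\sup_{p \in A} p(x)\Bigr)^{\delta} q^{1-\delta}(x)\, d\mu(x)
 = r_{\delta,n}(A),
\]
% so this theorem reduces to the well-specified result of the earlier slides.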
Work in progress

A different discretization:
- Given $\{P_j\}_{j \in J}$, define $p_j(X) = \int_{P_j} p(X) \, d\Pi(p) / \Pi(P_j)$.
- Apply (9) to the model $\{p_j\}_{j \in J}$ with prior $\{\Pi(P_j)\}_{j \in J}$.
- The posterior is $\{\Pi(P_j \mid X)\}_{j \in J}$ (see the check below).
- Take the supremum and infimum over the convex hulls of the $P_j$'s to achieve sub-additivity.
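A quick check of the third bullet, assuming the $P_j$ form a partition (my computation):

% p_j is a probability density (a \Pi(\cdot \mid P_j)-mixture), and Bayes' rule in the
% discretized model {p_j} with prior weights \Pi(P_j) gives
\[
\frac{\Pi(P_j)\, p_j(X)}{\sum_{k \in J} \Pi(P_k)\, p_k(X)}
 = \frac{\int_{P_j} p(X)\, d\Pi(p)}{\int_{\mathcal{P}} p(X)\, d\Pi(p)}
 = \Pi(P_j \mid X),
\]
% i.e. the posterior of the discretized model coincides with the cell probabilities of the original posterior.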
Conclusion

- Often, optimal rates can only be obtained if the entropy condition is 'localized'.
- For comparison with the result of Ghosal, Ghosh and van der Vaart (2000), a different discretization is necessary.
References

[1] Kleijn, B.J.K. and van der Vaart, A.W. (2006). Misspecification in infinite-dimensional Bayesian statistics. Annals of Statistics 34(2), 837–877.
[2] Zhang, T. (2006). Information-theoretic upper and lower bounds for statistical estimation. IEEE Trans. Inform. Theory 52, 1307–1321.
[3] Zhang, T. (2006). From ε-entropy to KL-entropy: analysis of minimum information complexity density estimation. Annals of Statistics 34(5), 2180–2210.