Entropy and Convergence Rates: the test-function and the information-theoretic approach
Willem Kruijer (Paris Dauphine), joint work with Aad van der Vaart (VU Amsterdam)

Rennes, 29 August 2008

Notation

X = (X_1, \dots, X_n) \sim q(x) = \prod_{i=1}^n q_i(x_i).

The convergence rate is the fastest sequence \epsilon_n \to 0 such that

\Pi\Big(p : \frac{1}{n}\sum_{i=1}^n d^2(p_i, q_i) \ge \epsilon_n^2 \,\Big|\, X_1, \dots, X_n\Big) \overset{q}{\longrightarrow} 0.

Alternatively, the rate is said to be \epsilon_n if

E_{X \sim q} \int \frac{1}{n}\sum_{i=1}^n d^2(p_i, q_i)\, d\Pi(p \mid X) \lesssim \epsilon_n^2.

Notation (2)

- For 0 < \beta < 1,

  \rho_\beta(p_1, p_2) = \int p_1^\beta\, p_2^{1-\beta}\, d\mu = E_{X \sim p_2}\Big(\frac{p_1}{p_2}(X)\Big)^{\beta} \le 1

  denotes the Hellinger affinity between p_1 and p_2.

- The Rényi divergence is defined as D_\beta^{Re}(p_2 \mid p_1) = -\log \rho_\beta(p_1, p_2).

- D_{1/2}^{Re}(p_2 \mid p_1) \ge h^2(p_1, p_2).
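A one-line check of the last bullet, under the normalization h^2(p_1, p_2) = \tfrac12 \int (\sqrt{p_1} - \sqrt{p_2})^2\, d\mu = 1 - \rho_{1/2}(p_1, p_2) (an assumption here; other conventions only change a constant): since -\log x \ge 1 - x for x \in (0, 1],

D_{1/2}^{Re}(p_2 \mid p_1) = -\log \rho_{1/2}(p_1, p_2) \ge 1 - \rho_{1/2}(p_1, p_2) = h^2(p_1, p_2).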

Obtaining rates (1)

Let the packing number D(\epsilon, A, d) of a set A \subset P be defined as the cardinality of the largest \epsilon-separated set of points in A.

Define KL(q, \epsilon_n) = \{p : E_q \log\tfrac{q}{p} \le \epsilon_n^2,\; E_q \big(\log\tfrac{q}{p}\big)^2 \le \epsilon_n^2\}.

Theorem (Ghosal, Ghosh and van der Vaart, 2000)
Let \epsilon_n \to 0 be a sequence such that n\epsilon_n^2 \to \infty. Suppose that for a semi-metric d and constants k_1 and k_2,

\log D(\epsilon_n, \{p : \epsilon_n < d(p, q) < 2\epsilon_n\}, d) \le k_1 n\epsilon_n^2,
\Pi_n\big(KL(q, \epsilon_n)\big) \ge e^{-k_2 n\epsilon_n^2}.      (1)

Then for M > 0 sufficiently large,

E_q\, \Pi(p : d(p, q) \le M\epsilon_n \mid X_1, \dots, X_n) \longrightarrow 1.      (2)
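A hedged illustration of how (1) fixes the rate (the \alpha-Hölder setting is an assumption here, anticipating the heuristics slide): for an \alpha-Hölder ball of densities on a compact interval, the local entropy satisfies \log D(\epsilon, \cdot, d) \asymp \epsilon^{-1/\alpha}, so the entropy half of (1) forces

\epsilon_n^{-1/\alpha} \lesssim n\epsilon_n^2, \qquad \text{i.e.}\quad \epsilon_n \gtrsim n^{-\alpha/(1+2\alpha)},

and a prior with \Pi_n(KL(q, \epsilon_n)) \ge e^{-k_2 n\epsilon_n^2} at this \epsilon_n then yields the rate n^{-\alpha/(1+2\alpha)} in (2).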

Obtaining rates (2)

Theorem (Zhang, 2006)
For \beta \in (0, 1), \gamma \ge 1, and \lambda' = \frac{\gamma - 1}{\gamma - \beta} \in (0, 1),

\frac{1}{n}\, E_X \int D_\beta^{Re}(q \mid p)\, d\Pi(p \mid X) \le \big((\gamma - \beta)k_1 + \gamma(1 + k_2)\big)\, \epsilon_n^2,      (3)

provided that for a partition P = \cup_j P_j,

\log\Big[\sum_j \Pi(P_j)^{\lambda'}\, r_n(P_j)\Big] \le k_1 n\epsilon_n^2,      (4)

\Pi\big(\{p \in P : D_{KL}(q \,\|\, p) < n\epsilon_n^2\}\big) \ge e^{-k_2 n\epsilon_n^2},      (5)

where the upper-bracketing radius of P_j \subset P is defined as

r_n(P_j) = \int \sup_{p \in P_j} p(x)\, d\mu(x).

How to verify (4)? How different are these theorems?

Heuristics: L_1-brackets

- Let P \subset H^\alpha be a class of densities on a compact interval.
- Estimation of q \in P: can we obtain \epsilon_n = n^{-\alpha/(1+2\alpha)}?
- Choose P_j = [l_j, u_j], L_1-brackets of size \epsilon_n^2, j = 1, \dots, N(\epsilon_n^2), where N(\epsilon_n^2) = N_{[\,]}(\epsilon_n^2, P, L_1).
- \log\big[\sum_j \Pi(P_j)^{\lambda'}\, r_n(P_j)\big] \le \log \max_j r_n(P_j) + \log N_{[\,]}(\epsilon_n^2, P, L_1).
- Indications that this is not bounded by a multiple of n\epsilon_n^2 = n^{1/(1+2\alpha)}:

  N_{[\,]}(\epsilon_n^2, P, L_1) \ge N(\epsilon_n^2, P, L_1),
  \qquad
  \log N(\epsilon_n^2, P, \|\cdot\|_\infty) \approx \Big(\frac{1}{\epsilon_n^2}\Big)^{1/\alpha} \approx n^{2/(1+2\alpha)}.
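As a hedged check of the arithmetic in the last bullet (the Hölder-ball entropy order above is taken at face value): with \epsilon_n = n^{-\alpha/(1+2\alpha)},

\Big(\frac{1}{\epsilon_n^2}\Big)^{1/\alpha} = \big(n^{2\alpha/(1+2\alpha)}\big)^{1/\alpha} = n^{2/(1+2\alpha)}, \qquad n\epsilon_n^2 = n^{1 - 2\alpha/(1+2\alpha)} = n^{1/(1+2\alpha)},

so the entropy term exceeds the budget n\epsilon_n^2 by a factor n^{1/(1+2\alpha)}; this is the gap that the localized radius on the next slide is meant to repair.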

Introducing the distance between the P_j and q

For \delta \in (0, 1], redefine the upper-bracketing radius:

r_{\delta,n}(P_j) = \int \Big(\sup_{p \in P_j} p(x)\Big)^{\delta}\, q^{1-\delta}(x)\, d\mu(x).      (6)

The entropy condition

\log\Big[\sum_j \Pi(P_j)^{\lambda'}\, r_n(P_j)\Big] \le k_1 n\epsilon_n^2,      (7)

can be replaced by

\log\Big[\sum_j \Pi(P_j)^{\delta\lambda'}\, r_{\delta,n}(P_j)\Big] \le k_1 n\epsilon_n^2.      (8)

Further improvements are possible, but for regression models with Gaussian or exponential errors this is good enough.
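A hedged way to read (6): if the upper bracket u_j = \sup_{p \in P_j} p happened to be a probability density (an assumption; in general it only integrates to something \ge 1), then

r_{\delta,n}(P_j) = \int u_j^{\delta}\, q^{1-\delta}\, d\mu = \rho_\delta(u_j, q) = e^{-D_\delta^{Re}(q \mid u_j)},

so pieces P_j lying far from q contribute exponentially little to the sum in (8); this is exactly what the Gaussian regression proof below exploits.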

Proof of Theorem 2

From the information inequality it follows that if \gamma > 1, 0 < \beta < 1, \lambda' = \frac{\gamma - 1}{\gamma - \beta} \in (0, 1) and X \sim q,

\frac{1}{n}\, E_X \int D_\beta^{Re}(q \mid p)\, d\Pi(p \mid X)
  \le -\frac{\gamma}{n} \log \int \exp\big(-D_{KL}(q \,\|\, p)\big)\, d\Pi(p)
    + \frac{(\gamma - \beta)\lambda'}{n}\, E_X \log \int \Big(\frac{p}{q}(X)\Big)^{1/\lambda'} d\Pi(p).      (9)
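A hedged sketch of how the prior-mass condition (5) handles the first term on the right of (9): restricting the integral to \{p : D_{KL}(q \,\|\, p) < n\epsilon_n^2\},

-\frac{\gamma}{n} \log \int e^{-D_{KL}(q \,\|\, p)}\, d\Pi(p)
  \le -\frac{\gamma}{n} \log \Big( e^{-n\epsilon_n^2}\, \Pi\{D_{KL}(q \,\|\, p) < n\epsilon_n^2\} \Big)
  \le \frac{\gamma}{n}\big(n\epsilon_n^2 + k_2 n\epsilon_n^2\big) = \gamma(1 + k_2)\, \epsilon_n^2,

which is the \gamma(1 + k_2)\epsilon_n^2 part of the bound (3); the entropy term is handled on the next slide.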

Localizing the entropy term

\frac{\lambda'}{n}\, E_X \log \int \Big(\frac{p}{q}(X)\Big)^{1/\lambda'} d\Pi(p)
  \le \frac{1}{\delta n} \log E_X \Big(\int \Big(\frac{p}{q}(X)\Big)^{1/\lambda'} d\Pi(p)\Big)^{\delta\lambda'}      [Jensen]
  \le \frac{1}{\delta n} \log E_X \Big(\sum_j \Pi(P_j)\, \sup_{p \in P_j}\Big(\frac{p}{q}(X)\Big)^{1/\lambda'}\Big)^{\delta\lambda'}      [bound p by the sup over its piece]
  \le \frac{1}{\delta n} \log E_X \sum_j \Pi(P_j)^{\delta\lambda'} \Big(\sup_{p \in P_j}\frac{p}{q}(X)\Big)^{\delta}      [x \mapsto x^{\delta\lambda'} is sub-additive, \delta\lambda' \le 1]
  = \frac{1}{\delta n} \log \sum_j \Pi(P_j)^{\delta\lambda'}\, r_{\delta,n}(P_j).      [definition (6)]

Fixed design Gaussian regression

- Given covariates x_1, \dots, x_n, let q(y) = \prod_{i=1}^n \phi_\sigma(y_i - f_0(x_i)).
- p = p_f(y) = \prod_{i=1}^n \phi_\sigma(y_i - f(x_i)), with f \in F.
- \|f_1 - f_2\|_n^2 := \frac{1}{2\sigma^2 n} \sum_{i=1}^n (f_1(x_i) - f_2(x_i))^2.

Theorem (Ghosal and van der Vaart, 2007)
Assume that

\Pi\big(\{f \in F : \|f - f_0\|_n < \epsilon_n\}\big) \ge e^{-k_1 n\epsilon_n^2},      (10)

and that there exists a nondecreasing function N(\epsilon) such that

\log N\big(\epsilon_n, \{f : \epsilon_n < \|f - f_0\|_n < 2\epsilon_n\}, \|\cdot\|_n\big) \le N(\epsilon_n) \le c_2 n\epsilon_n^2.      (11)

Then

E_Y \int \|f - f_0\|_n^2\, d\Pi(f \mid Y) \lesssim \epsilon_n^2.

Proof (application of Theorem 2)

The Kullback-Leibler divergence and the Rényi divergence between p_f and p_{f_0} = q are multiples of n\|f - f_0\|_n^2. To verify the entropy condition, choose the following subsets: F_0 = \{f \in F : \|f - f_0\|_n < \epsilon_n\}; for k \ge 1, cover F_k = \{f \in F : k\epsilon_n \le \|f - f_0\|_n < 2k\epsilon_n\} with \|\cdot\|_n-balls of radius Lk\epsilon_n, where 0 < L < 1/2. Removing the overlapping parts, we find a partition F_0, F_{k,j}, k \ge 1, j = 1, \dots, N_k.
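A hedged check of the first claim (only standard Gaussian integrals are used): since p_{f_0} and p_f are products of N(f_0(x_i), \sigma^2) and N(f(x_i), \sigma^2) densities,

D_{KL}(p_{f_0} \,\|\, p_f) = \sum_{i=1}^n \frac{(f(x_i) - f_0(x_i))^2}{2\sigma^2} = n\|f - f_0\|_n^2,
\qquad
D_\beta^{Re}(p_{f_0} \mid p_f) = \beta(1 - \beta)\, n\|f - f_0\|_n^2,

so both the prior-mass condition and the conclusion of Theorem 2 translate directly into statements about \|f - f_0\|_n.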

... If F_{k,j} is contained in B(f_{k,j}, Lk\epsilon_n, \|\cdot\|_n), there exists a C > 0 such that

r_{\delta,n}(F_{k,j}) = \int \prod_{i=1}^n \phi_\sigma(y_i - f_0(x_i))^{1-\delta}\, \sup_{f \in F_{k,j}} \prod_{i=1}^n \phi_\sigma(y_i - f(x_i))^{\delta}\, dy \lesssim \exp\{-Cnk^2\epsilon_n^2\};

see also Le Cam (1983, 1984 ??).
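A hedged indication of where the exponent comes from (the supremum inside the integral is ignored here, so this is only the fixed-f computation): for a single f \in F_{k,j}, by the Gaussian Rényi computation above,

\int \prod_{i=1}^n \phi_\sigma(y_i - f_0(x_i))^{1-\delta}\, \phi_\sigma(y_i - f(x_i))^{\delta}\, dy
  = \exp\big\{-\delta(1-\delta)\, n\|f - f_0\|_n^2\big\} \le \exp\big\{-\delta(1-\delta)\, nk^2\epsilon_n^2\big\},

since \|f - f_0\|_n \ge k\epsilon_n on F_k; controlling the supremum over the ball B(f_{k,j}, Lk\epsilon_n, \|\cdot\|_n) only costs a constant in C, which is where L < 1/2 enters.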

\log\Big[\sum_{k \ge 0} \sum_{j=1}^{N_k} \Pi(F_{k,j})^{\delta\lambda'}\, r_{\delta,n}(F_{k,j})\Big]
  \le \log\Big[\sum_{k \ge 0} N(k\epsilon_n)\, \sup_j r_{\delta,n}(F_{k,j})\Big]
  \le \log N(\epsilon_n) + \log \sum_{k \ge 0} \exp\{-Cnk^2\epsilon_n^2\}
  \le k_1 n\epsilon_n^2.

Misspecification

If q \notin P, define \tilde{q} = \operatorname{argmin}_{p \in P} D_{KL}(q \,\|\, p), and

r^*_{\delta,n}(A) = \int \Big(\frac{\sup_{p \in A} p(x)}{\tilde{q}(x)}\Big)^{\delta} q(x)\, d\mu(x).

Theorem
For \beta \in (0, 1) and \gamma \ge 1, let \lambda' = \frac{\gamma - 1}{\gamma - \beta} \in (0, 1). Assume that

\log\Big[\sum_j \Pi(P_j)^{\delta\lambda'}\, r^*_{\delta,n}(P_j)\Big] \le k_1 n\epsilon_n^2,

\Pi\Big(\big\{p \in P : \int q(x) \log \frac{\tilde{q}}{p}(x)\, d\mu(x) < n\epsilon_n^2\big\}\Big) \ge e^{-k_2 n\epsilon_n^2},

where \{P_j\} is an arbitrary countable cover of P. Then

\frac{1}{n}\, E_X \int D_\beta^{Re}(\tilde{q} \mid p)\, d\Pi(p \mid X) \le \big((\gamma - \beta)k_1 + \gamma(1 + k_2)\big)\, \epsilon_n^2.
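A hedged sanity check, using only the definitions on this slide: when q \in P and hence \tilde{q} = q,

r^*_{\delta,n}(A) = \int \Big(\frac{\sup_{p \in A} p}{q}\Big)^{\delta} q\, d\mu = \int \Big(\sup_{p \in A} p\Big)^{\delta} q^{1-\delta}\, d\mu = r_{\delta,n}(A),

and the prior-mass condition becomes the Kullback-Leibler condition of Theorem 2, so the misspecified statement reduces to the well-specified one.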

Work in progress

A different discretization:

- Given \{P_j\}_{j \in J}, define p_j(X) = \int_{P_j} p(X)\, d\Pi(p) / \Pi(P_j).
- Apply (9) to the model \{p_j\}_{j \in J} with prior \{\Pi(P_j)\}_{j \in J}.
- The posterior is \{\Pi(P_j \mid X)\}_{j \in J}.
- Take the supremum and infimum over the convex hulls of the P_j's, to achieve sub-additivity.

Conclusion

- Often, optimal rates can only be obtained if the entropy condition is 'localized'.
- For comparison with the result of Ghosal, Ghosh and van der Vaart (2000), a different discretization is necessary.

References

[1] Kleijn, B.J.K. and van der Vaart, A.W. (2006). Misspecification in infinite-dimensional Bayesian statistics. Annals of Statistics 34(2), 837–877.
[2] Zhang, T. (2006). Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory 52, 1307–1321.
[3] Zhang, T. (2006). From ε-entropy to KL-entropy: analysis of minimum information complexity density estimation. Annals of Statistics 34(5), 2180–2210.