
Logarithmic Sobolev Inequalities and the Information Theory

Christos P. Kitsos, Nikolaos K. Tavoularis
Technological Educational Institute of Athens, Greece
e-mails: [email protected], [email protected]

June 18, 2007

Abstract

In this paper we present an overview of logarithmic Sobolev inequalities. These inequalities have become a subject of intense research activity during the past years, from analysis and geometry in finite and infinite dimension to probability and statistical mechanics, and of course many other developments and applications are expected. We have divided this paper into three parts. The first part includes the initial forms of Poincaré inequalities and logarithmic Sobolev inequalities for the Bernoulli and Gauss measure. In the second part the relationships between logarithmic Sobolev inequalities and the transportation of measures are considered, useful in statistics as well as in geometry. The last part is a modern reading of entropy in information theory and of several links between information theory and the Euclidean form of the logarithmic Sobolev inequality. The genesis and the introduction of these inequalities can be traced back to the pioneering work of Shannon and Stam.

Keywords: Poincaré and Logarithmic Sobolev Inequalities, Bernoulli and Gauss measure, transportation inequalities, Shannon entropy, Fisher's information, uncertainty principles.

1  Poincaré inequalities and Logarithmic Sobolev Inequalities for the Bernoulli and Gauss measure

Logarithmic Sobolev inequalities have a wide range of applications and have been extensively studied (see for example [14], [13], [19], [2], [3], [4], [5], [16], [1], [15], [6] and the references therein). The fundamental example of the Bernoulli and Gaussian distributions is the starting point to Poincaré inequalities as well as logarithmic Sobolev inequalities as they were defined by Gross [13] in the mid-seventies. We start by recalling the definitions of variance, energy and entropy, as well as the Poincaré inequality and the logarithmic Sobolev inequality involving these quantities. Afterwards, we present these inequalities for the Bernoulli measure and, using the tensorisation property of variance and entropy, we obtain the best constants for the Poincaré inequality and the logarithmic Sobolev inequality for the Gaussian measure on R^n.

Let (E, F, µ) be a probability space. For each µ-integrable function f : E → R, the mean E_µ(f) of f with respect to the measure µ is
\[ E_\mu(f) \;\stackrel{\mathrm{def}}{=}\; \int_E f \, d\mu. \qquad (1) \]

Variance— The variance Var_µ(f) of a µ-integrable function f is defined to be
\[ \mathrm{Var}_\mu(f) := E_\mu\big((f - E_\mu(f))^2\big) = E_\mu(f^2) - E_\mu(f)^2. \qquad (2) \]
The variational formula for the variance is
\[ \mathrm{Var}_\mu(f) = \inf_{\alpha \in \mathbb{R}} E_\mu\big((f - \alpha)^2\big). \]

The variance is always nonnegative; it is null if and only if f is µ-a.e. constant (almost everywhere with respect to µ), and infinite if and only if f is not square µ-integrable.

Entropy— The entropy Ent_µ(f) of a µ-integrable positive function f is defined to be
\[ \mathrm{Ent}_\mu(f) := E_\mu(f \log f) - E_\mu(f) \log E_\mu(f). \qquad (3) \]

By the inequality uv ≤ u log u − u + e^v, valid for u ≥ 0 and v ∈ R, the variational formula for the entropy is
\[ \mathrm{Ent}_\mu(f) = \sup\big\{ E_\mu(fg) :\; E_\mu(e^g) = 1 \big\}. \qquad (4) \]

The quantity Ent_µ(f) is finite if and only if f max(0, log f) is µ-integrable. Formula (4) is equivalent to the following inequality, called the entropy inequality:
\[ E_\mu(fg) \;\le\; \frac{\mathrm{Ent}_\mu(f)}{t} + \frac{E_\mu(f)}{t} \log E_\mu(e^{tg}), \]
where f is any positive square integrable function, g is a square integrable function and t > 0.

Energy— In the case E = R^n, the energy 𝓔_µ(f) of a locally integrable function f with ∇f ∈ L²(R^n, µ) is defined to be
\[ \mathcal{E}_\mu(f) := E_\mu\big( |\nabla f|^2 \big). \qquad (5) \]
(We write 𝓔_µ for the energy to distinguish it from the mean E_µ.)
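As a concrete illustration (an added sketch, not part of the original text), the following Python snippet computes the mean, variance and entropy of a positive function under a small finite probability measure and checks the entropy inequality for a few values of t; the measure and the test functions are arbitrary choices.

\begin{verbatim}
import numpy as np

# A finite probability measure mu on E = {0, 1, 2, 3} and two test functions.
mu = np.array([0.1, 0.2, 0.3, 0.4])          # weights of mu, summing to 1
f  = np.array([0.5, 1.0, 2.0, 4.0])          # a positive function on E
g  = np.array([0.3, -0.1, 0.2, -0.4])        # an arbitrary bounded function

def mean(h):            # E_mu(h)
    return np.sum(mu * h)

def variance(h):        # Var_mu(h) = E_mu(h^2) - E_mu(h)^2
    return mean(h**2) - mean(h)**2

def entropy(h):         # Ent_mu(h) = E_mu(h log h) - E_mu(h) log E_mu(h), h > 0
    return mean(h * np.log(h)) - mean(h) * np.log(mean(h))

print("Var_mu(f) =", variance(f))
print("Ent_mu(f) =", entropy(f))

# Entropy inequality: E_mu(f g) <= Ent_mu(f)/t + E_mu(f)/t * log E_mu(e^{t g}).
for t in (0.5, 1.0, 2.0):
    lhs = mean(f * g)
    rhs = entropy(f) / t + mean(f) / t * np.log(mean(np.exp(t * g)))
    assert lhs <= rhs + 1e-12
    print(f"t={t}: {lhs:.4f} <= {rhs:.4f}")
\end{verbatim}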

Hence the energy is nonnegative and invariant under translations.

Poincaré Inequality— The measure µ satisfies the Poincaré inequality on a certain function class C_P(E, µ) if there exists a constant c ∈ (0, +∞) such that
\[ \mathrm{Var}_\mu(f) \le c\, \mathcal{E}_\mu(f), \qquad (6) \]
for each function f ∈ C_P(E, µ). For example, we may take C_P(E, µ) to be the Sobolev space H^1(R^n, µ) = { f ∈ L²(R^n, µ) : 𝓔_µ(f) < ∞ }. The best constant c_P(µ) for the Poincaré inequality is defined to be
\[ c_P(\mu) := \left( \inf\left\{ \frac{\mathcal{E}_\mu(f)}{\mathrm{Var}_\mu(f)} :\; f \in C_P(E,\mu),\ f \text{ not } \mu\text{-a.e. constant} \right\} \right)^{-1}. \qquad (7) \]

Logarithmic Sobolev Inequality— The measure µ satisfies the logarithmic Sobolev inequality on a certain function class C_LS(E, µ) if there exists a constant c ∈ (0, +∞) such that
\[ \mathrm{Ent}_\mu(f^2) \le c\, \mathcal{E}_\mu(f), \qquad (8) \]
for each function f ∈ C_LS(E, µ). Here too we may take C_LS(E, µ) to be the Sobolev space H^1(R^n, µ). The best constant c_LS(µ) for the logarithmic Sobolev inequality is defined to be
\[ c_{LS}(\mu) := \left( \inf\left\{ \frac{\mathcal{E}_\mu(f)}{\mathrm{Ent}_\mu(f^2)} :\; f \in C_{LS}(E,\mu),\ f \text{ not } \mu\text{-a.e. constant} \right\} \right)^{-1}. \qquad (9) \]

Since Ent_µ(f²) = sup{ E_µ(f² g) : E_µ(e^g) = 1 }, we have c_LS(µ) = sup{ c(g) : E_µ(e^g) = 1 }, where
\[ c(g) := \sup\left\{ \frac{E_\mu(f^2 g)}{\mathcal{E}_\mu(f)} :\; f \in C_{LS}(E,\mu),\ \mathcal{E}_\mu(f) > 0 \right\}. \]

Bernoulli measure— If E = {0, 1}, the Bernoulli measure β_p on E with parameter p ∈ (0, 1) is the probability measure
\[ \beta_p := p\,\delta_0 + q\,\delta_1, \qquad (10) \]
where q = 1 − p and δ_α is the Dirac measure at α.

Energy for the Bernoulli measure— We find E_{β_p}(f) = p f(0) + q f(1), and the energy is defined to be 𝓔_{β_p}(f) = pq |f(0) − f(1)|². A simple calculation gives Var_{β_p}(f) = 𝓔_{β_p}(f), which leads to the Poincaré inequality for the Bernoulli measure:

Theorem 1.1 (Poincaré Inequality for the Bernoulli measure)
\[ \mathrm{Var}_{\beta_p}(f) \le \mathcal{E}_{\beta_p}(f), \qquad (11) \]
i.e. c_P(β_p) = 1.

Next we give the sharp logarithmic Sobolev inequality for the Bernoulli measure.

Theorem 1.2 (Logarithmic Sobolev Inequality for the Bernoulli measure) The best constant for the inequality
\[ \mathrm{Ent}_{\beta_p}(f^2) \le c_{LS}\, \mathcal{E}_{\beta_p}(f) \qquad (12) \]
is
\[ c_{LS} = \begin{cases} 2, & \text{if } p = \tfrac12,\\[4pt] \dfrac{\log(q) - \log(p)}{q - p}, & \text{otherwise.} \end{cases} \]

Remarks
(I) The constant c_LS is a convex function of the parameter p. It diverges to +∞ as p tends to 0 and has minimum 2 at p = 1/2.
(II) The constant depends only on the parameter p. Therefore, considering E = {a, b} and β_p := p δ_a + q δ_b, we obtain the same constant for the inequality. In this case the energy is 𝓔_{β_p}(f) = pq |f(b) − f(a)|².

Proof. By symmetry we may restrict to the case 0 < p ≤ 1/2. We use the variational formula c_LS(β_p) = sup{ c(g) : E_{β_p}(e^g) = 1 }, where
\[ c(g) := \sup\left\{ \frac{E_{\beta_p}(f^2 g)}{\mathcal{E}_{\beta_p}(f)} :\; f \in C_{LS}(E, \beta_p),\ \mathcal{E}_{\beta_p}(f) > 0 \right\}. \]
We set α = g(0) and b = g(1). We have E_{β_p}(e^g) = p e^α + q e^b = 1 = e^0, and hence αb < 0. Note that 𝓔_{β_p}(|f|) ≤ 𝓔_{β_p}(f), so we may suppose f ≥ 0 and, by homogeneity, normalize f(1) = 1; we set x = f(0), with x > 0. Then
\[ pq\, c(g) = \sup\left\{ \frac{p\alpha x^2 + qb}{(x-1)^2} :\; x > 0,\ x \ne 1 \right\}. \]
The supremum is attained at x = −qb/(pα), and it equals
\[ c(g) = \left( \frac{p}{b} + \frac{q}{\alpha} \right)^{-1}. \]
Therefore
\[ c_{LS}(p\delta_0 + q\delta_1) = \left( \inf\left\{ \frac{p}{b} + \frac{q}{\alpha} :\; p e^{\alpha} + q e^{b} = 1 \right\} \right)^{-1}. \]
We set t = e^α, s = e^b = (1 − pt)/q and define
\[ \varphi(t) \stackrel{\mathrm{def}}{=} \frac{p}{\log s} + \frac{q}{\log t}. \]
Since αb < 0,
\[ c_{LS}(p\delta_0 + q\delta_1) = \Big( \inf\{ \varphi(t) :\; t \in (0,1) \cup (1, 1/p) \} \Big)^{-1}. \]
We extend φ by setting φ(0) = −p/log q, φ(1) = 1/2 and φ((1/p)^−) = −q/log p. Remark that 1 is a local minimum if and only if p = 1/2. Also we have
\[ \varphi'(t) = \frac{p^2}{q s (\log s)^2} - \frac{q}{t (\log t)^2}. \]
A limited expansion at t = 1 gives
\[ \varphi(t) = \frac12 + \frac{p-q}{12 q}(t-1) + \frac{1 - 3p + 3p^2}{24 q^2}(t-1)^2 + O\big((t-1)^3\big). \]
Since log s · log t = αb < 0, φ'(t) = 0 if and only if the system
\[ \begin{cases} p\sqrt{t}\,\log t + q\sqrt{s}\,\log s = 0, \\ p(1-t) + q(1-s) = 0, \end{cases} \]
admits a non-null solution. Hence we obtain
\[ \frac{\sqrt{t}\,\log t}{1-t} = \frac{\sqrt{s}\,\log s}{1-s}. \qquad (13) \]
The function u(x) = \frac{\sqrt{x}\,\log x}{1-x} is zero at 0 and at +∞. Also u(x) = u(x^{-1}); it is decreasing on (0, 1) and increasing on (1, +∞). Then (13) gives
\[ u(t) = u\!\left( \frac{1-pt}{q} \right). \]
◦ If p ≠ 1/2, φ'(1) ≠ 0 and φ'(t) vanishes only at t = q/p. The function φ attains its minimum at t = q/p, with value (q − p)/(log q − log p), and hence
\[ c_{LS}(\beta_p) = \frac{\log(q) - \log(p)}{q - p}. \]
◦ If p = 1/2, φ'(1) = 0 and φ'(t) vanishes only at t = 1. The minimum of φ at t = 1 is 1/2, and hence c_LS(β_{1/2}) = 2. ∎
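The sharp constant of Theorem 1.2 can be checked numerically. The following Python sketch (an added illustration, not taken from the paper; it uses the harmless normalisation f(1) = 1) maximizes the ratio Ent_{β_p}(f²)/𝓔_{β_p}(f) over a grid of functions and compares it with the closed-form constant.

\begin{verbatim}
import numpy as np

def bernoulli_lsi_ratio(p, n_grid=400):
    """Largest ratio Ent_{beta_p}(f^2) / Energy_{beta_p}(f) over a grid of
    nonnegative functions f on {0,1}, normalised so that f(1) = 1."""
    q = 1.0 - p
    xs = np.linspace(1e-3, 20.0, n_grid)       # candidate values x = f(0)
    best = 0.0
    for x in xs:
        if np.isclose(x, 1.0):
            continue
        mean_f2 = p * x**2 + q                  # E(f^2)
        ent = p * x**2 * np.log(x**2) - mean_f2 * np.log(mean_f2)   # Ent(f^2)
        energy = p * q * (x - 1.0)**2           # pq |f(0) - f(1)|^2
        best = max(best, ent / energy)
    return best

for p in (0.5, 0.3, 0.1):
    q = 1.0 - p
    c_theory = 2.0 if p == 0.5 else (np.log(q) - np.log(p)) / (q - p)
    c_numeric = bernoulli_lsi_ratio(p)
    print(f"p={p}: theoretical c_LS={c_theory:.4f}, grid sup ratio={c_numeric:.4f}")
\end{verbatim}

The grid maximum approaches the theoretical constant from below as the grid is refined.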

Tensorisation of variance and entropy—

Proposition 1.1 Let (E_i, F_i, µ_i), i = 1, ..., n, be n probability spaces and let (E^n, F^n, µ^n) be the product probability space. Then
\[ \mathrm{Var}_{\mu^n}(f) \le \sum_{i=1}^n E_{\mu^n}\big( \mathrm{Var}_{\mu_i}(f) \big) \qquad (14) \]
and
\[ \mathrm{Ent}_{\mu^n}(f) \le \sum_{i=1}^n E_{\mu^n}\big( \mathrm{Ent}_{\mu_i}(f) \big). \qquad (15) \]

Proof. Let g be a function defined on E^n such that E_{µ^n}(e^g) = 1. Write
\[ g = \sum_{i=1}^n g_i = g_1 + \sum_{i=2}^n \log \frac{\int e^g \, d\mu_1(x_1) \cdots d\mu_{i-1}(x_{i-1})}{\int e^g \, d\mu_1(x_1) \cdots d\mu_i(x_i)}, \]
with g_1 = g − log ∫ e^g dµ_1(x_1). Trivially E_{µ_i}(e^{g_i}) = 1, and hence, using the variational formula (4) for µ_i,
\[ \sum_{i=1}^n E_{\mu_i}(f g_i) \le \sum_{i=1}^n \mathrm{Ent}_{\mu_i}(f). \qquad (16) \]
On the other side,
\[ E_{\mu^n}(fg) = \sum_{i=1}^n E_{\mu^n}(f g_i) = \sum_{i=1}^n E_{\mu^n}\big( E_{\mu_i}(f g_i) \big) \le \sum_{i=1}^n E_{\mu^n}\big( \mathrm{Ent}_{\mu_i}(f) \big). \qquad (17) \]
Combining (16) and (17) and taking the supremum over g in the variational formula (4) for µ^n, we obtain (15); the proof of (14) is similar. ∎
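A minimal numerical illustration of the entropy tensorisation (15), added here for concreteness: the product of two Bernoulli measures with arbitrary parameters and an arbitrary positive function on {0,1}².

\begin{verbatim}
import numpy as np

# Product measure mu^2 = beta_p (x) beta_r on {0,1}^2 and a positive test function f.
p, r = 0.3, 0.6
w = np.array([[(1-p)*(1-r), (1-p)*r],
              [p*(1-r),     p*r    ]])          # w[i,j] = mu^2({(i,j)})
f = np.array([[1.0, 3.0],
              [0.5, 2.0]])                      # f(i,j) > 0

def ent(weights, values):
    m = np.sum(weights * values)
    return np.sum(weights * values * np.log(values)) - m * np.log(m)

# Left-hand side: entropy under the product measure.
lhs = ent(w.ravel(), f.ravel())

# Right-hand side: sum over coordinates of E_{mu^2}( Ent_{mu_i}(f) ),
# where Ent_{mu_i} acts on the i-th coordinate with the other one frozen.
ent_coord1 = np.array([ent(np.array([1-p, p]), f[:, j]) for j in range(2)])  # function of j
ent_coord2 = np.array([ent(np.array([1-r, r]), f[i, :]) for i in range(2)])  # function of i
rhs = np.sum(np.array([1-r, r]) * ent_coord1) + np.sum(np.array([1-p, p]) * ent_coord2)

print(f"Ent(f) = {lhs:.5f} <= {rhs:.5f} = sum of coordinate entropies")
assert lhs <= rhs + 1e-12
\end{verbatim}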

Best constants for the Gaussian measure— Using the tensorisation property of variance and entropy, we obtain the Poincaré inequality as well as the logarithmic Sobolev inequality for the Gaussian measure from the above inequalities for the Bernoulli measure. Let E = R. The Gaussian probability measure is
\[ d\gamma = (2\pi)^{-1/2} e^{-|x|^2/2} \, dx. \qquad (18) \]

Theorem 1.3 (Poincaré Inequality for the Gaussian measure on R) For f ∈ H^1(R, γ):
\[ \mathrm{Var}_\gamma(f) \le \mathcal{E}_\gamma(f), \qquad (19) \]
i.e. c_P(γ) = 1.

Proof. We consider E_i = {0, 1} and the probability measure
\[ \mu_i = \beta_{1/2} = \tfrac12(\delta_0 + \delta_1). \]
For F : E^n → R, Proposition 1.1 gives
\[ \mathrm{Var}_{\mu^n}(F) \le \frac14 \sum_{i=1}^n E_{\mu^n}\big( |F_i(1) - F_i(0)|^2 \big). \]
We define the function Φ_n : {0,1}^n → R,
\[ \Phi_n(x) \stackrel{\mathrm{def}}{=} \frac{\sum_{i=1}^n x_i - n/2}{\sqrt{n/4}}, \]
where x = (x_1, x_2, ..., x_n). Applying the previous inequality to a function of the form f ∘ Φ_n, f ∈ H^1(R, γ), we have
\[ \mathrm{Var}_{\mu^n}(f \circ \Phi_n) \le \frac14 \sum_{i=1}^n E_{\mu^n}\big( |(f \circ \Phi_n)_i(1) - (f \circ \Phi_n)_i(0)|^2 \big). \qquad (20) \]
We denote X_i : x ∈ E^n ↦ x_i ∈ R. We have E_{β_{1/2}}(X_i) = 1/2 and Var_{β_{1/2}}(X_i) = 1/4. By the central limit theorem (see Appendix),
\[ \lim_{n \to \infty} \mathrm{Var}_{\mu^n}(f \circ \Phi_n) = \mathrm{Var}_\gamma(f). \]
The function f is in H^1(R, γ), so
\[ |(f \circ \Phi_n)_i(1) - (f \circ \Phi_n)_i(0)| \le \sqrt{\frac{4}{n}}\, |f'(\phi_{n,i}(x))| + \frac{4}{n}\, K, \qquad (21) \]
where
\[ \phi_{n,i}(x) := \frac{x_1 + \dots + x_{i-1} + x_{i+1} + \dots + x_n - n/2}{\sqrt{n/4}}. \]
Let ν^{(n)} denote the law of φ_{n,i}, which is independent of i. Therefore
\[ \sum_{i=1}^n E_{\mu^n}\big( |(f \circ \Phi_n)_i(1) - (f \circ \Phi_n)_i(0)|^2 \big) \le 4 E_{\nu^{(n)}}(|f'|^2) + \frac{16K}{\sqrt{n}}\, E_{\nu^{(n)}}(|f'|^2) + \frac{16K^2}{n} \le 4 E_{\nu^{(n)}}(|f'|^2) + \frac{16K^3}{\sqrt{n}} + \frac{16K^2}{n}. \]
Using again the central limit theorem,
\[ \limsup_{n \to \infty} \sum_{i=1}^n E_{\mu^n}\big( |(f \circ \Phi_n)_i(1) - (f \circ \Phi_n)_i(0)|^2 \big) \le 4\, E_\gamma\big( |f'|^2 \big). \qquad (22) \]

Finally, (20), (21) and (22) give Var_γ(f) ≤ 𝓔_γ(f). ∎

Theorem 1.4 (Logarithmic Sobolev Inequality for the Gaussian measure on R) For f ∈ H^1(R, γ):
\[ \mathrm{Ent}_\gamma(f^2) \le 2\, \mathcal{E}_\gamma(f), \qquad (23) \]
i.e. c_LS(γ) = 2.

Proof. The proof is a step-by-step transfer of the proof of Theorem 1.3, using the tensorisation property of entropy. ∎
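The two Gaussian inequalities can be illustrated by Monte Carlo simulation. The sketch below is an added illustration; the test function f(x) = e^{ax} is chosen because exponentials are extremal for the Gaussian logarithmic Sobolev inequality.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)      # samples from the Gaussian measure gamma on R

# Test function f(x) = exp(a*x); exponentials are extremal for the Gaussian LSI.
a  = 0.3
f  = np.exp(a * x)
df = a * f

var_f    = np.var(f)                    # Var_gamma(f)
energy_f = np.mean(df**2)               # E_gamma(|f'|^2)
f2       = f**2
ent_f2   = np.mean(f2 * np.log(f2)) - np.mean(f2) * np.log(np.mean(f2))

print(f"Poincare   : Var(f)   = {var_f:.4f} <= {energy_f:.4f} = Energy(f)    (c_P = 1)")
print(f"log-Sobolev: Ent(f^2) = {ent_f2:.4f} <= {2*energy_f:.4f} = 2 Energy(f) (c_LS = 2;"
      f" equality for exponentials, up to Monte Carlo error)")
\end{verbatim}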

Let E = R^n and consider the Gaussian probability measure on R^n,
\[ d\gamma^{\otimes n}(x) = (2\pi)^{-n/2} e^{-|x|^2/2} \, dx. \]
The next theorem gives the best constants for the Poincaré and logarithmic Sobolev inequalities for the Gaussian measure on R^n. Using the formula
\[ \mathcal{E}_{\gamma^{\otimes n}}(f) = E_{\gamma^{\otimes n}}\big( |\nabla f|^2 \big) = \sum_{i=1}^n E_{\gamma^{\otimes n}}\big( |\partial_i f|^2 \big) = \sum_{i=1}^n E_{\gamma^{\otimes n}}\big( E_\gamma(|\partial_i f|^2) \big) = \sum_{i=1}^n E_{\gamma^{\otimes n}}\big( \mathcal{E}_\gamma(f_i) \big), \]
where f_i denotes f seen as a function of the i-th coordinate alone, the other coordinates being fixed,

we can deduce from Theorems 1.3 and 1.4 the Poincaré and logarithmic Sobolev inequalities for the Gaussian measure on R^n.

Corollary 1.1 (Poincaré and Logarithmic Sobolev Inequality for the Gaussian measure on R^n) For f ∈ H^1(R^n, γ^{⊗n}):
\[ \mathrm{Var}_{\gamma^{\otimes n}}(f) \le \mathcal{E}_{\gamma^{\otimes n}}(f), \qquad (24) \]
i.e. c_P(γ^{⊗n}) = 1, and
\[ \mathrm{Ent}_{\gamma^{\otimes n}}(f^2) \le 2\, \mathcal{E}_{\gamma^{\otimes n}}(f), \qquad (25) \]
i.e. c_LS(γ^{⊗n}) = 2.

Poincaré and Logarithmic Sobolev Inequality for the general Gaussian law N(µ, Σ)— Now consider the multivariate normal distribution with mean µ and covariance matrix Σ, with density of the form
\[ N(\mu, \Sigma) = (2\pi)^{-n/2} |\det\Sigma|^{-1/2} \exp\Big[ -\tfrac12 \big\langle (x-\mu)^t, \Sigma^{-1}(x-\mu) \big\rangle \Big], \qquad (26) \]
with ⟨a, b⟩ the inner product of the vectors a and b. In this general case of the Gaussian measure, the Poincaré and logarithmic Sobolev inequalities read
\[ \mathrm{Var}_{N(\mu,\Sigma)}(f) \le E_{N(\mu,\Sigma)}\big( \langle \Sigma\nabla f, \nabla f \rangle \big) \quad\text{or}\quad \mathrm{Var}_{N(\mu,\Sigma)}(f) \le \sigma_\Sigma\, E_{N(\mu,\Sigma)}\big( |\nabla f|^2 \big) \qquad (27) \]
and
\[ \mathrm{Ent}_{N(\mu,\Sigma)}(f^2) \le 2\, E_{N(\mu,\Sigma)}\big( \langle \Sigma\nabla f, \nabla f \rangle \big) \quad\text{or}\quad \mathrm{Ent}_{N(\mu,\Sigma)}(f^2) \le 2\sigma_\Sigma\, E_{N(\mu,\Sigma)}\big( |\nabla f|^2 \big), \qquad (28) \]
respectively, where σ_Σ denotes the largest eigenvalue of Σ.
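A hedged Monte Carlo check of the first form of (27), added for illustration with a particular covariance matrix Σ and an arbitrary smooth test function:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu = np.array([1.0, -2.0])
X = rng.multivariate_normal(mu, Sigma, size=500_000)     # samples from N(mu, Sigma)

# Test function f(x) = sin(x_1) + 0.5 * x_2, with gradient (cos(x_1), 0.5).
f = np.sin(X[:, 0]) + 0.5 * X[:, 1]
grad = np.column_stack([np.cos(X[:, 0]), np.full(len(X), 0.5)])

var_f = np.var(f)
quad  = np.mean(np.einsum('ij,jk,ik->i', grad, Sigma, grad))   # E(<Sigma grad f, grad f>)
sigma_max = np.max(np.linalg.eigvalsh(Sigma))                  # sigma_Sigma

print(f"Var(f) = {var_f:.4f} <= {quad:.4f} = E(<Sigma grad f, grad f>)"
      f" <= {sigma_max * np.mean(np.sum(grad**2, axis=1)):.4f}"
      f" = sigma_Sigma * E(|grad f|^2)")
\end{verbatim}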

Poisson measure— The Poisson measure π_λ with parameter λ > 0 is the probability measure on N defined by
\[ \pi_\lambda \stackrel{\mathrm{def}}{=} e^{-\lambda} \sum_{n=0}^{\infty} \frac{\lambda^n}{n!}\, \delta_n. \qquad (29) \]

Theorem 1.5 (Modified Logarithmic Sobolev Inequality for the Bernoulli measure β_p with parameter 0 < p < 1) Let f be a positive function. Then for the Bernoulli measure β_p with parameter 0 < p < 1 the following modified logarithmic Sobolev inequality holds:
\[ \mathrm{Ent}_{\beta_p}(f) \le pq\, E_{\beta_p}(Df\, D\log f), \qquad (30) \]
where Df(x) := f(x+1) − f(x).

Proof. Let f : {0,1} → R be a strictly positive function and set α = f(0) and b = f(1). We define the function U of the parameter p by
\[ U(p) \stackrel{\mathrm{def}}{=} \mathrm{Ent}_{\beta_p}(f) - pq\, E_{\beta_p}(Df\, D\log f). \]
We have U'(0) = b(log α − log b) − (α − b). If α ≤ b,
\[ b(\log b - \log\alpha) = \int_{\alpha}^{b} \frac{b}{s}\, ds \ge b - \alpha, \]
while if α ≥ b,
\[ b(\log\alpha - \log b) = \int_{b}^{\alpha} \frac{b}{s}\, ds \le \alpha - b. \]

In both cases we have U'(0) ≤ 0. Also, U'(1) = α(log α − log b) − (α − b) ≥ 0. We have U ≤ 0 on [0, 1] if and only if U'(0) ≤ 0 ≤ U'(1), and the theorem is proved. ∎

Using the entropy tensorisation (Proposition 1.1) and the inequality (30), we obtain the following modified logarithmic Sobolev inequality for the probability measure π_λ:

Theorem 1.6 (Modified Logarithmic Sobolev Inequality for the Poisson measure π_λ with parameter λ)
\[ \mathrm{Ent}_{\pi_\lambda}(f) \le \lambda\, E_{\pi_\lambda}(Df\, D\log f), \qquad (31) \]
where Df(x) := f(x+1) − f(x).
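The modified inequality (31) can be checked numerically by truncating π_λ at a large integer, as in the following illustrative sketch (the test function and the truncation level are arbitrary choices):

\begin{verbatim}
import numpy as np
from math import lgamma

lam = 2.0
N = 60                                          # truncation; mass beyond N is negligible
k = np.arange(N + 1)
w = np.exp(-lam + k * np.log(lam) - np.array([lgamma(i + 1) for i in k]))  # pi_lambda

f = 1.0 + 0.5 * k + 0.1 * np.sin(k)             # an arbitrary positive function on N

mean_f = np.sum(w * f)
ent_f  = np.sum(w * f * np.log(f)) - mean_f * np.log(mean_f)      # Ent_{pi_lambda}(f)

Df    = np.append(f[1:] - f[:-1], 0.0)          # Df(x) = f(x+1) - f(x); negligible boundary
Dlogf = np.append(np.log(f[1:]) - np.log(f[:-1]), 0.0)
rhs   = lam * np.sum(w * Df * Dlogf)            # lambda * E(Df D log f)

print(f"Ent = {ent_f:.5f} <= {rhs:.5f} = lambda * E(Df D log f)")
assert ent_f <= rhs + 1e-9
\end{verbatim}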

2  Logarithmic Sobolev Inequalities and Transportation

In this section of the paper we present the distance T_k (k = 0 or k ≥ 1) introduced by Kantorovich on a set of probability measures of a metric space (E, d). This distance is named the Kantorovich distance or Wasserstein distance and is a main tool in many domains of mathematics, from statistics and probability to partial differential equations. The definition of the transportation inequalities T_k is also given.

Let (E, d) be a metric space and denote by P_0 the set of probability measures on (E, d) (and, for k ≥ 1, by P_k the set of probability measures with a finite moment of order k).

Definition 2.1 Let µ and ν be two elements of P_0 and let P(µ, ν) be the set of probability measures π on E × E such that, for all measurable functions f and g,
\[ \int \big( f(x) + g(y) \big)\, d\pi(x,y) = \int f\, d\mu + \int g\, d\nu. \]
The application T_0 on P_0 × P_0 is defined by
\[ T_0(\mu, \nu) = \inf\left\{ \int I_{x \ne y}(x,y)\, d\pi(x,y) ;\; \pi \in P(\mu,\nu) \right\}. \]
For k ≥ 1, if µ and ν are two elements of P_k, the application T_k on P_k × P_k is defined by
\[ T_k(\mu, \nu) = \left( \inf\left\{ \int \frac{d(x,y)^k}{k}\, d\pi(x,y) ;\; \pi \in P(\mu,\nu) \right\} \right)^{1/k}. \]

Proposition 2.1
— T_0 is a distance on the set P_0.
— Let k ≥ 1 and µ, ν ∈ P_k. Then

\[ T_k(\mu, \nu) = \left( \inf\left\{ E\!\left( \frac{d(X, Y)^k}{k} \right) \right\} \right)^{1/k}, \]
where the infimum is taken over the set of random variables X and Y with values in E admitting the laws µ and ν respectively.

Example 2.1 For every probability density f with respect to the Bernoulli measure β_p we have
\[ T_0(f\,d\beta_p, d\beta_p) = p|f(0) - 1|, \qquad T_k(f\,d\beta_p, d\beta_p) = \left( \frac{p|f(0)-1|}{k} \right)^{1/k} \ \text{for } k \ge 1. \]
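The identity of Example 2.1 for T_0 can be verified through Proposition 2.2 below (T_0 equals half the total variation distance); the following small sketch, added for illustration with an arbitrary admissible density f, does exactly that:

\begin{verbatim}
import numpy as np

p = 0.3
q = 1.0 - p
beta = np.array([p, q])                  # beta_p = p*delta_0 + q*delta_1

# A probability density f with respect to beta_p: sum_x f(x) * beta_p(x) = 1.
f0 = 1.4
f1 = (1.0 - p * f0) / q                  # forces the normalisation
f = np.array([f0, f1])
assert np.isclose(np.sum(f * beta), 1.0)

# T_0(f d beta_p, d beta_p) via Proposition 2.2: half of the total variation distance.
tv = np.sum(np.abs(f * beta - beta))     # ||f d beta_p - d beta_p||_VT = int |f-1| d beta_p
T0 = 0.5 * tv

print(f"T_0 = {T0:.6f},  p*|f(0)-1| = {p * abs(f0 - 1.0):.6f}")
\end{verbatim}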

Definition 2.2 Let µ be a (signed) measure on (E, F). We set
\[ \|\mu\|_{VT} = \sup\left\{ \int f\, d\mu \right\}, \]
where the supremum runs over all measurable functions f with |f| ≤ 1.

Proposition 2.2 Let µ and ν be two probability measures. Then
\[ T_0(\mu, \nu) = \frac12 \|\mu - \nu\|_{VT}. \qquad (32) \]

Transportation inequalities— As we have seen in the previous section, the Poincaré and logarithmic Sobolev inequalities express the domination of the variance and of the entropy, respectively, by the energy. A transportation inequality is the domination of the distance T_k by the square root of the entropy. The relationship between T_2 transportation inequalities and logarithmic Sobolev inequalities is the subject of the next theorems.

Definition 2.3 Let µ ∈ P_k (k = 0 or k ≥ 1) and c > 0. We say that the measure µ satisfies a transportation inequality T_k with constant c if for each probability density f
\[ T_k(f\,d\mu, d\mu) \le \sqrt{ c \int f \log f\, d\mu } = \sqrt{ c\, \mathrm{Ent}_\mu(f) }. \qquad (33) \]

The next theorem, the Csiszár-Kullback theorem, gives the constant c for the T_0 transportation inequality (see also (32)) and has applications in many areas of mathematics.

Csiszár-Kullback Theorem—

Theorem 2.1 Let µ be a probability measure. Then
\[ \|f\,d\mu - d\mu\|_{VT} \le \sqrt{ 2\, \mathrm{Ent}_\mu(f) }. \qquad (34) \]

Proof. For u ≥ 0, the Pinsker inequality gives
\[ 3(u-1)^2 \le (2u+4)(u \log u - u + 1). \]
Using this inequality and the Cauchy-Schwarz inequality we obtain
\[ \int |f-1|\, d\mu \le \int \sqrt{\frac{2f+4}{3}}\, \sqrt{f \log f - f + 1}\, d\mu \le \sqrt{ \int \frac{2f+4}{3}\, d\mu }\; \sqrt{ \int (f \log f - f + 1)\, d\mu } \le \sqrt{ 2 \int f \log f\, d\mu } = \sqrt{ 2\, \mathrm{Ent}_\mu(f) }. \]
Taking into account that ‖f dµ − dµ‖_VT = ∫ |f − 1| dµ, the result follows. ∎
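An elementary numerical check of Theorem 2.1 on a finite space (reference measure and density generated at random, purely for illustration):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n = 6
mu = rng.random(n); mu /= mu.sum()          # a reference probability measure
f  = rng.random(n) + 0.1
f /= np.sum(f * mu)                          # make f a probability density w.r.t. mu

tv  = np.sum(np.abs(f - 1.0) * mu)           # ||f dmu - dmu||_VT
ent = np.sum(f * np.log(f) * mu)             # Ent_mu(f), since E_mu(f) = 1

print(f"||f dmu - dmu||_VT = {tv:.5f} <= {np.sqrt(2*ent):.5f} = sqrt(2 Ent_mu(f))")
assert tv <= np.sqrt(2 * ent) + 1e-12
\end{verbatim}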

The Otto-Villani theorem asserts that the logarithmic Sobolev inequality implies the transportation inequality T_2:

Otto-Villani Theorem—

Theorem 2.2 Let Φ : R^n → R be such that e^{−Φ} is integrable. We define the probability measure
\[ \mu(dx) = Z^{-1} e^{-\Phi}\, dx, \qquad (35) \]
where Z = ∫ e^{−Φ} dx. We suppose that the inequality
\[ \mathrm{Ent}_\mu(f^2) \le 2c\, E_\mu\big( \|\nabla f\|^2 \big) \qquad (36) \]
holds for all f ∈ C^1. Then µ satisfies the transportation inequality T_2 with constant c.

The next theorem asserts that the transportation inequality T_2 implies the Poincaré inequality:

Theorem 2.3 Let µ be the measure defined in Theorem 2.2. We suppose that µ satisfies the transportation inequality T_2 with constant c. Then the Poincaré inequality
\[ \mathrm{Var}_\mu(f) \le c \int |\nabla f|^2\, d\mu \qquad (37) \]
holds for every f in the space C_c^1 of continuously differentiable functions with compact support.

For more details about the above issues, see [1] and the references therein.

3  Entropy Inequalities and Information Theory

This part of the paper gives the relation of Fisher's information (FI) measure to the entropy measure (EM), as well as a theoretical framework under the Euclidean form of the logarithmic Sobolev inequalities (LSI). It is known that there are several links between information theory and the Euclidean form of the logarithmic Sobolev inequalities. Shannon [17] and Stam [18], with their pioneering work, contributed to the introduction of these inequalities. Logarithmic Sobolev inequalities can play an important role in developing bounds for information measures. Fisher's parametric information [7] plays an important role in various statistical methods. The linear and non-linear optimal experimental design theory is a typical example, Ford et al. [8], Kitsos et al. [9], while the calibration problem has been tackled as an optimal experimental design, through an asymptotic approach to the (average per observation) information matrix, Kitsos and Müller [11]. When sequential procedures are adopted for nonlinear problems, Kitsos [10] proved that one can introduce sequences which converge in mean square (m.s.) to that normal distribution in which the variance, related to Fisher's information, is minimized. Moreover, as far as the problem of carcinogenesis is concerned, for the class of multistage models there is at least one iterative scheme which converges in m.s. to the p-th percentile of the distribution and minimizes the entropy of the limiting design, Kitsos [12].

Several measures of information have been proposed in the literature with various properties, which lead to their wide applications. A convenient classification is to distinguish them as non-parametric, parametric and entropy-type measures of information.

Let P and Q be probability measures on the same space X and p, q their densities with respect to a common measure µ on that space. Then the Kullback-Leibler Information (KLI) on X is defined as
\[ \mathrm{KLI}(P, Q) = \int p(x) \log \frac{p(x)}{q(x)}\, d\mu(x). \qquad (38) \]
The parametric measures of information measure the amount of information about an unknown parameter θ supplied by the data, and are functions of θ. In the case of parametric families, letting θ and δ be two points of the parameter space,
\[ \mathrm{KLI}(\theta, \delta) = E_\theta\left\{ \log \frac{f_{X|\Theta}(x|\theta)}{f_{X|\Theta}(x|\delta)} \right\}. \qquad (39) \]

The Fisher Information (FI) matrix of the measurement X relative to the parameter vector θ is defined as
\[ \mathrm{FI}(X; \theta) = \mathrm{COV}\left\{ \frac{\partial \log f_\theta(X)}{\partial \theta} \right\} = \int \frac{1}{f_\theta(x)} \left( \frac{\partial f_\theta(x)}{\partial \theta} \right) \cdot \left( \frac{\partial f_\theta(x)}{\partial \theta} \right)^t dx, \qquad (40) \]
where θ = (θ_1, ..., θ_m) is the parameter vector from a (compact) subset of R^m. The set {f_θ(x)} is a family of densities of X parameterized by θ, ∂/∂θ or ∇_θ denotes the gradient (i.e. a column vector of partial derivatives) with respect to the parameters θ_1, ..., θ_m, COV denotes the m × m covariance matrix calculated with respect to the distribution of X, and α^t denotes the transpose of the vector α. Here X may either be a single measurement or a vector of n measurements. Notice that Fisher's information matrix also equals
\[ \mathrm{FI}(\theta) = E_\theta\big( \nabla_\theta \log f_\theta(X) \cdot \nabla_\theta \log f_\theta(X)^t \big) = E_\theta\big( |\nabla_\theta \log f_\theta(X)|^2 \big). \qquad (41) \]
If θ is univariate, trivially
\[ \mathrm{FI}(\theta) = E_\theta\left[ \frac{\partial}{\partial\theta} \log f_\theta(X) \right]^2. \qquad (42) \]
In the context of information-theoretic inequalities there appears a special form of the FI matrix, namely the FI of a random vector with respect to a translation parameter,
\[ \mathrm{FI}(X) = \mathrm{FI}(\theta + X; \theta) = \mathrm{COV}\left\{ \frac{\partial \log f(X)}{\partial X} \right\} = \int \frac{1}{f(x)} \left( \frac{\partial f(x)}{\partial x} \right) \cdot \left( \frac{\partial f(x)}{\partial x} \right)^t dx, \qquad (43) \]

where f(x) is the density function of the vector X (f(x) does not depend on θ), and FI(X) is a square matrix whose dimension equals that of X. Unlike the general case (40), this form of the FI is a function of the density of the vector alone, and not of its parameterization. Also, Fisher's information of X is defined as J(X) = trace{FI(X)}.

The advantages of the KLI over the FI are that it is not affected by changes in parameterization, it can be applied even if the distributions do not belong to a parametric family, and no smoothness conditions are needed as for the FI. There is at least one connection between the KLI and the FI.

Measures of entropy express the amount of information contained in a distribution, that is, the amount of uncertainty associated with the outcome of an experiment. The classical measures of this type are Shannon's and Rényi's measures.

Shannon entropy of a finite discrete random variable— Consider a discrete random variable X = (x_1, x_2, ..., x_n) with probabilities p_1, p_2, ..., p_n. A real positive number for each distribution (p_1, p_2, ..., p_n) that relates to the amount of uncertainty about an event is the information entropy H^{(n)}. Shannon required that H^{(n)}(X) = H^{(n)}(p_1, ..., p_n) satisfy the following properties:
(1) H^{(n)} is continuous in each variable p_i,
(2) H^{(n)}(1/n, ..., 1/n) < H^{(n+1)}(1/(n+1), ..., 1/(n+1)),
(3) for each (b_1, ..., b_k) such that b_1 + ... + b_k = n,
\[ H^{(n)}\!\left( \frac1n, ..., \frac1n \right) = H^{(k)}\!\left( \frac{b_1}{n}, ..., \frac{b_k}{n} \right) + \sum_{i=1}^k \frac{b_i}{n}\, H^{(b_i)}\!\left( \frac{1}{b_i}, ..., \frac{1}{b_i} \right). \]

The function
\[ H_b(p_1, ..., p_n) = -\sum_{i=1}^n p_i \log_b p_i = \sum_{i=1}^n p_i \log_b \frac{1}{p_i} \]
satisfies the above three properties, where log_b is the logarithm of base b > 0 and 0 log_b 0 := 0. The function H_b is called the entropy of base b. It is null for the Dirac measure and maximal for the uniform law (by convexity), with value log_b n.

We denote H_e(p_1, ..., p_n) = −Ent_µ(p), where µ is the counting measure on the set {x_1, ..., x_n} and p is the function defined by p(x_i) = p_i.

Shannon entropy of a continuous random variable— Let X be a continuous random variable with density f with respect to Lebesgue measure on R^n. The entropy of base b is defined by
\[ H_b(X) \stackrel{\mathrm{def}}{=} -\int_{\mathbb{R}^n} f \log_b f\, dx = -E(\log_b f(X)). \qquad (44) \]
So H_b(X) = H(X)/log b, since log_b f(X) = log f(X)/log b. Also we have
\[ H(X) = -\mathrm{Ent}_{dx}(f) = -E_{L(X)}(\log f), \qquad (45) \]
where L(X) denotes the law of X. It is clear that H(X) depends only on the law of X; hence we can consider the entropy H(X) of a probability law µ on R^n.

Entropy properties—
(I) H(c + X) = H(X),
(II) H(αX) = H(X) + n log α, where α is a real positive number.

Conditional entropy— Let X and Y be two random vectors. The conditional entropy of X given Y is defined by
\[ H(X|Y) \stackrel{\mathrm{def}}{=} H((X, Y)) - H(Y). \qquad (46) \]

Gaussian maximizes entropy—

Proposition 3.1 Let the random vector X have zero mean and covariance K. Then
\[ H(X) \le \frac12 \log\big( (2\pi e)^n |\det K| \big), \qquad (47) \]
with equality if and only if X ∼ N(0, K).

Proof. We first give the definition of the Kullback-Leibler information KLI(f, g), or relative entropy Ent(f|g), of two densities f and g with respect to Lebesgue measure on R^n:
\[ \mathrm{KLI}(f, g) = \int_{\mathbb{R}^n} \frac{f(x)}{g(x)} \log \frac{f(x)}{g(x)}\, g(x)\, dx = \int_{\mathbb{R}^n} f(x) \log \frac{f(x)}{g(x)}\, dx. \]
The function x log x is strictly convex and hence KLI(f, g) ≥ 0, with equality if and only if f(x) = g(x) dx-almost everywhere. Let γ_K be the density of the Gaussian N(0, K). Then
\[ 0 \le \mathrm{KLI}(f|\gamma_K) = -H(X) - \int_{\mathbb{R}^n} f \log \gamma_K. \qquad (48) \]
But H(N(0, K)) = \frac12 \log((2\pi e)^n |\det K|), and X and N(0, K) have the same covariance matrix K, which gives
\[ \int f\, x_i x_j\, dx = K_{ij} = \int \gamma_K\, x_i x_j\, dx, \]
and thus ∫ f A = ∫ γ_K A for every quadratic form A. Since log γ_K is a quadratic form plus a constant, we have
\[ \int f \log \gamma_K\, dx = \int \gamma_K \log \gamma_K\, dx. \qquad (49) \]
Now (48) and (49) give the result. ∎
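Proposition 3.1 can be illustrated in closed form in dimension one by comparing a uniform law with the Gaussian of the same variance (the interval half-width a below is an arbitrary choice):

\begin{verbatim}
import numpy as np

# Differential entropies (in nats) of two laws with the same variance K = a^2/3:
# the uniform law on [-a, a] and the Gaussian N(0, a^2/3).
a = 2.0
K = a**2 / 3.0

H_uniform  = np.log(2 * a)                       # entropy of U[-a, a]
H_gaussian = 0.5 * np.log(2 * np.pi * np.e * K)  # = (1/2) log((2 pi e)^n |det K|), n = 1

print(f"H(uniform)  = {H_uniform:.4f}")
print(f"H(gaussian) = {H_gaussian:.4f}  (the maximum allowed by Proposition 3.1)")
assert H_uniform <= H_gaussian
\end{verbatim}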


Theorem 3.1 Let X_1, ..., X_n be n random vectors with densities with respect to Lebesgue measure. Then
\[ H((X_1, ..., X_n)) \le \sum_{i=1}^n H(X_i), \qquad (50) \]
with equality if and only if the vectors are independent.

Proof. Let X and Y be two random vectors and recall that the conditional entropy of X given Y is H(X|Y) = H((X, Y)) − H(Y). If f(x, y), f_1(x), f_2(y) are the densities of (X, Y), X and Y respectively, then
\[ H(X) - H(X|Y) = \mathrm{Ent}\big( f(x,y)\,\big|\, f_1(x) f_2(y) \big) \ge 0, \]
and hence H(X|Y) ≤ H(X), with equality if and only if X and Y are independent. On the other side,
\[ H((X_1, ..., X_n)) = \sum_{i=1}^n H\big( X_i \,\big|\, (X_{i-1}, ..., X_1) \big) \le \sum_{i=1}^n H(X_i). \quad\blacksquare \]

Mutual information— The mutual information of two random vectors is the quantity
\[ I(X, Y) \stackrel{\mathrm{def}}{=} H((X, Y)) - H(X|Y) - H(Y|X) = I(Y, X). \]
We have
\[ I(X, Y) = H(X) + H(Y) - H((X, Y)). \qquad (51) \]
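For a finite joint distribution the identities (46) and (51) take the same form with discrete Shannon entropies; the following added sketch computes them for an arbitrary 2 × 2 joint law:

\begin{verbatim}
import numpy as np

# A joint distribution of two dependent binary variables (X, Y).
P = np.array([[0.30, 0.10],
              [0.15, 0.45]])              # P[x, y]

def H(p):                                  # Shannon entropy of a distribution (nats)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_XY = H(P.ravel())
H_X  = H(P.sum(axis=1))
H_Y  = H(P.sum(axis=0))

H_X_given_Y = H_XY - H_Y                   # identity (46)
I_XY = H_X + H_Y - H_XY                    # identity (51)

print(f"H(X,Y)={H_XY:.4f}  H(X)={H_X:.4f}  H(Y)={H_Y:.4f}")
print(f"H(X|Y)={H_X_given_Y:.4f}  I(X,Y)={I_XY:.4f} >= 0")
\end{verbatim}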

Reformulation of the logarithmic Sobolev inequality with respect to the Gauss measure— For t > 0, let N(0, tI_n) be the Gaussian law of covariance tI_n. The logarithmic Sobolev inequality (see (28)) gives
\[ \mathrm{Ent}_{N(0, tI_n)}(f^2) \le 2t\, E_{N(0, tI_n)}\big( |\nabla f|^2 \big), \qquad (52) \]
where f : R^n → R is a differentiable function. Choosing f(x) = e^{\pi|x|^2/2} g(x) in the above inequality we find, for g with ∫_{R^n} |g|² dx = 1,
\[ \int_{\mathbb{R}^n} |g(x)|^2 \log |g(x)|^2\, dx \le \frac{n}{2} \log\left[ \frac{2}{e\pi n} \int_{\mathbb{R}^n} |\nabla g(x)|^2\, dx \right]. \]

Exponential Shannon entropy— The exponential Shannon entropy of a random vector X with density f with respect to Lebesgue measure is defined by
\[ \mathrm{N}(X) \stackrel{\mathrm{def}}{=} \frac{1}{2\pi e}\, e^{\frac{2}{n} H(X)}. \qquad (53) \]
We find N(N(0, K)) = (det K)^{1/n}, and Proposition 3.1 gives N(X) ≤ (det K)^{1/n}.

Fisher information— The Fisher information of a random vector X with density f with respect to Lebesgue measure is defined by
\[ J(X) \stackrel{\mathrm{def}}{=} 4 \int \big| \nabla \sqrt{f} \big|^2\, dx. \qquad (54) \]
This definition exhibits the Fisher information as an energy. Alternatively we have the following expressions for the Fisher information:
\[ J(X) = E_{L(X)}\big( |\nabla \log f|^2 \big) = \int |\nabla \log f|^2 f\, dx = \int |\nabla f|^2 f^{-1}\, dx = \int \nabla f \cdot \nabla \log f\, dx, \qquad (55) \]
where L(X) denotes the law of X.

After these, the logarithmic Sobolev inequality (52), applied to g = √f, can be written
\[ \mathrm{N}(X)\, J(X) \ge n. \qquad (56) \]
Remark that for each α > 0, J(αX) = α^{−2} J(X) and N(αX) = α² N(X), which shows that (56) is invariant under dilations. More generally, if A is an invertible matrix, then N(AX) = |det A|^{2/n} N(X).
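Inequality (56) can be checked in closed form for simple one-dimensional laws. In the sketch below (added for illustration) the Gaussian attains equality, while the Laplace law, whose entropy and Fisher information are the standard expressions H = 1 + log(2b) and J = 1/b², gives N(X)J(X) = 2e/π > 1:

\begin{verbatim}
import numpy as np

def NJ_gaussian(sigma):
    # Gaussian N(0, sigma^2): H = 0.5*log(2*pi*e*sigma^2), J = 1/sigma^2.
    H = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
    N = np.exp(2 * H) / (2 * np.pi * np.e)      # entropy power, here equal to sigma^2
    J = 1.0 / sigma**2
    return N * J

def NJ_laplace(b):
    # Laplace density (1/(2b)) exp(-|x|/b): H = 1 + log(2b), J = 1/b^2.
    H = 1.0 + np.log(2 * b)
    N = np.exp(2 * H) / (2 * np.pi * np.e)
    J = 1.0 / b**2
    return N * J

print("Gaussian: N(X) J(X) =", NJ_gaussian(1.7), "(equality with n = 1)")
print("Laplace : N(X) J(X) =", NJ_laplace(0.8), ">= 1")
\end{verbatim}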

Blachman-Stam Inequality—

Theorem 3.2 Let λ ∈ [0, 1] be a real number and X, Y two independent random vectors of R^n. Then
\[ \lambda J(X) + (1-\lambda) J(Y) \ge J\big( \sqrt{\lambda}\, X + \sqrt{1-\lambda}\, Y \big). \qquad (57) \]

Inequality for the exponential Shannon entropy—

Theorem 3.3 For all independent random vectors X and Y of R^n with densities with respect to Lebesgue measure, we have
\[ \mathrm{N}(X+Y) \ge \mathrm{N}(X) + \mathrm{N}(Y). \qquad (58) \]
Equality holds if and only if the two random vectors are Gaussian with proportional covariance matrices.

Debruijn Identity—

Theorem 3.4 Let X be a random vector of R^n and Z a standard Gaussian vector of R^n independent of X. Then
\[ \frac{\partial}{\partial t} \Big[ H\big( X + \sqrt{t}\, Z \big) \Big] = \frac12 J\big( X + \sqrt{t}\, Z \big). \qquad (59) \]

Proof. The function (2πt)^{−n/2} e^{−|z|²/(2t)} is the density of √t Z. Let (P_t)_{t≥0} be the semigroup generated by the operator ½Δ, i.e. P_t = e^{\frac{t}{2}\Delta}, so that
\[ \tfrac12 \Delta f = \lim_{t \to 0} \frac{1}{t}\,(P_t f - f). \]
The density of X + √t Z is P_t f, where f is the density of X; hence H(X + √t Z) = H(P_t f) and
\[ \partial_t P_t f = \partial_t e^{\frac{t}{2}\Delta} f = \tfrac12 \Delta e^{\frac{t}{2}\Delta} f = \tfrac12 \Delta(P_t f). \]
A standard calculation gives
\[ \partial_t H(P_t f) = -\int \tfrac12 \Delta(P_t f)\, (\log P_t f + 1)\, dx = -\frac12 \int (\log P_t f)\, \Delta(P_t f)\, dx = \frac12 J(P_t f) = \frac12 J\big( X + \sqrt{t}\, Z \big), \]
which proves the Debruijn identity. ∎

Remarks
(I) We have
\[ \frac{\partial}{\partial t}\big( -\mathrm{Ent}_{dx}(P_t f) \big) = -\partial_t \int P_t f \log P_t f \, dx = -\frac12 \int \Delta(P_t f)\, \log P_t f \, dx = \frac12 \int \nabla P_t f \cdot \nabla \log P_t f \, dx = \frac12 J(P_t f) = 2\, \mathcal{E}_{dx}\big( \sqrt{P_t f} \big). \]
(II) Using the logarithmic Sobolev inequality (56),
\[ \partial_t \mathrm{N}(P_t f) = \frac{1}{2\pi e}\, e^{\frac{2}{n} H(P_t f)}\, \frac{2}{n}\, \partial_t H(P_t f) = \frac{1}{n}\, \mathrm{N}(P_t f)\, J(P_t f) \ge \frac{n}{n} = 1. \]

(III) Integrating the above inequality we obtain the inequality of Theorem 3.3,
\[ \mathrm{N}\big( X + \sqrt{t}\, Z \big) \ge \mathrm{N}(X) + t = \mathrm{N}(X) + \mathrm{N}\big( \sqrt{t}\, Z \big), \]
since N(√t Z) = \frac{1}{2\pi e} e^{\frac{2}{n} H(\sqrt{t} Z)} = t, the variable √t Z having the law N(0, tI_n).
(IV) Also,
\[ \Big[ \partial_t \mathrm{N}\big( X + \sqrt{t}\, Z \big) \Big]_{t=0} \ge \mathrm{N}(Z) = 1. \]

However, using the Debruijn identity,
\[ \Big[ \partial_t \mathrm{N}\big( X + \sqrt{t}\, Z \big) \Big]_{t=0} = \left[ \partial_t\, \frac{1}{2\pi e}\, e^{\frac{2}{n} H(X + \sqrt{t} Z)} \right]_{t=0} = \frac{1}{2\pi e}\, e^{\frac{2}{n} H(X)}\, \frac{2}{n} \Big[ \partial_t H\big( X + \sqrt{t}\, Z \big) \Big]_{t=0} = \frac{1}{n}\, \mathrm{N}(X)\, J(X). \]
Combining the two inequalities we obtain the logarithmic Sobolev inequality (56).

Shannon-Stam Inequality—

Theorem 3.5 Let X and Y be two independent random vectors of R^n with densities with respect to Lebesgue measure, and let λ ∈ [0, 1] be a real number. Then
\[ H\big( \sqrt{\lambda}\, X + \sqrt{1-\lambda}\, Y \big) \ge \lambda H(X) + (1-\lambda) H(Y). \qquad (60) \]

Theorem 3.6 Let X and Y be two independent random vectors of R^n with densities with respect to Lebesgue measure. Then
\[ H(X + Y) \ge H(\tilde X + \tilde Y), \qquad (61) \]
where \tilde X and \tilde Y are two independent Gaussian random vectors such that H(X) = H(\tilde X) and H(Y) = H(\tilde Y).

Rényi entropy— For a random variable X with density f ∈ L^p(R^n), the Rényi entropy of order p is defined to be
\[ H_p(X) = \frac{1}{1-p} \log E\big( f(X)^{p-1} \big) = \frac{p}{1-p} \log \|f\|_p. \qquad (62) \]
The entropy H_p is continuous in p, and we set H_1 := H and H_0(X) := log |{f > 0}|, where |C| denotes the Lebesgue measure of C.

Theorem 3.7 Let 0 < r ≤ ∞and λ ∈ [0, 1]. Also let p and q such that 1/p0 = λ/r0 and 1/q 0 = (1 − λ)/r0 . If X and Y are two random vectors of Rn with the entropies Hp (X) and Hq (Y ) well defined, then √ √ Hr ( λX + 1 − λY ) − λHp (X) − (1 − λ)Hq (Y ) ≥ Hr (Z) − λHp (Z) − (1 − λ)Hq (Z), where Z is the standard gaussian random vector in Rn . Choosing r = 1, i.e. p = q = 1 we retrieve Theorem 3.5.

27

(63)

Uncertainty Principles— I.Cram´ er-Rao Uncertainty Principle— Sonar example— Consider x1 , .., xn observations of the independent random variables X1 , ..., Xn that for example give the position θ of a object on a axon (i.e. θ ∈ R). The question is to estimate the position θ choosing a function Y of x1 , ...xn that is called estimator. A natural choice is to consider the arithmetic average x1 + ... + xn . n This estimator has the advantage of having θ for expectation, since µ ¶ x1 + ... + xn E(Y ) = E = E(X1 ) = θ. n The calculation of this variance is : def

Y (x1 , ...xn ) =

(64)

(65)

Var(X1 ) σ 2 Var(Y ) = = . (66) n n Consider a measurable space Ω and a set of probability measures (µθ )θ∈Θ where Θ an open subset of Rn . In the precedent example (sonar example), the space Ω corresponds to Rn and µθ = N (θ, σ)⊗n , where N (θ, σ) is the gaussian law with mean θ and variance σ 2 . Also the likelihood Lθ is Lθ (x1 , ..., xn ) =

n Y

f (xi − θ),

(67)

i=1

where f is the density of the gaussian measure N (0, σ): ¶ µ 1 x2 f (x) = √ exp − 2 . 2σ 2πσ The integration on Ω with respect to the measure µ)θ is noted with Eθ and for each random variable Y Z Z Eθ (Y ) = Y dµθ = Y Lθ dµ. (68) Ω



A elementary property of Likelihood— We have 28

Z 0 = ∇θ Eθ (1) = ∇θ

µ

Z Lθ dµ =

∇θ Lθ dµ = Eθ

∇θ Lθ Lθ

¶ = Eθ (∇θ log Lθ ),

since Eθ (1) = 1. As we have refered above, the matrix of covariance of ∇θ log Lθ is called matrix of Fisher information FI(θ) = Eθ ((∇θ log Lθ ) · (∇θ log Lθ )> ) For the example of sonar a simple calculation gives FI(θ) = n/σ 2 .

Cram´ er-Rao inequality in Statistics— Theorem 3.8 Suppose that the function θ 7→ Lθ is differentiable on Θ, ∇θ log Lθ is square integrable for µθ and the matrix of Fisher information is invertible. Let RY a random variable of L2 (Ω, µθ ) for each θ ∈ Θ and R ∇θ ( Lθ · Y dµ) = (∇θ Lθ · Y )dµ. Then Varθ (Y ) ≥ ∇θ Eθ (Y )> · FI(θ)−1 · ∇θ Eθ (Y ).

(69)

Proof Let Y be a random variable that satisfies the hypotheses of the theorem. We have Z ∇θ Eθ (Y ) = ∇θ

Z Y Lθ dµ =

Y ∇θ Lθ dµ

= Eθ (Y ∇θ log Lθ ) − Eθ (Y )Eθ (∇θ log Lθ ) = Eθ ((Y − Eθ (Y ))∇θ log Lθ ). For each vector υ of Rk < υ, ∇θ Eθ (Y ) >= Eθ (< υ, ∇θ log Lθ > (Y − Eθ (Y ))), and by Cauchy-Schwarz inequality, we deduce | < υ, ∇θ Eθ (Y ) > |2 = |Eθ (< υ, ∇θ log Lθ > (Y − Eθ (Y )))|2 ≤ Eθ (< υ, ∇θ Eθ (Y ) >2 )Eθ ((Y − Eθ (Y ))2 ) 29

and since Eθ (1) = 1 we have Var(Y ) ≥

< υ, ∇θ Eθ (Y ) >2 Eθ (< υ, ∇θ log Lθ >2 )

Choosing υ = FI(θ)−1 · ∇θ Eθ (Y ) we obtain exactly the conclusion of the theorem Varθ (Y ) ≥ ∇θ Eθ (Y )> · FI(θ)−1 · ∇θ Eθ (Y ).¤ Also the Theorem 3.8 holds with Y be a multidimensional random vector and F (θ) be a vector function. Let Y an estimator A application of Theorem 3.8 gives Kθ (Y ) ≥ ∇θ F (θ)> · FI(θ)−1 · ∇θ F (θ), where Kθ (Y ) is the covariance matrix.

Theorem 3.9 Cramer-Rao Uncertainty Principle— Let X be a random vector of covariance KX . Then KX ≥ FI(X)−1 .

(70)

II.Weyl-Heisenberg Uncertainty Principle— Two random vectors of Rn of density f and g respectively two complex b f = |ϕ|2 /kϕk2 and g = functions ϕ and ψ in L2 (Rn ) such that ϕ = ψ, 2 |ψ|2 /kψk22 , where ψb denotes the Fourier transform of ψ and therefore √ √ f = | cg|2 /k cgk22 .

(71)

Theorem 3.10 Let X and Y two random vectors with covariances KX and KY respectively. Then −1 16π 2 KY − KX ≥0

and 30

16π 2 KX − KY−1 ≥ 0.

(72)

III.Beckner-Hirschman Uncertainty Principle— Theorem 3.11 Let X and Y two random vectors. Then 16π 2 N(X)N(Y ) ≥ 1

(73)

Proof b p0 ≤ cn/2 The Hausdorff-Young inequality (see Appendix) asserts that kψk p kψkp n/2 and so kϕkp0 ≤ cp kψkp where cp is the constant of Young inequality (see Appendix) and (p0 denotes the dual index of p i.e. 1/p + 1/p0 = 1). We note |ϕ|2 the density of X and |ϕ| b 2 the density of Y . By HausdorffYoung inequality we obtain log kϕkp0 − log kϕk b P ≤ log cn/2 p for 1 < p ≤ 2, p0 ≥ 2 with equality when p = p0 = 2. Deriving at p0 = 2 we have 1 1 (∂p0 log kϕkp0 )|p0 =2 = − H(|ϕ|2 ) = − H(X), 4 4 and 1 (∂p log kϕk b p )|p=2 = −(∂p log kϕk b p )|p=2 = H(Y ) 4 n/2

Also, (∂p log cp )|p=2 = −n(1 − log 2)/4 and we obtain b 2 ) ≥ n(1 − log 2) H(|φ|2 ) + H(|φ| i.e. the Beckner-Hirschman Uncertainty Principle H(X) + H(Y ) ≥ n(1 − log 2).¤ Remarks (I) By Proposition 3.1 we have N(X) ≤ |KX |1/n and N(Y ) ≤ |KY |1/n and hence the Theorem 3.11 gives 31

16π 2 |KX |1/n |KY |1/n ≥ 1

(74)

that is the Weyl-Heisenberg Uncertainty Principle in dimension n = 1. 2 (II) Setting φ(x) = 2n/2 e−π|x| f (x), we obtain the Beckner-Hirschman Uncertainty Principle for gaussian measure 1 EN (0, In ) (|∇f |2 ), 4π 4π 4π 2π where W notes Wiener transform that is defined by EntN (0, In ) (|f |2 ) + EntN (0, In ) (|Wf |2 ) ≤

2 2 \ Wf (x) = eπ|x| e−π|·| f (·)(x).

(75)

(76)

Appendix Central Limit Theorem— If X1 , X2 , ... are identically distributed random variables, E(Xn ) = µ and Var(X1 +P... + Xn ) = σ 2 , with 0 < σ < ∞, n X −nµ converges in distribution then the normal convergence holds; i.e. i=1σ√ni to a random variable with distribution N (0, 1).

Young Inequality— Let 1 ≤ r, p, q ≤ ∞ real numbers such that 1+1/r = 1/p + 1/q. Then for functions f ∈ Lp (Rn ) and g ∈ Lq (Rn ), µ kg ∗ gkr ≤

cp cq cr

¶n/2 kf kp kgkq

1/p

where cp = pp01/p0 (with 1/p + 1/p0 = 1) If 0 ≤ r, p, q ≺ 1, the inequality is inverse: µ kg ∗ gkr ≥

cp cq cr

¶n/2 kf kp kgkq

Fourier transform—The Fourier transform of f ∈ L1 (Rn )

32

Z fb(k) =

e−2πikx f (x)dx. Rn

where we denote (x, y) = x1 y1 + ... + xn yn and |x| = (x, x)1/2 for x = (x1 , ..., xn ), y = (y1 , ..., yn ) ∈ Rn . Also f ∨ (x) = fb(−x) denotes the inverse Fourier transform of an integrable function f .

Hausdorff-Young Inequality— Let 1 < p < 2 and let f ∈ L^p(R^n) ∩ L^1(R^n). Then, with 1/p + 1/p' = 1,
\[ \|\hat f\|_{p'} \le C_p^n \|f\|_p, \qquad\text{with}\qquad C_p^2 = p^{1/p}\,(p')^{-1/p'}. \]
Furthermore, equality is achieved in the above inequality if and only if f is a Gaussian function of the form f(x) = A e^{−(x, Mx) + (B, x)}, with A ∈ C, M any symmetric, real, positive-definite matrix and B any vector in C^n.

References

[1] C. Ané, S. Blachère, D. Chafaï, P. Fougères, I. Gentil, F. Malrieu, C. Roberto, G. Scheffer, Sur les inégalités de Sobolev logarithmiques, 10, Société Mathématique de France, Paris, 2000.
[2] W. Beckner, Sharp Sobolev inequalities on the sphere and the Moser-Trudinger inequality, Ann. of Math. 138 (2) (1993), 213-243.
[3] W. Beckner, Pitt's inequality and the uncertainty principle, Proc. Amer. Math. Soc. 123, no. 6 (1995), 1897-1905.
[4] W. Beckner and M. Pearson, On sharp Sobolev embedding and the logarithmic Sobolev inequalities, Bull. London Math. Soc. 30 (1998), 80-84.
[5] S. G. Bobkov, M. Ledoux, On modified logarithmic Sobolev inequalities for Bernoulli and Poisson measures, J. Funct. Anal. 156 (1998), 347-365.
[6] A. Cotsiolis, N. K. Tavoularis, On logarithmic Sobolev inequalities for higher order fractional derivatives, C. R. Acad. Sci. Paris, Ser. I 340 (2005), 205-208.
[7] R. A. Fisher, Theory of statistical estimation, Proc. Cambridge Philos. Soc. 22 (1925), 700-725.
[8] I. Ford, C. P. Kitsos, D. M. Titterington, Recent advances in nonlinear experimental design, Technometrics 31 (1989), 49-60.
[9] C. P. Kitsos, D. M. Titterington, B. Torsney, An optimal design problem in rhythmometry, Biometrics 44 (1988), 657-671.
[10] C. P. Kitsos, Fully sequential procedures in nonlinear design problems, Computational Statistics and Data Analysis 8 (1989), 13-19.
[11] C. P. Kitsos, Ch. Müller, Robust linear calibration, Statistics 27 (1995), 93-106.
[12] C. P. Kitsos, The role of covariates in experimental carcinogenesis, Biometrical Letters 35, no. 2 (1998), 95-106.
[13] L. Gross, Logarithmic Sobolev inequalities, Amer. J. Math. 97 (1975), 1061-1083.
[14] E. Nelson, The free Markoff field, J. Funct. Anal. 12 (1973), 211-227.
[15] M. Del Pino, J. Dolbeault, I. Gentil, Nonlinear diffusions, hypercontractivity and the optimal L^p-Euclidean logarithmic Sobolev inequality, J. Math. Anal. Appl. 293, no. 2 (2004), 375-388.
[16] G. Royer, Une initiation aux inégalités de Sobolev logarithmiques, Cours Spécialisés 5, Société Mathématique de France, Paris, 1999.
[17] C. E. Shannon, A mathematical theory of communication, Bell System Tech. J. 27 (1948), 379-423, 623-656.
[18] A. J. Stam, Some inequalities satisfied by the quantities of information of Fisher and Shannon, Inform. and Control 2 (1959), 255-269.
[19] F. B. Weissler, Logarithmic Sobolev inequalities for the heat-diffusion semigroup, Trans. Amer. Math. Soc. 237 (1978), 255-269.