On the Geometry of Maximum Entropy Problems

On the Geometry of Maximum Entropy Problems

Michele Pavon, Department of Mathematics, University of Padova, Italy

Joint work with Augusto Ferrante, Department of Information Engineering, University of Padova.

Courant Institute of Mathematical Sciences, NYU, April 14, 2014

Claude Shannon on Entropy

My greatest concern was what to call it. I thought of calling it "information", but the word was overly used, so I decided to call it "uncertainty". John von Neumann had a better idea, he told me, "You should call it entropy, for two reasons. In the first place your uncertainty function goes by that name in statistical mechanics. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage." ∗



Shannon as quoted in M. Tribus, E.C. McIrvine, Energy and information, Scientific American, 224 (September 1971), 178-184.

Overture: Four famous maximum entropy problems

• 1877: Boltzmann's loaded dice

• 1931: Schrödinger's Bridges

• 1967: Burg's spectral estimation method

• 1972: Dempster's covariance selection

1877: Boltzmann’s loaded dice

1877: Boltzmann's loaded dice

A reviewer of our SIAM Review paper:

"Commonly, Boltzmann's work cited in T. Cover and J. Thomas, Elements of Information Theory, is regarded as the key one, and the item 'Boltzmann's Work in Statistical Physics' of the Stanford Encyclopedia of Philosophy supports this view. Still, some doubts about Boltzmann's exact contributions do remain; I would strongly suggest the authors try to access Boltzmann's original work, and use first-hand information to resolve historical uncertainties."

L. Boltzmann, Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung resp. den Sätzen über das Wärmegleichgewicht, Wiener Berichte 76, 373-435, 1877.

1877: Boltzmann's loaded dice (cont'd)

Boltzmann's dice are... molecules!

N molecules can each take one of the p + 1 values of lebendige Kraft 0, ε, 2ε, . . . , pε (the classical vis viva originating with Gottfried Leibniz, which was actually twice the kinetic energy).

Suppose ni molecules have kinetic energy iε, i = 0, 1, . . . , p. We then have a "macrostate", a Zustandverteilung in Boltzmann's language, indexed by (n0, n1, . . . , np) and corresponding to

N! / (n0! n1! · · · np!)

"microstates" (Komplexion in Boltzmann's language), each having probability (p + 1)^(−N).

(Komplexion is a microstate, and not a macrostate as stated in J. Uffink, Boltzmann's Work in Statistical Physics, Stanford Encyclopedia of Philosophy, 2004.)

1877: Boltzmann's loaded dice (cont'd)

Suppose that the sum of the kinetic energies of all the molecules is a given quantity L = λε. Boltzmann proceeds to find the macrostate that corresponds to the largest number of microstates: he maximizes the multinomial coefficient

N! / (n0! n1! · · · np!)

under the constraints

Σ_{i=0}^{p} ni = N,    Σ_{i=0}^{p} i · ni = λ.

1877: Boltzmann's loaded dice (cont'd)

By the crude De Moivre-Stirling approximation N! ≈ e^(−N) N^N,

N! / (n0! n1! · · · np!) ≈ e^(−N) N^N / Π_{i=0}^{p} [e^(−ni) ni^(ni)] = Π_{i=0}^{p} (N/ni)^(ni) = e^(−Σ_{i=0}^{p} ni ln(ni/N)) = e^(N H(π)),    πi = ni/N, i = 0, 1, . . . , p.

Thus, for N large, the problem is almost equivalent to maximizing the Shannon entropy

H(π) = − Σ_{i=0}^{p} πi ln(πi)

under the constraint

Σ_{i=0}^{p} i · πi = λ/N.

1877: Boltzmann's loaded dice (cont'd)

The solution has the form

πi∗ = e^(μi) / Z,    Z = Σ_{i=0}^{p} e^(μi),    (1)

where μ must be such that

Σ_{i=0}^{p} i e^(μi) / Σ_{i=0}^{p} e^(μi) = λ/N.

More generally, the maximizer of the entropy H(π) subject to a linear constraint Lπ = c has the form of a Boltzmann-Gibbs distribution

πk = (1/Z) e^(−⟨Λ, Lk⟩),    (2)

where Lk is the kth column of L and Z ("zuständige Summe", pertinent sum) is a normalizing constant. Similarly in the continuous setting.
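Numerically, the multiplier in (1) can be found by bisection, since the constrained mean is monotonically increasing in μ. A minimal sketch (not from the talk), in the spirit of Jaynes's loaded-die example; the faces 1, . . . , 6 and the target mean 4.5 are illustrative choices:

```python
import math

def maxent_die(faces, target_mean, tol=1e-12):
    """Find the Boltzmann-Gibbs distribution pi_i = exp(mu*i)/Z over `faces`
    whose mean equals `target_mean`, by bisection on the multiplier mu
    (the mean is strictly increasing in mu)."""
    def mean(mu):
        w = [math.exp(mu * i) for i in faces]
        Z = sum(w)
        return sum(i * wi for i, wi in zip(faces, w)) / Z
    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)
    w = [math.exp(mu * i) for i in faces]
    Z = sum(w)
    return mu, [wi / Z for wi in w]

mu, pi = maxent_die(range(1, 7), 4.5)
print(mu)   # positive multiplier: the die is loaded toward the high faces
print(pi)   # Boltzmann-Gibbs weights, increasing in the face value
```

The same bisection applies to any single linear constraint; with several constraints one solves for a vector of multipliers instead.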

1931/32: Schrödinger's Bridges

1931/32: Schrödinger's Bridges

• Cloud of N independent Brownian particles;

• empirical distributions ρ0(x)dx and ρ1(x)dx at times t0 and t1, respectively; ρ1(x)dx differs considerably from what it should be according to the law of large numbers, namely

ρ1(x) ≠ ∫ p(t0, y, t1, x) ρ0(y) dy,    p(s, y, t, x) = [2π(t − s)]^(−n/2) exp[ −|x − y|² / (2(t − s)) ],    s < t;

• the particles have been transported in an unlikely way (N large);

• of the many unlikely ways in which this could have happened, which one is the most likely?

1931/32: Schrödinger's Bridges (cont'd)

The solution is also a Markov process (a multiplicative functional transformation of Brownian motion) which at each time has a density ρ that factors as ρ(x, t) = ϕ(x, t) ϕ̂(x, t), where ϕ and ϕ̂ solve the system

ϕ(t, x) = ∫ p(t, x, t1, y) ϕ(t1, y) dy,    ϕ(t0, x) ϕ̂(t0, x) = ρ0(x),

ϕ̂(t, x) = ∫ p(t0, y, t, x) ϕ̂(t0, y) dy,    ϕ(t1, x) ϕ̂(t1, x) = ρ1(x).

Föllmer 1986: this is a problem of large deviations of the empirical distribution on path space, connected through Sanov's theorem to a maximum entropy problem.

"Merkwürdige Analogien zur Quantenmechanik, die mir sehr des Hindenkens wert erscheinen" (remarkable analogies to quantum mechanics which seem to me very much worth pondering).
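On a discrete state space, the Schrödinger system above can be solved by alternately enforcing the two boundary conditions, i.e. by iterative proportional fitting (the Fortet/Sinkhorn scheme). A small sketch (not from the talk; the 5-point grid, the Gaussian prior kernel and the two marginals are arbitrary choices):

```python
import math

def schroedinger_bridge(P, rho0, rho1, iters=300):
    """Discrete Schroedinger system: find vectors phi_hat, phi with
        rho0[i] = phi_hat[i] * sum_j P[i][j] * phi[j]
        rho1[j] = phi[j]    * sum_i P[i][j] * phi_hat[i]
    by iterative proportional fitting (Fortet/Sinkhorn iteration)."""
    n = len(rho0)
    phi = [1.0] * n
    for _ in range(iters):
        phi_hat = [rho0[i] / sum(P[i][j] * phi[j] for j in range(n))
                   for i in range(n)]
        phi = [rho1[j] / sum(P[i][j] * phi_hat[i] for i in range(n))
               for j in range(n)]
    return phi_hat, phi

# toy prior: Gaussian transition kernel on a 5-point grid, rows normalized
xs = [k / 4.0 for k in range(5)]
P = [[math.exp(-2.0 * (x - y) ** 2) for y in xs] for x in xs]
P = [[pij / sum(row) for pij in row] for row in P]
rho0 = [0.7, 0.1, 0.1, 0.05, 0.05]   # initial empirical distribution
rho1 = [0.05, 0.05, 0.1, 0.1, 0.7]   # observed (unlikely) final distribution
phi_hat, phi = schroedinger_bridge(P, rho0, rho1)
# the joint law phi_hat[i] * P[i][j] * phi[j] has marginals rho0 and rho1
```

Since the prior kernel is strictly positive, the iteration converges; the resulting joint law is the most likely transport consistent with both observed marginals.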

1967: Burg’s spectral estimation method

1967: Burg's spectral estimation method

Suppose the covariance lags ck = E[y(k)y(0)], k = 0, 1, . . . , n − 1, of a stationary, zero-mean, Gaussian process have been estimated from data {y1, . . . , yN}. How should one extend the covariance sequence?

Burg: set the other covariance lags to values that maximize the entropy rate of the process,

hr(y) := lim_{n→∞} 1/(2n + 1) H(p_{Y[−n,n]}) = (1/4π) ∫_{−π}^{π} log Φy(e^{jϑ}) dϑ + C,

if the limit exists (Φy(e^{jϑ}) = F({ck})(e^{jϑ})). The solution is an autoregressive process of the form

y(m) = Σ_{k=1}^{n−1} ak y(m − k) + w(m),

where w is a zero-mean, Gaussian white noise (i.i.d.) sequence with variance σ². The parameters a1, . . . , an−1, σ² are such that the first n covariance lags match the given ones.
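The matching parameters a1, . . . , an−1, σ² can be computed from the given lags with the classical Levinson-Durbin recursion (a standard algorithm, not detailed in the talk). A minimal sketch, with an AR(1) sanity check:

```python
def levinson_durbin(c):
    """Levinson-Durbin recursion: from covariance lags c[0..n-1] (with the
    associated Toeplitz matrix positive definite) compute the coefficients
    a[0..n-2] and innovation variance sigma2 of the AR model
        y(m) = sum_{k=1}^{n-1} a[k-1] * y(m-k) + w(m),  Var(w) = sigma2,
    whose first n covariance lags match c."""
    a, err = [], c[0]
    for m in range(1, len(c)):
        k = (c[m] - sum(a[i] * c[m - 1 - i] for i in range(m - 1))) / err
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= 1.0 - k * k
    return a, err

# sanity check: the lags of the AR(1) process y(m) = 0.5*y(m-1) + w(m),
# Var(w) = 1, are c_j = 0.5**j / (1 - 0.25); the recursion should recover
# a = [0.5, 0] and sigma2 = 1
c = [0.5 ** j / 0.75 for j in range(3)]
a, sigma2 = levinson_durbin(c)
print(a, sigma2)
```

The recursion costs O(n²) rather than the O(n³) of a generic linear solve, exploiting the Toeplitz structure of the covariance constraints.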

1972: Dempster’s covariance selection

1972: Dempster's covariance selection

Consider a zero-mean, multivariate Gaussian distribution with density

p(x) = (2π)^(−n/2) |Σ|^(−1/2) exp( −(1/2) x⊤ Σ^(−1) x ),    x ∈ Rⁿ.

Suppose that the elements {σij ; 1 ≤ i ≤ j ≤ n, (i, j) ∈ Ī}, with (i, i) ∈ Ī for all i = 1, . . . , n, have been specified. How should Σ be completed?

Dempster: use the Principle of Parsimony of parametric model fitting. Since the elements σ^(ij) of Σ^(−1) appear as natural parameters of the model, set σ^(ij) = 0 for 1 ≤ i ≤ j ≤ n, (i, j) ∉ Ī (σ^(ij) = 0 is equivalent to the i-th and j-th components being conditionally independent given the other components).

1972: Dempster's covariance selection (cont'd)

A positive definite completion Σ◦ of Σ is called a Dempster completion if [(Σ◦)^(−1)]ij = 0 for all (i, j) ∉ Ī.

Dempster proved: when a symmetric, positive definite completion of Σ exists, there is a unique Dempster completion Σ◦. This completion maximizes the (differential) entropy

H(p) = − ∫_{Rⁿ} log(p(x)) p(x) dx = (1/2) log(det Σ) + (n/2)(1 + log(2π))

among all zero-mean Gaussian distributions having the prescribed elements {σij ; 1 ≤ i ≤ j ≤ n, (i, j) ∈ Ī}. Thus Dempster's completion Σ◦ solves a maximum entropy problem, i.e. it maximizes the entropy under linear constraints.

Our goal

We want to show that a simple geometric result suffices to derive the form of the optimal solution in a large class of finite- and infinite-dimensional maximum entropy problems concerning probability distributions, spectral densities and covariance matrices.

A variational result

Let H be a Hilbert space and let F : H → R be a functional. We say that F is Gâteaux-differentiable at h0 in direction v if the limit

F′(h0; v) := lim_{ε→0} [F(h0 + εv) − F(h0)] / ε

exists. In this case, F′(h0; v) is called the directional derivative of F at h0 in direction v.

Let V ⊆ H be a (not necessarily closed) subspace and h ∈ H. Consider the coset W := h + V. For w ∈ W and v ∈ V, (w + εv) ∈ W for all real ε, namely w is an internal point of W in direction v.

Definition 1. We say that wc is a critical point of F over W = h + V if F′(wc; v) = 0 for all v ∈ V.

A variational result (cont'd)

Theorem 1. Let W := h + V be an affine space. Assume that the functional F is Gâteaux-differentiable at wc ∈ W in any direction v ∈ V and that the Gâteaux differential is given by the linear, continuous map F′(wc; v) = ⟨DV F(wc), v⟩_H, where DV F(wc) ∈ H. Then wc is a critical point of F over W if and only if DV F(wc) ∈ V⊥. When F is actually Fréchet differentiable at wc ∈ W, wc is critical if and only if DF(wc) ∈ V⊥.

Proof. F′(wc; v) = 0 for all v ∈ V if and only if DV F(wc) belongs to the annihilator of V, i.e. ⟨DV F(wc), v⟩_H = 0 for all v ∈ V.

This result may be generalized in many ways.



Matricial variational problems

Let H = Cⁿˣⁿ (or H = Rⁿˣⁿ) be the space of n × n matrices endowed with the inner product ⟨M1, M2⟩ := trace[M1∗ M2], where ∗ denotes transposition plus conjugation.

Lemma 1. Let F(M) := log |det[M]|. If M is nonsingular then, for all δM ∈ H,

F′(M; δM) = trace[M^(−1) δM].    (3)

F is Fréchet differentiable and DF(M) = M^(−∗).

Matricial variational problems (cont'd)

We are interested in extremizing F(M) := log |det[M]| over the affine space W = A + V, where A ∈ H and V is a subspace of H.

Theorem 2. Let W = A + V be an affine space. Then a nonsingular matrix Mc ∈ W extremizes F(M) = log |det[M]| over W if and only if Mc^(−∗) ∈ V⊥.
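Lemma 1 is easy to check numerically: compare trace[M^(−1) δM] with a finite-difference quotient of F(M) = log |det M|. A minimal sketch (not from the talk; the 2 × 2 entries are arbitrary):

```python
import math

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def F(M):
    """F(M) = log |det M|."""
    return math.log(abs(det2(M)))

def inv2(M):
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d],
            [-M[1][0] / d, M[0][0] / d]]

M  = [[2.0, 0.3], [0.1, 1.5]]       # nonsingular test matrix
dM = [[0.4, -0.2], [0.7, 0.5]]      # arbitrary direction
Minv = inv2(M)
# directional derivative predicted by Lemma 1: trace(M^{-1} dM)
pred = sum(Minv[i][k] * dM[k][i] for i in range(2) for k in range(2))
eps = 1e-7
Me = [[M[i][j] + eps * dM[i][j] for j in range(2)] for i in range(2)]
fd = (F(Me) - F(M)) / eps           # finite-difference quotient
print(pred, fd)                     # the two numbers agree up to O(eps)
```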

Back to Dempster's covariance selection

The affine space W is the set of symmetric matrices having elements {σij ; 1 ≤ i ≤ j ≤ n, (i, j) ∈ Ī}. The entropy ((1/2) log det Σ) must be maximized on the intersection between W and the convex cone {Σ ≥ 0}.

• W is a translation of the subspace V of symmetric matrices having zeros in the positions Ī;

• (i, i) ∈ Ī for all i = 1, . . . , n, i.e. the [Σ]ii are all fixed, so that |[Σ]ij| ≤ √([Σ]ii [Σ]jj) and hence the feasible set is bounded;

• as Σ approaches the boundary of the cone, H(p) tends to −∞.

Thus, when the problem is feasible, we may seek the optimal solution in the interior of the cone {Σ > 0}.
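In the smallest nontrivial case the optimal completion can be exhibited explicitly: for a 3 × 3 covariance with only σ13 unspecified, the Dempster completion has the known closed form σ13 = σ12 σ23 / σ22, which makes components 1 and 3 conditionally independent given component 2. A minimal sketch (not from the talk; the numerical entries are arbitrary) checking that Σ^(−1) then vanishes in position (1, 3):

```python
def inv3(M):
    """Inverse of a 3x3 matrix via the adjugate (enough for this demo)."""
    a, b, c = M[0]
    d, e, f = M[1]
    g, h, i = M[2]
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[x / det for x in row] for row in adj]

# prescribed entries of Sigma (everything except sigma_13)
s11, s12, s22, s23, s33 = 2.0, 0.8, 1.0, 0.5, 3.0
s13 = s12 * s23 / s22          # candidate maximum entropy completion
Sigma = [[s11, s12, s13],
         [s12, s22, s23],
         [s13, s23, s33]]
K = inv3(Sigma)                # concentration matrix Sigma^{-1}
print(K[0][2])                 # vanishes: Sigma is a Dempster completion
```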

Back to Dempster's covariance selection (cont'd)

Repeating locally the argument of Theorem 2, we conclude that the maximum entropy completion Σc is such that Σc^(−1) ∈ V⊥.

Let ei denote the i-th canonical vector in Rⁿ. For (i, j) ∈ I := Īᶜ, the rank one matrix ei ej⊤ belongs to V. If M ∈ V⊥, we must have

0 = trace[(ei ej⊤)⊤ M] = trace[ej ei⊤ M] = ei⊤ M ej = [M]ij,    ∀(i, j) ∈ I.

Thus V⊥ is the space of matrices having zeros in I, the complement of Ī. Hence the maximum entropy completion Σc is a Dempster completion.

General matrix completion (no positivity, nonsquare matrices): A. Ferrante and M. Pavon, Matrix Completion à la Dempster by the Principle of Parsimony, IEEE Trans. Information Theory, Vol. 57, 3925-3931, June 2011.

Matricial variational problems

Let now H = L²(T, Cⁿˣⁿ) be the Hilbert space of Cⁿˣⁿ-valued functions defined on the unit circle T, with scalar product

⟨Φ, Ψ⟩_H := (1/2π) ∫_{−π}^{π} trace[ Φ(e^{jθ})∗ Ψ(e^{jθ}) ] dθ.

Let W = A + V be an affine space in L²(T, Cⁿˣⁿ), namely A ∈ L²(T, Cⁿˣⁿ) and V a subspace of L²(T, Cⁿˣⁿ). By Theorem 1:

Theorem 3. Let W = A + V be as above and let Φc ∈ W be such that Φc^(−1) ∈ L²(T, Cⁿˣⁿ). Then Φc extremizes

F(Φ) = (1/2π) ∫_{−π}^{π} log |det[Φ(e^{jθ})]| dθ

over W if and only if DF(Φc) = Φc^(−∗) ∈ V⊥.

Back to Burg's spectral estimation method

Let Ck, k = 0, 1, . . . , n − 1, of dimension m × m, be some estimated covariance lags of an unknown stationary Gaussian process y.

Burg's problem: find a stationary process y with spectral density Φy which maximizes the index

F(Φ) = (1/2π) ∫_{−π}^{π} log det Φy(e^{jϑ}) dϑ    (4)

among all spectral densities having Ck, k = 0, 1, . . . , n − 1, as first n Fourier coefficients. Assume that the block-Toeplitz matrix

Σn = [ C0      C1      · · ·   Cn−1
       C1∗     C0      · · ·   Cn−2
       · · ·   · · ·   · · ·   · · ·
       Cn−1∗   Cn−2∗   · · ·   C0   ]    (5)

is positive definite. Then there are infinitely many spectra having the prescribed Fourier coefficients.

Back to Burg's spectral estimation method (cont'd)

Consider the matrix pseudo-polynomial

P(e^{jϑ}) = Σ_{k=−n+1}^{n−1} Ck e^{−jϑk},    C−k := Ck∗.

• Vn is the subspace of L²(T, Cm×m) of functions whose Fourier coefficients Ri vanish for all i = −n + 1, . . . , n − 1 and obey the symmetry constraint Ri = R−i∗;

• the affine space W is defined by W = P + Vn;

• S is the convex cone of coercive spectra with bounded inverse.

Then the constraint is Φ ∈ W ∩ S; since F is strictly concave on S, the extremizer Φc is a maximum point.

Back to Burg's spectral estimation method (cont'd)

By Theorem 3, this maximum point Φc is such that Φc^(−∗) ∈ Vn⊥. Observe, moreover, that, being a spectrum, Φc = Φc∗, and that Vn⊥ is given by the matricial pseudo-polynomials of the form

Q(e^{jϑ}) = Σ_{k=−n+1}^{n−1} Ak e^{−jϑk},    A−k = Ak∗.

We conclude that the optimal spectrum has the form

Φc(e^{jϑ}) = [ Σ_{k=−n+1}^{n−1} A◦k e^{−jϑk} ]^(−1),    A◦−k = (A◦k)∗,

for some matrices A◦k, k = −n + 1, . . . , 0, . . . , n − 1, chosen so as to satisfy the constraints on the first n coefficients. Thus, the solution process is an AR process!

A more general moment problem

Considerable generalization of Burg's problem (Byrnes, Georgiou and Lindquist and co-workers over the past 15 years):

y(t) → [ G(z) ] → x(t),    G(z) = (zI − A)^(−1) B,

where y is an unknown, stationary, purely nondeterministic, Cm-valued process, |λ(A)| < 1, B has full column rank, and (A, B) is a reachable pair. Let x be the n-dimensional stationary output process

xk+1 = A xk + B yk,    k ∈ Z.    (6)

Let Σ = E{xk xk∗} be the covariance of xk. The spectrum Φ must then satisfy the moment constraint

∫ G Φ G∗ = Σ.    (7)

A more general moment problem (cont'd)

Consider the problem of determining the spectral densities Φ satisfying (7) for a given Σ > 0. The covariance extension problem is the special case corresponding to G(z) := [z^(−k) I | z^(−k+1) I | · · · | z^(−1) I]⊤ and Σ equal to the block-Toeplitz matrix in (5).

• One can design G to obtain higher resolution in prescribed frequency bands!

• The framework also includes the celebrated Nevanlinna-Pick interpolation problem, of fundamental importance in various H∞ control problems!

A more general moment problem (cont'd)

Let H = L²(T, Hm), Hm denoting the Hermitian m × m matrices. Consider now the linear operator

Γ : L²(T, Hm) → Hn,    Φ ↦ ∫ G Φ G∗.

The moment constraint is feasible if and only if Σ belongs to Range Γ.

Generalization of Burg's problem: maximize the entropy index ∫ log det Φ subject to ∫ G Φ G∗ = Σ. Suppose that Φ0 satisfies the moment constraint. Then

W = Φ0 + V,    where V = {Φ : ∫ G Φ G∗ = 0},

in other words, V = ker Γ.

A more general moment problem (cont'd)

The constraint can be expressed as Φ ∈ W ∩ S, where S is the convex cone of coercive spectra with bounded inverse. Since

⟨∫ G Φ G∗, M⟩_{Hn} := ∫ tr(G Φ G∗ M) = ∫ tr(Φ G∗ M G) = ⟨Φ, G∗ M G⟩_H,

we have that the adjoint of Γ is

Γ∗ : Hn → C(T, Hm),    M ↦ G∗ M G.

In particular, Range Γ∗ = {Φ = G∗ M G, M ∈ Hn}. Since Range Γ∗ is finite-dimensional, it is necessarily closed and we have

V⊥ = [ker Γ]⊥ = Range Γ∗ = {Φ = G∗ M G, M ∈ Hn}.

By Theorem 3, the maximum point Φc is such that Φc^(−∗) ∈ V⊥.

A more general moment problem (cont'd)

Hence, the optimal spectrum has the form

Φc(e^{jθ}) = [ G(e^{jθ})∗ Λc G(e^{jθ}) ]^(−1),    (8)

for some Hermitian Λc such that G(e^{jθ})∗ Λc G(e^{jθ}) > 0 on T and the constraint is satisfied, namely

∫ G [G∗ Λc G]^(−1) G∗ = Σ.

Indeed, Georgiou showed in 2002 that the unique solution of the generalized Burg problem has the form (8) with

Λc = Σ^(−1) B (B∗ Σ^(−1) B)^(−1) B∗ Σ^(−1).    (9)
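Formula (9) can be sanity-checked numerically: generate a feasible Σ by driving G(z) with unit-variance white noise, so that Σ solves the Lyapunov equation Σ = AΣA⊤ + BB⊤; then build Λc as in (9) and verify the moment constraint by quadrature on the unit circle. A sketch (not from the talk; the 2-state pair (A, B), with |λ(A)| < 1 and (A, B) reachable, is an arbitrary choice):

```python
import cmath

n = 2
A = [[0.5, 0.2], [0.0, 0.3]]
B = [1.0, 0.5]                        # single input, m = 1

# feasible Sigma: fixed-point iteration for Sigma = A Sigma A' + B B'
Sigma = [[0.0] * n for _ in range(n)]
for _ in range(300):                  # converges since |lambda(A)| < 1
    AS = [[sum(A[i][k] * Sigma[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    Sigma = [[sum(AS[i][k] * A[j][k] for k in range(n)) + B[i] * B[j] for j in range(n)] for i in range(n)]

def inv2(M):
    d = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

Si = inv2(Sigma)
SiB = [sum(Si[i][k] * B[k] for k in range(n)) for i in range(n)]
gamma = sum(B[i] * SiB[i] for i in range(n))     # B* Sigma^{-1} B (scalar since m = 1)
Lam = [[SiB[i] * SiB[j] / gamma for j in range(n)] for i in range(n)]   # formula (9)

# quadrature check of (1/2pi) int G Phi_c G* dtheta = Sigma, with Phi_c as in (8)
N = 2048
I = [[0.0] * n for _ in range(n)]
for k in range(N):
    z = cmath.exp(2j * cmath.pi * k / N)
    M = [[z - A[0][0], -A[0][1]], [-A[1][0], z - A[1][1]]]    # zI - A
    d = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    G = [(M[1][1] * B[0] - M[0][1] * B[1]) / d,               # G = (zI - A)^{-1} B
         (-M[1][0] * B[0] + M[0][0] * B[1]) / d]
    quad = sum((G[i].conjugate() * Lam[i][j] * G[j]).real
               for i in range(n) for j in range(n))           # G* Lam_c G, real and > 0
    for i in range(n):
        for j in range(n):
            I[i][j] += (G[i] * G[j].conjugate()).real / quad / N
print(I)   # reproduces Sigma up to quadrature error
```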

Variational entropy problems with "prior"

Recall: the relative entropy, or Kullback-Leibler pseudo-distance, or divergence, between two probability densities p and q, with the support of p contained in the support of q, is defined by

D(p‖q) := ∫_{Rⁿ} p(x) log [ p(x) / q(x) ] dx.

Consider two zero-mean, jointly Gaussian, stationary, purely nondeterministic processes y = {yk; k ∈ Z} and z = {zk; k ∈ Z} taking values in Rm, and consider the relative entropy rate Dr(y‖z) between y and z defined as

Dr(y‖z) := lim_{n→∞} 1/(2n + 1) D(p_{Y[−n,n]} ‖ p_{Z[−n,n]}),

if the limit exists.

Variational entropy problems with "prior" (cont'd)

Theorem 4 (M. Pinsker). Let y = {yk; k ∈ Z} and z = {zk; k ∈ Z} be Rm-valued, zero-mean, Gaussian, stationary, purely nondeterministic processes with spectral density functions Φy and Φz, respectively. Assume, moreover, that either Φy Φz^(−1) is bounded, or Φy ∈ L²(−π, π) and Φz is coercive (i.e. ∃ α > 0 such that Φz(e^{jϑ}) − α Im > 0 a.e. on T). Then

Dr(y‖z) = (1/4π) ∫_{−π}^{π} { log det[ Φy(e^{jϑ})^(−1) Φz(e^{jϑ}) ] + tr[ Φz(e^{jϑ})^(−1) (Φy(e^{jϑ}) − Φz(e^{jϑ})) ] } dϑ =: F(Φ, Ψ),    (10)

where Φ = Φy and Ψ = Φz. This index has the form of a multivariate Itakura-Saito divergence from speech processing.
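A minimal sketch (not from the talk) of this index in the scalar case, evaluated by a Riemann sum; the AR(1)-type spectrum and flat prior below are arbitrary choices. The index is nonnegative and vanishes when the two spectra coincide:

```python
import math

def dr_scalar(phi_y, phi_z, N=4096):
    """Scalar version of the relative entropy rate (10):
    (1/4pi) int_{-pi}^{pi} [ log(phi_z/phi_y) + (phi_y - phi_z)/phi_z ] dtheta,
    approximated by a Riemann sum on N points."""
    acc = 0.0
    for k in range(N):
        th = -math.pi + 2.0 * math.pi * k / N
        y, z = phi_y(th), phi_z(th)
        acc += math.log(z / y) + (y - z) / z
    return acc * (2.0 * math.pi / N) / (4.0 * math.pi)

# AR(1) spectrum with pole at 0.5 (|e^{j th} - 0.5|^2 = 1.25 - cos th) vs. a flat prior
phi_y = lambda th: 1.0 / (1.25 - math.cos(th))
phi_z = lambda th: 1.0
print(dr_scalar(phi_y, phi_z))   # strictly positive
print(dr_scalar(phi_y, phi_y))   # zero
```

Pointwise the integrand is r − 1 − log r with r the spectral ratio, which is nonnegative and vanishes only at r = 1; this is the scalar Itakura-Saito distortion.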

Variational entropy problems with "prior" (cont'd)

Generalized moment problem with prior: minimize F(Φ, Ψ) subject to ∫ G Φ G∗ = Σ. The solution is sought in W ∩ S as before, and, as before,

V⊥ = {Φ = G∗ M G, M ∈ Hn};

only the entropy functional is different. The orthogonality condition reads Φc^(−1) − Ψ^(−1) ∈ V⊥, namely the optimal spectrum has the form

Φc = [ Ψ^(−1) + G∗ Λc G ]^(−1),    Λc ∈ Hn,

where Λc permits satisfaction of the moment constraint.

Details in: A. Ferrante, C. Masiero and M. Pavon, Time and spectral domain relative entropy: A new approach to multivariate spectral estimation, IEEE Trans. Aut. Control, Vol. 57-10: 2561-2575, 2012.

Maximum Shannon entropy problems

Let (X, X, μ) be a finite measure space. Let ϕi, i = 1, . . . , d, be in L²(X, X, μ) and α ∈ R^d. Problem: find p ≥ 0 in L∞(X, X, μ) maximizing the Shannon entropy

F(p) = Hμ(p) = − ∫_X log[p(x)] p(x) dμ

under the constraints

∫_X p(x) dμ = 1,    ∫_X ϕi(x) p(x) dμ = αi,    i = 1, . . . , d.    (11)

Let pc ∈ L∞(X, X, μ) be nonnegative and bounded away from zero μ-a.e., and let δp ∈ L∞(X, X, μ). Then

F′(pc; δp) = ∫_X [−1 − log pc(x)] δp(x) dμ = ⟨−1 − log pc, δp⟩_H.

Maximum Shannon entropy problems (cont'd)

Suppose there exists p0 > 0, p0 ∈ L∞(X, X, μ), satisfying the constraints. Then p ∈ L∞(X, X, μ) also satisfies the constraints if it belongs to the affine space p0 + V, where V is the subspace of functions f ∈ L∞(X, X, μ) such that

∫_X f(x) dμ = 0,    (12)

∫_X ϕi(x) f(x) dμ = 0,    i = 1, . . . , d.    (13)

Observe now that V⊥ is the subspace of functions of the form ϑ0 + Σ_{i=1}^{d} ϑi ϕi(x). Observe also that for pc bounded and bounded away from zero as above, log pc also belongs to L∞(X, X, μ) and, consequently, to L²(X, X, μ).

Maximum Shannon entropy problems (cont'd)

By Theorem 1 we conclude that (−1 − log pc) ∈ V⊥, namely pc must be of the form

pc(x) = C exp( Σ_{i=1}^{d} ϑi ϕi(x) ).    (14)

If there exist constants C and ϑi, i = 1, . . . , d, such that pc belongs to L∞(X, X, μ), is bounded away from zero μ-a.e. and satisfies the constraints, then pc is indeed optimal, by concavity of the entropy. This is just the well-known fact that, if the maximizer exists, it belongs to the exponential family.

References

This talk is based on: M. Pavon and A. Ferrante, On the geometry of maximum entropy problems, SIAM Review, Vol. 55-3: 415-439, Sept. 2013. The paper also contains a new application to reciprocal processes on the discrete circle with a prior.

Further applications:

On well-posedness: A. Ferrante, M. Pavon and M. Zorzi, A maximum entropy enhancement for a family of high-resolution spectral estimators, IEEE Trans. Aut. Control, Vol. 57, Issue 2, Feb. 2012, 318-329.

Reciprocal processes on the discrete circle: F. Carli, A. Ferrante, M. Pavon, and G. Picci, A Maximum Entropy solution of the Covariance Extension Problem for Reciprocal Processes, IEEE Trans. Aut. Control, Vol. 56: 1999-2012, September 2011.

F. Carli, A. Ferrante, M. Pavon, and G. Picci, An Efficient Algorithm for Maximum-Entropy Extension of Block-Circulant Covariance Matrices, Linear Algebra and its Applications, Vol. 439: 2309-2329, 2013.

Motto over the entrance to Plato's Academy: "Ἀγεωμέτρητος μηδεὶς εἰσίτω", namely "Let no one untrained in geometry enter."