Generalized maximum likelihood estimates for exponential families

Logarithmic affinity Exponential families Maximizing likelihood Generalized mle Divergence from ef

Imre Csiszár (1), František Matúš (2)

(1) A. Rényi Institute of Mathematics, Hungarian Academy of Sciences, Budapest
(2) Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Prague

March 6, 2007, Institute for Mathematics and its Applications, University of Minnesota



Log-affine combinations Log-affine envelope Sufficiency Kernels and truncations

P ... a probability measure (pm) on a finite set Ω: $\sum_{\omega\in\Omega} P(\omega) = 1$, each P(ω) nonnegative.

s(P) = {ω ∈ Ω : P(ω) > 0} ... the support of a pm P; P sits on s(P).

For pm's P, Q on Ω with A = s(P) ∩ s(Q) nonempty and t ∈ R, the log-affine combination of P and Q is the pm $P^t Q^{1-t}$ sitting on A and proportional to $\omega \mapsto P(\omega)^t\, Q(\omega)^{1-t}$.

Example with t = 1/2 and A = {ω2, ω3}:

                 ω1    ω2    ω3    ω4
P                1/4   1/4   1/2   0
Q                0     1/4   1/8   5/8
P^t Q^{1-t}      0     1/2   1/2   0

... log-convex combinations if 0 ≤ t ≤ 1.

$$\sum_{\omega\in\Omega} P(\omega)^t\, Q(\omega)^{1-t} \le 1,$$

tight if and only if P = Q.
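The definition of the log-affine combination can be sketched in a few lines of Python. This is only an illustration: the function name `log_affine_combination` and the point labels are hypothetical, and the values of P and Q are read off the slide's example table, whose column layout is an assumption.

```python
def log_affine_combination(P, Q, t):
    """Return P^t Q^(1-t): proportional to P(w)^t * Q(w)^(1-t) on s(P) ∩ s(Q)."""
    common = [w for w in P if P[w] > 0 and Q.get(w, 0) > 0]
    if not common:
        raise ValueError("the supports of P and Q do not intersect")
    weights = {w: P[w] ** t * Q[w] ** (1 - t) for w in common}
    total = sum(weights.values())
    # Points outside the common support get mass 0.
    return {w: weights.get(w, 0.0) / total for w in P}

# Assumed reading of the example table.
P = {"w1": 0.25, "w2": 0.25, "w3": 0.5, "w4": 0.0}
Q = {"w1": 0.0, "w2": 0.25, "w3": 0.125, "w4": 0.625}
R = log_affine_combination(P, Q, 0.5)

# Sum of P^t Q^(1-t) before normalization, for a log-convex combination.
unnormalized_sum = sum(P[w] ** 0.5 * Q[w] ** 0.5 for w in P)
```

Here `R` puts all its mass on the common support, and `unnormalized_sum` stays below 1 because P differs from Q.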



A family P of pm's on Ω is log-affine if it is closed under log-affine combinations. The log-affine envelope of a family P is the smallest (by inclusion) log-affine family that contains P.

Binomial family P = {Q_p : 0 < p < 1} of the pm's $Q_p(\omega) = \binom{n}{\omega} p^{\omega} (1-p)^{n-\omega}$ on Ω = {0, 1, ..., n}:

the log-affine combination of Q_p and Q_q at ω ∈ Ω is proportional to
$$\Big[\tbinom{n}{\omega} p^{\omega}(1-p)^{n-\omega}\Big]^{t}\,\Big[\tbinom{n}{\omega} q^{\omega}(1-q)^{n-\omega}\Big]^{1-t},$$

$Q_p^t Q_q^{1-t} = Q_r$ with $r = \dfrac{p^t q^{1-t}}{p^t q^{1-t} + (1-p)^t (1-q)^{1-t}}$,

P is log-affine,
r ranges over the whole interval (0, 1) as t ranges over R, provided p ≠ q,
P equals the envelope of any two of its pm's.
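The closure of the binomial family under log-affine combinations can be checked numerically. The sketch below is illustrative only; the function names and the particular values of n, p, q, t are arbitrary choices, not taken from the slides.

```python
from math import comb

def binomial_pm(n, p):
    """Q_p on {0, ..., n}."""
    return [comb(n, w) * p**w * (1 - p) ** (n - w) for w in range(n + 1)]

def log_affine(P, Q, t):
    """Normalized pointwise product P^t Q^(1-t) for positive pm's."""
    weights = [a**t * b ** (1 - t) for a, b in zip(P, Q)]
    total = sum(weights)
    return [x / total for x in weights]

def combined_parameter(p, q, t):
    """The parameter r with Q_p^t Q_q^(1-t) = Q_r."""
    num = p**t * q ** (1 - t)
    return num / (num + (1 - p) ** t * (1 - q) ** (1 - t))

n, p, q, t = 5, 0.3, 0.7, -1.5          # t outside [0, 1] is allowed
r = combined_parameter(p, q, t)
lhs = log_affine(binomial_pm(n, p), binomial_pm(n, q), t)
rhs = binomial_pm(n, r)
max_gap = max(abs(a - b) for a, b in zip(lhs, rhs))
```

The binomial coefficient enters both factors with total exponent t + (1 − t) = 1, which is why the combination is again binomial.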



The restriction of a pm P on Ω to A ⊆ Ω:
$$P^{A}(\omega) = \begin{cases} P(\omega), & \omega \in A,\\ 0, & \text{otherwise.}\end{cases}$$

A partition π of Ω is sufficient for a family P of pm's on Ω if $\dim\{P^{A} : P \in \mathcal{P}\} \le 1$ for any block A ∈ π.

Example: P = {P1, P2, P3}

        ω1    ω2    ω3    ω4
P1      1/4   1/4   1/4   1/4
P2      1/2   1/2   0     0
P3      0     0     1/2   1/2

{ω1, ω2}, {ω3}, {ω4} ... sufficient
{ω1}, {ω2, ω3}, {ω4} ... not sufficient
{ω1, ω2}, {ω3, ω4} ... minimal sufficient

If a partition is sufficient for P then it is sufficient also for the log-affine envelope of P.
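The sufficiency condition (restrictions to each block span a space of dimension at most 1) can be tested with exact arithmetic. The sketch below is a hedged illustration: the helper names are hypothetical, the points ω1..ω4 are indexed 0..3, and the values of P1, P2, P3 are an assumed reading of the example table.

```python
from fractions import Fraction as F
from itertools import combinations

def spans_at_most_a_line(vectors):
    """True if all vectors are pairwise proportional (span has dimension <= 1)."""
    for u, v in combinations(vectors, 2):
        for i, j in combinations(range(len(u)), 2):
            if u[i] * v[j] != u[j] * v[i]:   # a nonzero 2x2 minor
                return False
    return True

def is_sufficient(partition, family):
    """Check dim{P^A : P in family} <= 1 for every block A of the partition."""
    return all(
        spans_at_most_a_line([[P[w] for w in block] for P in family])
        for block in partition
    )

family = [
    [F(1, 4), F(1, 4), F(1, 4), F(1, 4)],   # P1
    [F(1, 2), F(1, 2), F(0), F(0)],         # P2
    [F(0), F(0), F(1, 2), F(1, 2)],         # P3
]

coarse = [[0, 1], [2, 3]]       # two blocks of size two
fine = [[0, 1], [2], [3]]       # refinement of the above
crossing = [[0], [1, 2], [3]]   # middle block mixes the two halves
```

With these inputs `is_sufficient` accepts `coarse` and `fine` and rejects `crossing`, matching the labels on the slide.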



Π ... a Markov kernel between finite sets Ω, Ω′: $\sum_{\omega'\in\Omega'} \Pi(\omega,\omega') = 1$, each $\Pi(\omega,\omega') \ge 0$.

Pm's P on Ω are transformed by Π to the pm's PΠ on Ω′.

If positive pm's P, Q are invariant to a Markov kernel Π on Ω then their log-affine combinations are invariant to Π.

$P_A = P^{A}/P(A)$ ... the truncation of P to A with P(A) > 0, the normalized restriction.

Truncations of log-affine combinations equal log-affine combinations of truncations.

Chentsov, N.N. (1972, 82): geometry of pm's, also differential; categories of pm's with Markov morphisms.
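The statement that truncation commutes with log-affine combination can be checked on a small example. This is a minimal sketch with hypothetical helper names and arbitrary positive pm's chosen for illustration.

```python
def log_affine(P, Q, t):
    """P^t Q^(1-t) for pm's given as lists; mass 0 outside the common support."""
    weights = [
        p**t * q ** (1 - t) if p > 0 and q > 0 else 0.0
        for p, q in zip(P, Q)
    ]
    total = sum(weights)
    return [x / total for x in weights]

def truncate(P, A):
    """P_A = P^A / P(A): restrict to the index set A, then normalize."""
    mass = sum(P[w] for w in A)
    if mass <= 0:
        raise ValueError("truncation needs P(A) > 0")
    return [P[w] / mass if w in A else 0.0 for w in range(len(P))]

P = [0.1, 0.2, 0.3, 0.4]
Q = [0.4, 0.3, 0.2, 0.1]
A = {0, 1, 3}
t = 2.5

truncate_then_combine = log_affine(truncate(P, A), truncate(Q, A), t)
combine_then_truncate = truncate(log_affine(P, Q, t), A)
gap = max(abs(a - b) for a, b in zip(truncate_then_combine, combine_then_truncate))
```

Both orders give a pm proportional to P(ω)^t Q(ω)^{1−t} on A, so they differ only by rounding.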


Definition Coordinatization of an ef Mean parametrization The closure of ef

Exponential family (ef, full) is a log-affine family of pm's sitting on the same set.

Fisher, R.A. (1934); Darmois, G. (1935); Koopman, B.O. (1936); Pitman, E.J.G. (1936); Chentsov, N.N. (1972, 82); Barndorff-Nielsen, O. (1978); Brown, L.D. (1986); Letac, G. (1992); ...




Example: Ω = Ω1 × Ω2 (Ω1 = Ω2 = {0, 1}), P the positive product measures on Ω.



The log-affine envelope of P_0, P_1, ..., P_d, sitting on the same set, consists of the log-affine combinations, proportional to
$$\omega \mapsto P_1^{t_1}(\omega)\cdots P_d^{t_d}(\omega)\cdot P_0^{1-t_1-\dots-t_d}(\omega).$$

With the notation $\mu = P_0$ and $f_i = \ln\frac{P_i}{P_0}$, this is
$$\omega \mapsto \exp\big[t_1 f_1(\omega) + \dots + t_d f_d(\omega)\big]\,\mu(\omega) \quad\text{or}\quad \omega \mapsto e^{\langle\theta, f(\omega)\rangle}\,\mu(\omega).$$

θ = (t_1, ..., t_d) ... the canonical parameter
f = (f_1, ..., f_d) ... the directional statistic
⟨·,·⟩ ... the scalar product on R^d



Hence, the full ef consists of the pm's
$$Q_{\mu,f,\theta}(\omega) = \exp\big[\langle\theta, f(\omega)\rangle - \Lambda_{\mu,f}(\theta)\big]\cdot\mu(\omega), \qquad \omega\in\Omega,$$
where $\theta\in\mathbb{R}^d$ and $\Lambda_{\mu,f}(\theta) = \ln\Big[\sum_{\omega\in\Omega} \exp[\langle\theta, f(\omega)\rangle]\cdot\mu(\omega)\Big]$.

On the other hand, starting with a nonzero measure μ on Ω and a directional statistic f : Ω → R^d, the family
$$\mathcal{E}_{\mu,f} = \{Q_{\mu,f,\theta} : \theta\in\mathbb{R}^d\}$$
is log-affine, and its pm's sit on s(μ).

Canonically convex ef: $\{Q_{\mu,f,\theta} : \theta\in\Theta\}$ for Θ ⊆ R^d convex.
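The construction of Q_{μ,f,θ} from a measure μ and a directional statistic f translates directly into code. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the example (counting measure on {0, 1}^2 with f the identity embedding into R^2) is chosen here for illustration because it produces positive product measures.

```python
import math
from itertools import product

def dot(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def log_laplace(mu, f, theta):
    """Lambda_{mu,f}(theta) = ln sum_w exp(<theta, f(w)>) * mu(w)."""
    return math.log(
        sum(math.exp(dot(theta, f(w))) * m for w, m in mu.items() if m > 0)
    )

def ef_pm(mu, f, theta):
    """Q_{mu,f,theta}(w) = exp(<theta, f(w)> - Lambda(theta)) * mu(w)."""
    lam = log_laplace(mu, f, theta)
    return {w: math.exp(dot(theta, f(w)) - lam) * m for w, m in mu.items()}

# Illustrative choice: counting measure on {0,1}^2, f(w) = w.
omega = list(product([0, 1], repeat=2))
mu = {w: 1.0 for w in omega}
Q = ef_pm(mu, lambda w: w, theta=(0.7, -1.2))

total = sum(Q.values())
# A 2x2 table is a product measure exactly when this cross difference vanishes.
cross_gap = Q[(0, 0)] * Q[(1, 1)] - Q[(0, 1)] * Q[(1, 0)]
```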


For Ω = {0, 1, ..., n}, $\mu(\omega) = \binom{n}{\omega}$ and the embedding f : Ω → R, f(ω) = ω,
$$Q_{\mu,f,\theta}(\omega) = e^{\theta\omega - \Lambda_{\mu,f}(\theta)}\binom{n}{\omega} \quad\text{where}\quad \Lambda_{\mu,f}(\theta) = \ln\sum_{\omega=0}^{n} e^{\theta\omega}\binom{n}{\omega} = \ln\,(1+e^{\theta})^{n}.$$

$Q_{\mu,f,\theta}(\omega) = \binom{n}{\omega} p^{\omega}(1-p)^{n-\omega}$ where $p = \frac{e^{\theta}}{1+e^{\theta}}$.

$\mathcal{E}_{\mu,f}$ is the binomial family.
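The two identities for the binomial case, Λ(θ) = n ln(1 + e^θ) and p = e^θ/(1 + e^θ), can be confirmed numerically. The snippet is a sketch with arbitrary illustrative values of n and θ.

```python
import math

n, theta = 6, 0.8

# Log-Laplace transform computed from its definition as a sum.
lam_sum = math.log(sum(math.exp(theta * w) * math.comb(n, w) for w in range(n + 1)))
# Closed form via the binomial theorem.
lam_closed = n * math.log1p(math.exp(theta))

p = math.exp(theta) / (1 + math.exp(theta))
ef_form = [math.exp(theta * w - lam_closed) * math.comb(n, w) for w in range(n + 1)]
binomial_form = [math.comb(n, w) * p**w * (1 - p) ** (n - w) for w in range(n + 1)]
max_gap = max(abs(a - b) for a, b in zip(ef_form, binomial_form))
```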


μ ... nonzero measure on Ω
f : Ω → R^d ... a directional statistic
μ_f ... the f-image of μ, a Borel measure on R^d concentrated on f(s(μ)) = {f(ω) : ω ∈ s(μ)}
cs(μ_f) ... the convex support of μ_f, the convex hull of f(s(μ)), a polytope
ri(μ_f) ... the relative interior of the polytope

Taking the mean $E_P f = \sum_{\omega\in\Omega} f(\omega)P(\omega)$ of f under P, $P \mapsto E_P f$, is a homeomorphism between $\mathcal{E}_{\mu,f}$ and ri(μ_f).



Recall
$$\Lambda_{\mu,f}(\theta) = \ln\Big[\sum_{\omega\in\Omega} e^{\langle\theta, f(\omega)\rangle}\cdot\mu(\omega)\Big] = \ln\int_{\mathbb{R}^d} e^{\langle\theta, x\rangle}\,\mu_f(dx),$$
the log-Laplace transform of the Borel measure μ_f (cumulant generating function); convex, lower-semicontinuous.

The gradient at θ,
$$\frac{\sum_{\omega\in\Omega} f(\omega)\cdot e^{\langle\theta, f(\omega)\rangle}\cdot\mu(\omega)}{\sum_{\omega\in\Omega} e^{\langle\theta, f(\omega)\rangle}\cdot\mu(\omega)} = \sum_{\omega\in\Omega} f(\omega)\cdot Q_{\mu,f,\theta}(\omega),$$
... is the mean of f under $Q_{\mu,f,\theta}$.
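That the gradient of the log-Laplace transform equals the mean of f can be checked with a central finite difference. The sketch below uses d = 1 and a made-up measure and statistic; none of the numbers come from the slides.

```python
import math

mu = {0: 1.0, 1: 2.0, 2: 0.5}     # hypothetical nonzero measure on three points
f = {0: -1.0, 1: 0.5, 2: 3.0}     # hypothetical real-valued statistic (d = 1)

def log_laplace(theta):
    return math.log(sum(math.exp(theta * f[w]) * mu[w] for w in mu))

def mean_of_f(theta):
    """Mean of f under Q_{mu,f,theta}."""
    lam = log_laplace(theta)
    return sum(f[w] * math.exp(theta * f[w] - lam) * mu[w] for w in mu)

theta, h = 0.4, 1e-5
finite_difference = (log_laplace(theta + h) - log_laplace(theta - h)) / (2 * h)
gap = abs(finite_difference - mean_of_f(theta))
```

The mean always lies strictly between the smallest and largest value of f on s(μ), in line with the mean parametrization by ri(μ_f).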



The closure cl(E_{μ,f}) of an ef in the topology of R^Ω equals
$$\bigcup_{F} \mathcal{E}_{\mu^{f^{-1}(F)},\,f},$$
where the union is over the (nonempty) faces F of cs(μ_f) and
$\mu^{f^{-1}(F)}$ ... the restriction of μ to $f^{-1}(F) \subseteq \Omega$.

⊇: $\lim_{n\to\infty} Q_{\mu,f,\theta+n\vartheta} = Q_{\mu^{f^{-1}(F)},f,\theta}$ for some F
⊆: by the mean parameterizations in the union

Taking the mean of the statistic f, $P\mapsto E_P f$, is a homeomorphism between cl(E_{μ,f}) and cs(μ_f); the component $\mathcal{E}_{\mu^{f^{-1}(F)},f}$ corresponds to ri(F).

For a ∈ cs(μ_f) denote by $R^{*}_{\mu,f}(a)$ the unique pm P of cl(E_{μ,f}) such that a = E_P f.


Likelihood function ml in log-convex families ml in ef ml in the closure of ef

sample (ω^(1), ..., ω^(n)), an n-tuple of elements of Ω
pm P on Ω

A fit between the sample and the pm can be rated by
$$P^{n}(\omega^{(1)},\dots,\omega^{(n)}) = \prod_{i=1}^{n} P(\omega^{(i)}).$$

$P \mapsto \prod_{i=1}^{n} P(\omega^{(i)})$ ... the likelihood function (fn) given the sample

Maximum likelihood (ml) principle: a maximizer of the likelihood function over a family P (ml estimate) provides the explanation of the sample.

Lambert (1760); Bernoulli (1777); Laplace (1781); Gauss (1809); Pearson (1896); Fisher (1922); ...
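The likelihood function is simply a product over the sample, which the following sketch evaluates for two candidate pm's. The sample and both pm's are invented for illustration.

```python
import math

def likelihood(P, sample):
    """Product of P(w) over the sample; 0 if some sample point is outside s(P)."""
    return math.prod(P.get(w, 0.0) for w in sample)

sample = ["a", "a", "b"]
uniform = {"a": 0.5, "b": 0.5}
empirical = {"a": 2 / 3, "b": 1 / 3}       # relative frequencies in the sample
missing_b = {"a": 1.0}                      # does not sit on the whole sample

values = {
    "uniform": likelihood(uniform, sample),
    "empirical": likelihood(empirical, sample),
    "missing_b": likelihood(missing_b, sample),
}
```

Among these three candidates the empirical distribution rates the sample highest, and a pm whose support misses a sample point gets likelihood 0.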



The likelihood fn has at most one maximizer over a log-convex P
(up to the trivial cases when it is identically 0 on P).

If the likelihood fn at P, Q ∈ P equals K > 0 then s(P) and s(Q) contain {ω^(1), ..., ω^(n)}, the log-convex combination $P^t Q^{1-t}$ makes sense, belongs to P, and
$$K = \Big[\prod_{i=1}^{n} P(\omega^{(i)})\Big]^{t}\,\Big[\prod_{i=1}^{n} Q(\omega^{(i)})\Big]^{1-t} \le \prod_{i=1}^{n} P^t Q^{1-t}(\omega^{(i)})$$
as the normalizing constant is ≥ 1, tight iff P = Q.

If P is log-affine (log-convex) then cl(P) has the same property.

The ml estimate in any closed log-convex set exists and is unique.



For P equal to the ef $\mathcal{E}_{\mu,f} = \{Q_{\mu,f,\theta} : \theta\in\mathbb{R}^d\}$, the fit between the sample ω^(1), ..., ω^(n) and $Q_{\mu,f,\theta}$ is rated by
$$\prod_{i=1}^{n} Q_{\mu,f,\theta}(\omega^{(i)}) = \prod_{i=1}^{n} \exp\big[\langle\theta, f(\omega^{(i)})\rangle - \Lambda_{\mu,f}(\theta)\big]\cdot\mu(\omega^{(i)}).$$

To maximize over θ, disregard μ(ω^(i)), take ln, and divide by n: a parametric variant of the normalized log-likelihood function
$$\theta \mapsto \langle\theta, a_f\rangle - \Lambda_{\mu,f}(\theta),$$
where $a_f = \frac{1}{n}\sum_{i=1}^{n} f(\omega^{(i)})$ is the empirical mean of f.

A maximizer θ* exists if and only if a_f ∈ ri(μ_f), in which case a_f equals the $Q_{\mu,f,\theta^*}$-mean of f. The original likelihood fn has the unique maximizer
$$Q_{\mu,f,\theta^*} = R^{*}_{\mu,f}(a_f).$$
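For the binomial family the maximization over θ can be carried out in closed form, which gives a concrete instance of the existence condition a_f ∈ ri(μ_f) = (0, n). The sketch below assumes μ(ω) = C(n, ω) and f(ω) = ω as in the binomial example; the sample and the function names are hypothetical.

```python
import math

def normalized_loglik(theta, a, n):
    """theta * a - Lambda(theta) with Lambda(theta) = n * ln(1 + e^theta)."""
    return theta * a - n * math.log1p(math.exp(theta))

def mle_theta(a, n):
    """Canonical parameter of the mle, or None when a is not in (0, n)."""
    if not 0 < a < n:
        return None
    return math.log(a / (n - a))     # solves n * e^theta / (1 + e^theta) = a

n = 5
sample = [2, 3, 1, 4, 2]                       # made-up observations in {0, ..., n}
a = sum(sample) / len(sample)                  # empirical mean of f
theta_star = mle_theta(a, n)
mean_at_star = n / (1 + math.exp(-theta_star))

# Coarse grid check that theta_star is not beaten anywhere on [-5, 5].
grid = [k / 100 for k in range(-500, 501)]
best_on_grid = max(normalized_loglik(th, a, n) for th in grid)
```

When every observation equals 0 (or n), the empirical mean is a vertex of the polytope [0, n] and `mle_theta` returns `None`: no maximizer exists inside the family itself.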



The mle in cl(E_{μ,f}) from the sample with the empirical mean a_f equals $R^{*}_{\mu,f}(a_f)$.

There is a unique face F of cs(μ_f) such that a_f ∈ ri(F); then the mle in $\mathcal{E}_{\mu^{f^{-1}(F)},f}$ exists uniquely and equals $R^{*}_{\mu^{f^{-1}(F)},f}(a_f)$, which coincides with $R^{*}_{\mu,f}(a_f)$.


ef and relative entropy mle in convex ef gmle inequality Main results

The (full, standard) exponential family E determined by a nonzero Borel measure μ on R^d consists of the pm's Q_θ with μ-densities
$$\frac{dQ_\theta}{d\mu}(x) = \exp\big[\langle\theta, x\rangle - \Lambda(\theta)\big],$$
where
$$\Lambda(\theta) = \ln\int_{\mathbb{R}^d} e^{\langle\theta,x\rangle}\,\mu(dx)$$
is the log-Laplace transform of μ and θ ranges over the effective domain of Λ,
$$\mathrm{dom}(\Lambda) = \{\theta : \Lambda(\theta) < +\infty\}.$$

$\mathcal{E}_\Xi = \{Q_\theta : \theta\in\Xi\}$ where Ξ ⊆ dom(Λ) is convex.



The likelihood, given the data x^(1), ..., x^(n) ∈ R^d, w.r.t. Q_θ:
$$\frac{dQ_\theta}{d\mu}(x^{(1)})\cdots\frac{dQ_\theta}{d\mu}(x^{(n)}) = \exp\big[\langle\theta, na\rangle - n\Lambda(\theta)\big],$$
where $a = \frac{1}{n}\sum_{i=1}^{n} x^{(i)}$ is the empirical mean.

The maximization of the normalized log-likelihood means computing
$$\Psi^{*}(a) = \Psi^{*}_{\mu,\Xi}(a) = \sup_{\theta\in\Xi}\big[\langle\theta, a\rangle - \Lambda(\theta)\big].$$




If a is the mean of some pm Q_{θ*} with θ* ∈ Ξ then
$$\Psi^{*}(a) - \big[\langle\theta, a\rangle - \Lambda(\theta)\big] = D(Q_{\theta^*}\,\|\,Q_\theta), \qquad \theta\in\Xi,$$
using the relative entropy
$$D(P\,\|\,Q) = \begin{cases}\displaystyle\int_{\mathbb{R}^d} \ln\frac{dP}{dQ}\,dP, & \text{if } P \ll Q,\\[4pt] +\infty, & \text{otherwise.}\end{cases}$$
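The identity between the likelihood gap and the relative entropy can be verified on the binomial family with Ξ = R. This is a sketch under those assumptions (μ(ω) = C(n, ω) on {0, ..., n}, so integrals become finite sums); the parameter values and function names are arbitrary.

```python
import math

def binomial_ef_pm(n, theta):
    p = 1.0 / (1.0 + math.exp(-theta))
    return [math.comb(n, w) * p**w * (1 - p) ** (n - w) for w in range(n + 1)]

def log_laplace(n, theta):
    return n * math.log1p(math.exp(theta))

def relative_entropy(P, Q):
    """D(P||Q) for pm's on a finite set with s(P) contained in s(Q)."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

n, theta_star, theta = 4, 0.5, -1.0
Q_star = binomial_ef_pm(n, theta_star)
a = sum(w * q for w, q in enumerate(Q_star))           # mean of Q_{theta*}

# Since a is the mean of Q_{theta*}, the supremum is attained at theta*.
psi_star = theta_star * a - log_laplace(n, theta_star)
lhs = psi_star - (theta * a - log_laplace(n, theta))
rhs = relative_entropy(Q_star, binomial_ef_pm(n, theta))
```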



(IEEE Trans. IT, June 2003)
If Ψ*(a) is finite then a unique pm $R^{*}_{\mu,\Xi}(a)$ exists such that
$$\Psi^{*}(a) - \big[\langle\theta, a\rangle - \Lambda(\theta)\big] \ge D(R^{*}_{\mu,\Xi}(a)\,\|\,Q_\theta), \qquad \theta\in\Xi$$
(a nonconstructive existence proof extends to families of infinite dimension).

The pm $R^{*}(a) = R^{*}_{\mu,\Xi}(a)$ is called the generalized mle (gmle) for E_Ξ.

If a sequence θ_n in Ξ satisfies ⟨θ_n, a⟩ − Λ(θ_n) → Ψ*(a) then Q_{θ_n} → R*(a) in the variation distance.

The gmle belongs to cl_v(E_Ξ), the closure in variation distance (Annals of Probab. 2005).


Theorem: dom(Ψ*) = cc(μ) + bar(Ξ)

cc(μ) ... the convex core of μ (a special convex subset of cs(μ), containing its relative interior ri(μ))
bar(Ξ) ... the barrier cone of Ξ.

Even the instance Ξ = dom(Λ) gives a new formula for dom(Λ*).

Since no regularity conditions are imposed, the classical convex analysis of mle's has to be revisited,
$$\theta^{*} = \theta^{*}_{\mu,\Xi} : \mathrm{ri}(\mu) + \mathrm{bar}(\Xi) \to \mathrm{dom}(\Lambda),$$
to cover the cases when E_Ξ is overparameterized or a is out of the affine hull of cs(μ).



Theorem: For a ∈ ri(μ) + bar(Ξ), the gmle R*(a) equals Q_{θ*(a)} ∈ E.
... this is a revisited mle.

Theorem: If $\Psi^{*}_{\mu,\Xi}(a)$ is finite then the gmle $R^{*}_{\mu,\Xi}(a)$ equals the gmle $R^{*}_{\nu,\Xi}(a)$, where ν is the restriction of μ to cl(G) for a special face $G = G^{*}_{\mu,\Xi}(a)$ of cc(μ), and $R^{*}_{\nu,\Xi}(a)$ obtains by the revisited mle
(a proof by induction on the dimension of aff(μ)).


The gmle mapping $R^{*} : a \mapsto R^{*}(a)$, on dom(Ψ*)

The range of R* consists of the pm's P ∈ cl_v(E_Ξ) with means
(assuming Ξ intersects the interior of dom(Λ)).

The inverse image {a : R*(a) = P} is a shifted cone
(not necessarily convex).

The gmle mapping is continuous, assuming
dom(Ψ*) has the topology of the graph of Ψ*,
cl_v(E_Ξ) has the topology of variation distance.

If the mle in cl_v(E_Ξ) exists then it coincides with the gmle for E_Ξ.


Conjugation ef and relative entropy Maximizing divergence from an EF

The Fenchel conjugate of the log-Laplace transform of μ_f,
$$\Lambda^{*}_{\mu,f}(a) = \sup_{\theta\in\mathbb{R}^d}\big[\langle\theta, a\rangle - \Lambda_{\mu,f}(\theta)\big], \qquad a\in\mathbb{R}^d,$$
is finite if and only if a ∈ cs(μ_f).

For the binomial family, Λ*_{μ,f} is finite on [0, n] (can be computed explicitly).

For ε > 0 small,
$$\Lambda^{*}(\varepsilon) = \varepsilon\ln\varepsilon + \varepsilon[-1-\ln n] + o(\varepsilon).$$
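For the binomial family the conjugate can be written out explicitly. The closed form used below, Λ*(a) = a ln(a/n) + (n − a) ln(1 − a/n) with 0 ln 0 = 0, is not stated on the slide; it is derived here by solving the stationarity condition n e^θ/(1 + e^θ) = a under the assumption μ(ω) = C(n, ω), f(ω) = ω. The sketch compares it with a brute-force supremum and with the small-ε expansion.

```python
import math

def xlogy(x, y):
    """x * ln(y) with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def conjugate_binomial(a, n):
    """Derived closed form of the Fenchel conjugate on [0, n]."""
    if not 0 <= a <= n:
        return math.inf
    return xlogy(a, a / n) + xlogy(n - a, 1 - a / n)

n, a = 5, 2.0
grid = [k / 1000 for k in range(-10000, 10001)]
sup_on_grid = max(th * a - n * math.log1p(math.exp(th)) for th in grid)

# Small-epsilon expansion from the slide; log1p keeps the remainder accurate.
eps = 1e-6
exact = eps * math.log(eps / n) + (n - eps) * math.log1p(-eps / n)
expansion = eps * math.log(eps) + eps * (-1 - math.log(n))
relative_remainder = (exact - expansion) / eps
```

At the endpoints a = 0 and a = n the formula gives 0, a finite value that is approached but not attained by any θ.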


For the family of the positive product measures on Ω = {0, 1}^2, the conjugate is finite on a square.

By (FM 2007), starting at any boundary point a and moving inside,
$$\Lambda^{*}(a + \varepsilon(b-a)) = \Lambda^{*}(a) + C_1\cdot\varepsilon\ln\varepsilon + C_2\cdot\varepsilon + o(\varepsilon),$$
where the constants C_1, C_2 can be explicitly constructed.



The divergence of a pm P from a family E = E_{μ,f}:
$$D(P\,\|\,\mathcal{E}) = \inf_{\theta\in\mathbb{R}^d} D(P\,\|\,Q_\theta).$$

$$\begin{aligned}
D(P\,\|\,\mathcal{E}_{\mu,f}) &= \inf_{\theta\in\mathbb{R}^d} \sum_{\omega\in s(P)} \Big[\ln\frac{P(\omega)}{\mu(\omega)} - \ln\frac{Q_{\mu,f,\theta}(\omega)}{\mu(\omega)}\Big] P(\omega)\\
&= D(P\,\|\,\mu) + \inf_{\theta\in\mathbb{R}^d} \sum_{\omega\in s(P)} \Big[-\ln e^{\langle\theta, f(\omega)\rangle - \Lambda(\theta)}\Big] P(\omega)\\
&= D(P\,\|\,\mu) - \sup_{\theta\in\mathbb{R}^d} \Big[\Big\langle\theta, \sum_{\omega\in s(P)} f(\omega)P(\omega)\Big\rangle - \Lambda(\theta)\Big]\\
&= D(P\,\|\,\mu) - \Lambda^{*}(E_P f),
\end{aligned}$$
where $E_P f = \sum_{\omega} f(\omega)P(\omega)$ is the P-mean of f.

... difference of two convex functions
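The formula D(P||E) = D(P||μ) − Λ*(E_P f) can be checked against a direct computation for the binomial family. The sketch assumes μ(ω) = C(n, ω) and f(ω) = ω, uses the closed form Λ*(a) = a ln(a/n) + (n − a) ln(1 − a/n) derived from the stationarity condition (an assumption of this example, not quoted from the slides), and picks an arbitrary pm P.

```python
import math

def xlogy(x, y):
    return 0.0 if x == 0 else x * math.log(y)

def divergence_from_binomial_family(P):
    """D(P||mu) - Lambda*(E_P f) with mu(w) = C(n, w), f(w) = w."""
    n = len(P) - 1
    mean = sum(w * p for w, p in enumerate(P))
    d_mu = sum(p * math.log(p / math.comb(n, w)) for w, p in enumerate(P) if p > 0)
    conj = xlogy(mean, mean / n) + xlogy(n - mean, 1 - mean / n)
    return d_mu - conj

def kl_to_binomial(P, p):
    """D(P||Q_p) for the binomial pm Q_p with 0 < p < 1."""
    n = len(P) - 1
    return sum(
        x * math.log(x / (math.comb(n, w) * p**w * (1 - p) ** (n - w)))
        for w, x in enumerate(P)
        if x > 0
    )

P = [0.5, 0.0, 0.0, 0.5]                 # hypothetical pm on {0, 1, 2, 3}
via_formula = divergence_from_binomial_family(P)
via_matching_mean = kl_to_binomial(P, 0.5)   # binomial with the same mean 1.5
via_other_member = kl_to_binomial(P, 0.4)
```

The infimum over the family is reached at the member with the same mean as P, so the first two values agree and any other member gives a larger divergence.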



Nihat Ay's ideas and results (Annals of Probab. 2002): Maximize D(·||E). This has nice interpretations. First order optimality conditions for a pm P to be a maximizer when E_P f is inside the polytope cs(μ).

FM 2007: All directional derivatives of D(·||E) at any pm P. All first order optimality conditions.