How to solve (almost) any maximum likelihood problem

Ian Fellows Ph.D.
Fellows Statistics http://www.fellstat.com

April 16, 2013


What we are going to talk about...

- Exponential families
- MLE problem formulation and basic algorithm
- Background on MCMC and friends
- Trust regions for MCMC-MLE
- Better likelihood approximations
- An example


Intro to Exponential-Families


In the Beginning There Were Exponential-Family Distributions...

Let $T$ be a random variate with realization $t$. The general exponential-family model for $T$ is
\[
P(T = t \mid \eta) = \frac{1}{c(\eta)} e^{\eta \cdot g(t) + o(t)}, \tag{1}
\]
where $g$ is a vector-valued function generating sufficient statistics for $T$, $o$ is an offset statistic, and $c$ is the normalizing constant
\[
c(\eta) = \int_{t \in \mathcal{N}} e^{\eta \cdot g(t) + o(t)}\,dt. \tag{2}
\]
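To make (1) and (2) concrete: a minimal sketch for a family with finite support, where the integral in (2) reduces to a sum. The helper name `exp_family_pmf` and the Bernoulli statistic $g(t) = t$ are illustrative choices, not anything from the talk.

```python
import numpy as np

def exp_family_pmf(eta, support, g, o=lambda t: 0.0):
    """P(T = t | eta) over a finite support, per equations (1)-(2)."""
    eta = np.atleast_1d(eta)
    # Unnormalized log-masses eta . g(t) + o(t) for each point of the support
    log_w = np.array([eta @ np.atleast_1d(g(t)) + o(t) for t in support])
    log_c = np.logaddexp.reduce(log_w)      # log c(eta), summed stably
    return np.exp(log_w - log_c)

# Bernoulli as an exponential family: g(t) = t on support {0, 1}
support = [0, 1]
pmf = exp_family_pmf(eta=0.5, support=support, g=lambda t: t)
mu = sum(p * t for p, t in zip(pmf, support))   # mean-value parameter E[g(T)]
print(pmf, mu)   # mu equals the logistic function exp(0.5)/(1 + exp(0.5))
```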


In the Beginning There Were Exponential-Family Distributions...

The mean-value parameters:
\[
\mu_\eta = E_\eta\big(g(T)\big)
\]


In the Beginning There Were Exponential-Family Distributions...

Uniform:
\[
P(T = t \mid \eta) = \frac{1}{c(\eta)} e^{\eta (0 \cdot t)}, \qquad t \in [0, 1]
\]
Bernoulli:
\[
P(T = t \mid \eta) = \frac{1}{c(\eta)} e^{\eta t}, \qquad t \in \{0, 1\}
\]
Exponential:
\[
P(T = t \mid \eta) = \frac{1}{c(\eta)} e^{\eta t}, \qquad t \in [0, \infty)
\]
Multivariate normal:
\[
P(T = t \mid \eta) = \frac{1}{c(\eta)} e^{\eta_1 \cdot t + \eta_2 \cdot t t'}, \qquad t \in \mathbb{R}^k
\]


What is Network Data?


Network Data

ERGM:
\[
P(Y = y \mid \eta, X = x) = \frac{1}{c(\eta, x)} e^{\eta \cdot g((y,x)) + o((y,x))}
\]
Gibbs/Markov random field:
\[
P(X = x \mid \eta, Y = y) = \frac{1}{c(\eta, y)} e^{\eta \cdot g((y,x)) + o((y,x))}
\]
(NEW!!) Exponential-Family Random Network Model (ERNM):
\[
P(Y = y, X = x \mid \eta) = \frac{1}{c(\eta)} e^{\eta \cdot g((y,x)) + o((y,x))}
\]


Finding the MLE


Finding the MLE in Any Exponential-Family Distribution: Geyer-Thompson

If $t$ is completely observed, the log-likelihood ratio is
\[
\ell(\eta) - \ell(\eta_0) = (\eta - \eta_0) \cdot g(t) - \log\!\left[E_{\eta_0}\!\left(e^{(\eta - \eta_0)\cdot g(T)}\right)\right],
\]
and its first derivative is
\[
\frac{\partial \ell}{\partial \eta} = g(t) - E_\eta\big(g(T)\big)
\]


Finding the MLE in Any Exponential-Family Distribution: Geyer-Thompson

Suppose that we have $k$ samples $t_i$ from $P(T = t \mid \eta_0)$. Then
\[
Z = \log\!\left[E_{\eta_0}\!\left(e^{(\eta - \eta_0)\cdot g(T)}\right)\right]
\approx \hat{Z}_\infty = \log\!\left[\frac{1}{k} \sum_i e^{(\eta - \eta_0)\cdot g(t_i)}\right]
\]
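A minimal sketch of this approximation, assuming the sampled statistics $g(t_i)$ are already in hand as a $k \times p$ matrix; the log-mean-exp keeps the computation numerically stable, and the data below are simulated stand-ins rather than draws from any particular model.

```python
import numpy as np
from scipy.special import logsumexp

def loglik_ratio(eta, eta0, g_obs, g_samples):
    """Approximate l(eta) - l(eta0) from k sampled statistic vectors g_samples
    (k x p) drawn from the model at eta0; g_obs is g(t) for the observed data."""
    delta = np.asarray(eta) - np.asarray(eta0)
    exponents = g_samples @ delta                          # (eta - eta0) . g(t_i)
    z_hat = logsumexp(exponents) - np.log(len(exponents))  # log-mean-exp
    return delta @ g_obs - z_hat

# Stand-in example: p = 2 statistics, k = 1000 draws
rng = np.random.default_rng(0)
g_samples = rng.normal(size=(1000, 2))
print(loglik_ratio([0.1, -0.2], [0.0, 0.0], g_obs=np.array([0.5, 0.3]),
                   g_samples=g_samples))
```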



Questions:
- How can we sample from $P(T = t \mid \eta_0)$?
- What should the initial $\eta_0$ be?
- How far can we trust the sample-based approximation of the likelihood?
- Are there any better approximations to $Z$?


Background toolkit

Some Background on Markov Chain Monte Carlo Methods


Background toolkit: MCMC

We need to sample from $P(T = t \mid \eta)$, but we can't solve the integral...

Suppose $t = (t_1, t_2, \ldots, t_n)$. Then
\[
P(T_i = t_i \mid \eta, t_{-i}) = \frac{1}{c_i(\eta)} e^{\eta \cdot g(t) + o(t)},
\]
where
\[
c_i(\eta) = \int_{t_i} e^{\eta \cdot g(t) + o(t)}\,dt_i
\]


Background toolkit: MCMC

We can calculate $c_i$. For example, if $t_i \in \{0, 1\}$,
\[
c_i(\eta) = e^{\eta \cdot g(t^-) + o(t^-)} + e^{\eta \cdot g(t^+) + o(t^+)},
\]
where $t^+ = (t_1, t_2, \ldots, t_i = 1, \ldots, t_n)$ and $t^- = (t_1, t_2, \ldots, t_i = 0, \ldots, t_n)$.
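In this binary case the full conditional is a logistic function of $\eta \cdot (g(t^+) - g(t^-))$ plus the offset difference. A minimal sketch, with an illustrative chain statistic standing in for $g$:

```python
import numpy as np

def cond_prob_one(eta, t, i, g, o=lambda t: 0.0):
    """P(T_i = 1 | eta, t_{-i}) for binary t: the two-term normalizer makes
    this a logistic function of eta.(g(t+) - g(t-)) + o(t+) - o(t-)."""
    t_plus, t_minus = t.copy(), t.copy()
    t_plus[i], t_minus[i] = 1, 0
    logit = np.asarray(eta) @ (g(t_plus) - g(t_minus)) + o(t_plus) - o(t_minus)
    return 1.0 / (1.0 + np.exp(-logit))

# Illustrative statistic: (number of ones, number of adjacent equal pairs)
def g(t):
    return np.array([t.sum(), (t[:-1] == t[1:]).sum()])

t = np.array([1, 0, 0, 1, 1])
print(cond_prob_one(eta=[0.2, 0.5], t=t, i=2, g=g))
```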


Background toolkit: MCMC

Okay, so we can sample from $P(t_i \mid \eta, t_{-i})$, but what does that get us? We wanted to sample from $P(t \mid \eta)$.


Background toolkit: MCMC

Gibbs sampling to the rescue...
1. Start with $t^{(1)}$.
2. Select $i$ from Uniform$(1, \ldots, n)$.
3. Draw $t^{(2)}$ from $P(t_i \mid \eta, t_{-i})$.
4. Rinse and repeat.

Figure: Trace plot of $g(t^{(j)})$ over 10,000 Gibbs iterations $j$.
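A minimal sketch of the sweep above for binary data, re-inlining the two-term full conditional from the previous slide; the statistic $g$ is again an illustrative stand-in.

```python
import numpy as np

def gibbs_sample(eta, g, n, n_iter, rng):
    """Gibbs sampler for binary exponential-family data: each step resamples
    one randomly chosen coordinate from its full conditional."""
    eta = np.asarray(eta)
    t = rng.integers(0, 2, size=n)              # step 1: a starting state t^(1)
    stats = []
    for _ in range(n_iter):
        i = rng.integers(n)                     # step 2: pick a coordinate
        t_plus, t_minus = t.copy(), t.copy()
        t_plus[i], t_minus[i] = 1, 0
        p1 = 1.0 / (1.0 + np.exp(-eta @ (g(t_plus) - g(t_minus))))
        t[i] = rng.random() < p1                # step 3: draw t_i | eta, t_{-i}
        stats.append(g(t))                      # record g(t^(j)) for the trace
    return np.array(stats)

# Illustrative statistic: (number of ones, number of adjacent equal pairs)
g = lambda t: np.array([t.sum(), (t[:-1] == t[1:]).sum()])
stats = gibbs_sample(eta=[0.2, 0.5], g=g, n=50, n_iter=10_000,
                     rng=np.random.default_rng(0))
print(stats.mean(axis=0))   # Monte Carlo estimate of the mean-value parameters
```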


Background toolkit: Importance sampling

Suppose we have a sample $t^{(i)}$ for $i \in 1, \ldots, k$ from a distribution $p_1$ and we want to estimate the expectation of a statistic $g(T)$:
\[
E_{p_1}\big(g(T)\big) \approx \frac{1}{k} \sum_i^k g(t^{(i)})
\]
If we want to estimate the expectation under a different distribution $p_2$, we can weight the observations by the ratio of the likelihoods:
\[
E_{p_2}\big(g(T)\big) \approx \sum_i^k \omega_i\, g(t^{(i)}),
\quad\text{where}\quad
\omega_i = \frac{p_2(t^{(i)}) / p_1(t^{(i)})}{\sum_j^k p_2(t^{(j)}) / p_1(t^{(j)})}
\]
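A minimal sketch of self-normalized importance sampling with two normal densities standing in for $p_1$ and $p_2$; as the figure two slides down illustrates, the weights behave well only while $p_1$ covers $p_2$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
k = 10_000
samples = rng.normal(loc=0.0, scale=1.0, size=k)   # draws from p1 = N(0, 1)

p1, p2 = norm(0.0, 1.0), norm(1.0, 1.0)            # target p2 = N(1, 1)
ratios = p2.pdf(samples) / p1.pdf(samples)
weights = ratios / ratios.sum()                    # self-normalized omega_i

est = np.sum(weights * samples)                    # statistic g(T) = T; E_{p2}[T] = 1
print(est)
# Moving p2 far from p1 (say N(5, 1)) concentrates all the weight on a few
# draws and the estimate degrades -- the "Impossible Variance" case below.
```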


Background toolkit: Importance sampling

If $p_1 = P(T = t \mid \eta_0)$ and $p_2 = P(T = t \mid \eta)$, then the normalizing constants, and the offset terms $o(t^{(i)})$, cancel:
\[
\omega_i
= \frac{\frac{c(\eta_0)}{c(\eta)}\, e^{(\eta - \eta_0)\cdot g(t^{(i)})}}{\sum_j^k \frac{c(\eta_0)}{c(\eta)}\, e^{(\eta - \eta_0)\cdot g(t^{(j)})}}
= \frac{e^{(\eta - \eta_0)\cdot g(t^{(i)})}}{\sum_j^k e^{(\eta - \eta_0)\cdot g(t^{(j)})}}
\]
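A sketch of these simplified weights, computed stably on the log scale (they are a softmax of $(\eta - \eta_0) \cdot g(t^{(i)})$); the sampled statistics below are stand-in data.

```python
import numpy as np
from scipy.special import softmax

def importance_weights(eta, eta0, g_samples):
    """Self-normalized weights omega_i for an exponential family: a softmax
    of (eta - eta0) . g(t_i); normalizing constants and offsets cancel."""
    exponents = g_samples @ (np.asarray(eta) - np.asarray(eta0))
    return softmax(exponents)

rng = np.random.default_rng(0)
g_samples = rng.normal(size=(1000, 2))      # stand-in sampled statistics
w = importance_weights([0.1, -0.2], [0.0, 0.0], g_samples)
mu_hat = w @ g_samples                      # reweighted estimate of E_eta[g(T)]
print(w.sum(), mu_hat)                      # weights sum to 1
```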


Background toolkit: Importance sampling

Figure: Densities of $p_1$ and $p_2$ in three panels: Low Variance, Higher Variance, and Impossible Variance, with the separation between $p_1$ and $p_2$ increasing across panels.


Background toolkit: Calculating the variance

Figure: Trace plot of $g(t^{(j)})$ over 10,000 Gibbs iterations $j$.

Divide the sample into $a$ batches, each of length $b$. Choose $a = \sqrt{n}$.

Background toolkit: Calculating the variance

\[
\hat\mu_j(\eta) = \frac{\sum_{i=(j-1)b+1}^{jb} \omega_i\, g(t^{(i)})}{\sum_{i=(j-1)b+1}^{jb} \omega_i}
\quad\text{for } j = 1, \ldots, a.
\]
The MCMC batch-mean standard error is then defined as
\[
\hat\sigma_\mu(\eta) = \sqrt{\frac{b}{a-1} \sum_{j=1}^a \big(\hat\mu_j(\eta) - \hat\mu(\eta)\big)^2}.
\]
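A sketch of this batch-mean standard error for one coordinate of $g$, assuming the chain of statistics and the weights $\omega_i$ are already available (e.g. from the sketches above); the stand-in inputs are unweighted iid draws.

```python
import numpy as np

def batch_mean_se(stats, weights):
    """Weighted batch-mean standard error: split the chain into a ~ sqrt(n)
    batches of length b, form weighted batch means, and measure their spread."""
    n = len(stats)
    a = int(np.sqrt(n))                 # number of batches
    b = n // a                          # batch length (any remainder dropped)
    s, w = stats[:a * b].reshape(a, b), weights[:a * b].reshape(a, b)
    mu_j = (w * s).sum(axis=1) / w.sum(axis=1)      # weighted batch means
    mu = (weights * stats).sum() / weights.sum()    # overall weighted mean
    return np.sqrt(b / (a - 1) * ((mu_j - mu) ** 2).sum())

rng = np.random.default_rng(0)
stats = rng.normal(size=10_000)     # stand-in chain of one g(t^(j)) coordinate
weights = np.ones_like(stats)       # unweighted case: all omega_i equal
print(batch_mean_se(stats, weights))
```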


How far can we jump at each step of the MCMC-MLE?


How far to trust $\hat Z$: Effective Sample Size Restriction

Geyer-Thompson (1992): $\|\eta - \eta_0\| < \epsilon$
ergm (2010): $\hat\ell(\eta) - \hat\ell(\eta_0) < 20$


How far to trust $\hat Z$: Effective Sample Size Restriction

If we cannot estimate $\mu(\eta)$ well, then our estimate of the likelihood and its first derivative will be poor.
\[
E_\eta\big(g(T)\big) \approx \hat\mu_\eta = \sum_i \omega_i \cdot g(t_i)
\]
where $\omega_i$ are the importance weights
\[
\omega_i = \frac{e^{(\eta - \eta_0)\cdot g(t_i)}}{\sum_j e^{(\eta - \eta_0)\cdot g(t_j)}}
\]
The estimated effective sample size is
\[
\widehat{ess}(\eta) = k\,\frac{\widehat{\mathrm{var}}\big(g(T)\big)}{\hat\sigma_\mu(\eta)^2},
\]
where $\widehat{\mathrm{var}}(g(T)) = \frac{1}{k}\sum_i^k \big(g(t_i) - \hat\mu(\eta_0)\big)^2$. This motivates maximizing the likelihood subject to the constraint that $\widehat{ess}(\eta) > 4$.
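Putting the pieces together, a sketch of the estimated effective sample size for one coordinate of $g$, with the weights, batch means, and variance computed as above; the statistics matrix is stand-in data, and the formula follows the reconstruction on this slide.

```python
import numpy as np
from scipy.special import softmax

def ess_hat(eta, eta0, g_samples):
    """Estimated effective sample size at eta for the first coordinate of g,
    using importance weights and the weighted batch-mean standard error."""
    k = len(g_samples)
    w = softmax(g_samples @ (np.asarray(eta) - np.asarray(eta0)))  # omega_i
    s = g_samples[:, 0]
    a = int(np.sqrt(k))                     # a batches of length b
    b = k // a
    sb, wb = s[:a * b].reshape(a, b), w[:a * b].reshape(a, b)
    mu_j = (wb * sb).sum(axis=1) / wb.sum(axis=1)   # weighted batch means
    mu = (w * s).sum() / w.sum()                    # overall weighted mean
    sigma_mu2 = b / (a - 1) * ((mu_j - mu) ** 2).sum()
    var_hat = ((s - s.mean()) ** 2).mean()  # var-hat about the mean at eta0
    return k * var_hat / sigma_mu2

rng = np.random.default_rng(0)
g_samples = rng.normal(size=(10_000, 2))
print(ess_hat([0.0, 0.0], [0.0, 0.0], g_samples),   # roughly k at eta = eta0
      ess_hat([1.5, 0.0], [0.0, 0.0], g_samples))   # collapses away from eta0
# Trust the approximate likelihood only while ess_hat stays above the threshold.
```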


Starting Values


Starting values for the MLE algorithm

The log pseudo-likelihood is defined as
\[
\ell_p(\eta) = \sum_i \log\big(P(T_i = t_i \mid \eta, T_{-i} = t_{-i})\big)
\]
Set the starting values for the MLE algorithm to
\[
\eta_{\text{start}} = \operatorname*{argmax}_\eta\, \ell_p(\eta)
\]
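For binary data each term of $\ell_p$ is the logistic full conditional from earlier, so a general-purpose optimizer can produce $\eta_{\text{start}}$ directly. A minimal sketch under that assumption, with the same illustrative chain statistic:

```python
import numpy as np
from scipy.optimize import minimize

def g(t):   # illustrative statistic: (number of ones, adjacent equal pairs)
    return np.array([t.sum(), (t[:-1] == t[1:]).sum()])

def neg_log_pseudolik(eta, t):
    """-l_p(eta) = -sum_i log P(T_i = t_i | eta, t_{-i}) for binary t."""
    total = 0.0
    for i in range(len(t)):
        t_plus, t_minus = t.copy(), t.copy()
        t_plus[i], t_minus[i] = 1, 0
        logit = eta @ (g(t_plus) - g(t_minus))
        # log P(t_i | t_{-i}) = t_i * logit - log(1 + exp(logit))
        total += t[i] * logit - np.logaddexp(0.0, logit)
    return -total

t_obs = np.array([1, 0, 0, 1, 1, 1, 0, 1, 0, 0])
eta_start = minimize(neg_log_pseudolik, x0=np.zeros(2), args=(t_obs,)).x
print(eta_start)    # starting value for the MCMC-MLE iterations
```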


Cumulant Generating Function

\[
\begin{aligned}
\log\big(E(e^{\eta' g(T)})\big)
&= \log\big(E(1 + \eta' g(T) + (\eta' g(T))^2/2 + \ldots)\big) \\
&= \log\big(1 + E(\eta' g(T)) + E((\eta' g(T))^2)/2 + \ldots\big) \\
&= \big(E(\eta' g(T)) + E((\eta' g(T))^2)/2 + \ldots\big) \\
&\quad - \big(E(\eta' g(T)) + E((\eta' g(T))^2)/2 + \ldots\big)^2/2 + \ldots \\
&= \sum_i^\infty \frac{\kappa_i}{i!}
\;\approx\; \hat Z_m = \sum_i^m \frac{\hat\kappa_i}{i!}
\end{aligned} \tag{3}
\]
where $\kappa_i$ is the $i$th cumulant of $\eta' g(T)$, $\hat\kappa_i$ is the sample cumulant based on the MCMC sample, and $\eta' = \eta - \eta_0$.


Cumulant Generating Function

Consider the binomial model
\[
P(T = t \mid \eta) \propto e^{\eta \sum_i^k t_i}
\]
with $k = 780$. Suppose that we observe 272 successes and 508 failures. This leads to an MLE of $\hat\eta = \log(272/508) = -0.625$.
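A sketch of the truncated approximations $\hat Z_2, \hat Z_3, \hat Z_4$ from (3) against $\hat Z_\infty$ for this binomial model, using the standard sample-cumulant formulas up to order four; the draws are exact iid binomials standing in for MCMC output.

```python
import numpy as np
from math import factorial
from scipy.special import logsumexp

rng = np.random.default_rng(0)
k, eta0 = 780, -0.625
p0 = 1.0 / (1.0 + np.exp(-eta0))                 # success probability at eta0
g = rng.binomial(k, p0, size=100_000).astype(float)   # iid draws of g(t) = sum(t_i)

def z_hats(eta, eta0, g):
    """Z-hat_m for m = 2, 3, 4 via sample cumulants of s = (eta - eta0) g(t),
    plus Z-hat_inf via log-mean-exp, as in equation (3)."""
    s = (eta - eta0) * g
    d = s - s.mean()
    kappa = [s.mean(), d.var(), (d ** 3).mean(),
             (d ** 4).mean() - 3 * d.var() ** 2]      # cumulants kappa_1..kappa_4
    zm = np.cumsum([kap / factorial(i + 1) for i, kap in enumerate(kappa)])
    z_inf = logsumexp(s) - np.log(len(s))
    return zm[1:], z_inf                              # (Z2, Z3, Z4), Z_inf

print(z_hats(-0.5, eta0, g))   # the truncations approach Z_inf for eta near eta0
```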


Cumulant Generating Function

Figure: Estimated log likelihoods $\ell(\eta) - \ell(\eta_0)$ ($Z$, $\hat Z_2$, $\hat Z_3$, $\hat Z_4$, $\hat Z_\infty$) with $\eta_0 = -0.625$.

Cumulant Generating Function

Figure: Estimated log likelihoods $\ell(\eta) - \ell(\eta_0)$ ($Z$, $\hat Z_2$, $\hat Z_3$, $\hat Z_4$, $\hat Z_\infty$) with $\eta_0 = 0$.

Cumulant Generating Function

Consider the Ising model over a toroidal lattice
\[
P(X = x \mid Y = y, \eta) \propto e^{\eta^{(1)} \sum_i^k x_i + \eta^{(2)} \sum_{i,j}^k x_i y_{i,j} x_j}
\]
with $k = 16$ and data leading to an MLE of $\hat\eta = (0.2, 0.8)$.

Figure: Histograms of the sampled statistics $\sum_i x_i$ and $\sum_{i,j} x_i y_{i,j} x_j$.

Cumulant Generating Function

Figure: Estimated profile log likelihoods $\ell(\eta) - \ell(\eta_0)$ ($Z$, $\hat Z_2$, $\hat Z_3$, $\hat Z_4$, $\hat Z_\infty$) of $\eta^{(2)}$ with $\eta^{(1)} = 0.2$. Data is 100,000 draws from $\eta_0 = (0.2, 1.1)$.

Cumulant Generating Function

Figure: Estimated profile log likelihoods $\ell(\eta) - \ell(\eta_0)$ ($Z$, $\hat Z_2$, $\hat Z_3$, $\hat Z_4$, $\hat Z_\infty$) of $\eta^{(1)}$ with $\eta^{(2)} = 0.8$. Data is 100,000 draws from $\eta_0 = (0.2, 1.1)$.

Example

Sampson's Monks

 1 Ramauld (L)        10 Gregory (T)
 2 Bonaventure (L)    11 Hugh (T)
 3 Ambrose (L)        12 Boniface (T)
 4 Berthold (L)       13 Mark (T)
 5 Peter (L)          14 Albert (T)
 6 Louis (L)          15 Amand (O)
 7 Victor (L)         16 Basil (O)
 8 Winfred (T)        17 Elias (O)
 9 John (T)           18 Simplicius (O)

Figure: Relationships among monks within a monastery and their affiliations as identified by Sampson: Young (T)urks, (L)oyal Opposition, and (O)utcasts.

The Simple Homophily Model

\[
P(T = (y, x) \mid \eta) = \frac{1}{c(\eta)}\, e^{\eta_0 \sum_{i,j}^n y_{i,j} + \eta_1 h(y, x) + \sum_{l=0}^{m-2} \eta_{l+3} \sum_{i=1}^n I(x_i = l)}.
\]
\[
h(y, x) = \sum_k \sum_{i: x_i = k} \left[ \sqrt{d_{i,k}(y, x)} - E_{\perp\!\!\!\perp}\!\left(\sqrt{d_{i,k}(Y, X)} \,\middle|\, Y = y,\, n(X) = n(x)\right) \right]
\]

The Simple Homophily Model: Sampson's Monks

               η      se     z      p-value
# edges       -0.55   0.14  -3.88   0.00
Homophily      7.47   0.92   8.16   0.00
# in group 1   0.33   1.32   0.25   0.80
# in group 2  -2.42   1.55  -1.57   0.12

What about Missing Data?


The Exponential Family in the Case of Missing Data

If the data are Missing At Random:
\[
p(T_{obs} = t_{obs} \mid \eta, \theta) = \frac{1}{c(\eta)} \sum_{t_{miss}} e^{\eta \cdot g(t) + o(t)}, \tag{4}
\]
where $t_{obs}$ is the observed part of $t$ and $t_{miss}$ is the missing part.


Finding the MLE in the Case of Missing Data

\[
\ell(\eta) - \ell(\eta_0) = \log\!\left[E_{\eta_0}\!\left(e^{(\eta - \eta_0)\cdot g(T)} \,\middle|\, T_{obs} = t_{obs}\right)\right] - \log\!\left[E_{\eta_0}\!\left(e^{(\eta - \eta_0)\cdot g(T)}\right)\right].
\]
And the first derivative is
\[
\frac{\partial \ell}{\partial \eta} = E_\eta\big(g(T) \mid T_{obs} = t_{obs}\big) - E_\eta\big(g(T)\big)
\]
To approximate the likelihood we need two MCMC samples: one from $P(T = t \mid \eta_0)$ and one from the conditional distribution $P(T = t \mid T_{obs} = t_{obs}, \eta_0)$.
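A sketch of the resulting two-sample approximation: each expectation becomes a log-mean-exp over its own sample of statistics, one conditioned on $T_{obs} = t_{obs}$ and one unconditioned, both drawn at $\eta_0$; the two matrices below are stand-ins.

```python
import numpy as np
from scipy.special import logsumexp

def loglik_ratio_missing(eta, eta0, g_cond, g_uncond):
    """l(eta) - l(eta0) with missing data: the difference of two log-mean-exps,
    over draws conditioned on T_obs = t_obs (g_cond) and unconditioned draws
    (g_uncond), both sampled at eta0."""
    delta = np.asarray(eta) - np.asarray(eta0)
    lme = lambda g: logsumexp(g @ delta) - np.log(len(g))
    return lme(g_cond) - lme(g_uncond)

rng = np.random.default_rng(0)
g_cond = rng.normal(loc=0.2, size=(1000, 2))     # stand-in conditional draws
g_uncond = rng.normal(loc=0.0, size=(1000, 2))   # stand-in unconditional draws
print(loglik_ratio_missing([0.1, -0.1], [0.0, 0.0], g_cond, g_uncond))
```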


A New Latent Class Model

Term           η̂      µ̂      s.e.(η̂)  s.e.(µ̂)
# of edges    -0.58   88.23   0.14     7.48
Homophily      7.28   15.30   0.91     1.33
# in group 0  -2.50    3.95   1.44     1.08
# in group 1  -0.02    6.95   1.31     0.99

Table: Latent Class model for Sampson's monks.

Simulating from $p(X = x \mid Y = y_{obs}, \hat\eta)$ gives cluster assignments.


So there you have it...

Thank you!
