Graphical Models
Xiaojin Zhu
Department of Computer Sciences, University of Wisconsin–Madison, USA
KDD 2012 Tutorial
Outline
1 Overview
2 Definitions
    Directed Graphical Models
    Undirected Graphical Models
    Factor Graph
3 Inference
    Exact Inference
    Markov Chain Monte Carlo
    Belief Propagation
    Mean Field Algorithm
    Exponential Family and Variational Inference
    Maximizing Problems
4 Parameter Learning
5 Structure Learning
Overview

The envelope quiz
There are two envelopes: one contains a red ball ($100) and a black ball, the other contains two black balls.
You randomly picked an envelope and randomly took out a ball; it was black.
At this point, you are given the option to switch envelopes. Should you?
Random variables E ∈ {1, 0}, B ∈ {r, b}
P(E = 1) = P(E = 0) = 1/2
P(B = r | E = 1) = 1/2, P(B = r | E = 0) = 0
We ask: is P(E = 1 | B = b) ≥ 1/2?
P(E = 1 | B = b) = P(B = b | E = 1) P(E = 1) / P(B = b) = (1/2 × 1/2) / (3/4) = 1/3
Switch.
The graphical model: E → B
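As a quick sanity check, here is a tiny Python sketch I added (not from the slides) that enumerates the four (envelope, ball) outcomes and recovers P(E = 1 | B = b) = 1/3:

```python
# Minimal enumeration check of the envelope quiz (illustrative sketch).
# Envelope E=1 holds {red, black}; envelope E=0 holds {black, black}.

def joint():
    """Yield (E, ball, probability) for every (envelope, ball) outcome."""
    for e, balls in [(1, ["r", "b"]), (0, ["b", "b"])]:
        for ball in balls:
            # P(envelope) * P(ball | envelope) = 1/2 * 1/2
            yield e, ball, 0.5 * 0.5

p_b = sum(p for e, ball, p in joint() if ball == "b")                    # P(B = b) = 3/4
p_e1_and_b = sum(p for e, ball, p in joint() if e == 1 and ball == "b")  # = 1/4
print(p_e1_and_b / p_b)   # P(E = 1 | B = b) = 1/3, so switching is better
```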
Reasoning with uncertainty
The world is reduced to a set of random variables x_1, ..., x_n
  e.g. (x_1, ..., x_{n-1}) a feature vector, x_n ≡ y the class label
Inference: given the joint distribution p(x_1, ..., x_n), compute p(X_Q | X_E) where X_Q ∪ X_E ⊆ {x_1, ..., x_n}
  e.g. Q = {n}, E = {1, ..., n-1}; by the definition of conditional probability,
  p(x_n | x_1, ..., x_{n-1}) = p(x_1, ..., x_{n-1}, x_n) / Σ_v p(x_1, ..., x_{n-1}, x_n = v)
Learning: estimate p(x_1, ..., x_n) from training data X^(1), ..., X^(N), where X^(i) = (x_1^(i), ..., x_n^(i))
It is difficult to reason with uncertainty
The joint distribution p(x_1, ..., x_n):
  exponential naïve storage (2^n for binary r.v.)
  hard to interpret (conditional independence)
Inference p(X_Q | X_E):
  often can't afford to do it by brute force
If p(x_1, ..., x_n) is not given, estimate it from data:
  often can't afford to do it by brute force
Graphical models: efficient representation, inference, and learning on p(x_1, ..., x_n), exactly or approximately
Definitions
Graphical-Model-Nots
Graphical modeling is the study of probabilistic models
Just because there are nodes and edges doesn't mean it's a graphical model
These are not graphical models: neural network, decision tree, network flow, HMM template
Directed Graphical Models
Bayesian Network
Directed graphical models are also called Bayesian networks
A directed graph has nodes X = (x_1, ..., x_n), some of them connected by directed edges x_i → x_j
A cycle is a directed path x_1 → ... → x_k where x_1 = x_k
A directed acyclic graph (DAG) contains no cycles
A Bayesian network on the DAG is a family of distributions satisfying
  {p | p(X) = ∏_i p(x_i | Pa(x_i))}
where Pa(x_i) is the set of parents of x_i.
p(x_i | Pa(x_i)) is the conditional probability distribution (CPD) at x_i
By specifying the CPDs for all i, we specify a particular distribution p(X)
Example: Alarm
Binary variables; the DAG is B → A ← E with A → J and A → M, and CPDs
  P(B) = 0.001, P(E) = 0.002
  P(A | B, E) = 0.95, P(A | B, ~E) = 0.94, P(A | ~B, E) = 0.29, P(A | ~B, ~E) = 0.001
  P(J | A) = 0.9, P(J | ~A) = 0.05
  P(M | A) = 0.7, P(M | ~A) = 0.01
P(B, ~E, A, J, ~M) = P(B) P(~E) P(A | B, ~E) P(J | A) P(~M | A)
                   = 0.001 × (1 − 0.002) × 0.94 × 0.9 × (1 − 0.7) ≈ 0.000253
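To make the factorization concrete, here is a small Python sketch I added (not from the slides) that multiplies the CPD entries to reproduce the joint probability above:

```python
# Joint probability of one assignment in the Alarm network, using the BN factorization
# p(B, E, A, J, M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A).  Illustrative sketch.

P_B = {1: 0.001, 0: 0.999}
P_E = {1: 0.002, 0: 0.998}
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J = {1: 0.9, 0: 0.05}                                          # P(J=1 | A)
P_M = {1: 0.7, 0: 0.01}                                          # P(M=1 | A)

def bernoulli(p_one, value):
    """P(value) for a binary variable with P(value=1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(B, E, A, J, M):
    return (P_B[B] * P_E[E]
            * bernoulli(P_A[(B, E)], A)
            * bernoulli(P_J[A], J)
            * bernoulli(P_M[A], M))

print(joint(B=1, E=0, A=1, J=1, M=0))   # ~0.000253
```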
Example: Naive Bayes
(figure: y with children x_1, ..., x_d, and the equivalent plate representation)
p(y, x_1, ..., x_d) = p(y) ∏_{i=1}^d p(x_i | y)
Used extensively in natural language processing
Plate representation on the right
No Causality Whatsoever
Network 1: A → B with P(A) = a, P(B|A) = b, P(B|~A) = c
Network 2: B → A with P(B) = ab + (1−a)c, P(A|B) = ab / (ab + (1−a)c), P(A|~B) = a(1−b) / (1 − ab − (1−a)c)
The two BNs are equivalent in all respects
Bayesian networks imply no causality at all
They only encode the joint probability distribution (hence correlation)
However, people tend to design BNs based on causal relations
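A quick numeric check (my own sketch, with an arbitrary choice of a, b, c) confirms that the two parameterizations define the same joint distribution:

```python
# Check that the A->B and B->A parameterizations give the same joint P(A, B).
# Illustrative sketch with arbitrary numbers a, b, c.

a, b, c = 0.3, 0.8, 0.1   # P(A)=a, P(B|A)=b, P(B|~A)=c

# Reversed-network parameters, as on the slide
pB    = a * b + (1 - a) * c
pA_B  = a * b / pB
pA_nB = a * (1 - b) / (1 - pB)

for A in (0, 1):
    for B in (0, 1):
        pB_given_A = b if A else c
        # forward network: P(A) P(B|A)
        p1 = (a if A else 1 - a) * (pB_given_A if B else 1 - pB_given_A)
        # reversed network: P(B) P(A|B)
        pA_given_B = pA_B if B else pA_nB
        p2 = (pB if B else 1 - pB) * (pA_given_B if A else 1 - pA_given_B)
        print(A, B, round(p1, 6), round(p2, 6))   # the two columns match
```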
Example: Latent Dirichlet Allocation (LDA)
(plate diagram: hyperparameters α, β; topic distributions φ over T topics; per-document θ over D documents; topic z and word w for each of the N_d word positions)
A generative model for p(φ, θ, z, w | α, β):
  For each topic t: φ_t ∼ Dirichlet(β)
  For each document d: θ ∼ Dirichlet(α)
  For each word position in d: topic z ∼ Multinomial(θ), word w ∼ Multinomial(φ_z)
Inference goals: p(z | w, α, β), argmax_{φ,θ} p(φ, θ | w, α, β)
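The generative process can be written down directly; below is a minimal numpy sketch of it that I added (with made-up sizes T, D, N_d, V and hyperparameters, not code from the tutorial):

```python
# Minimal sketch of the LDA generative process with made-up sizes.
import numpy as np

rng = np.random.default_rng(0)
T, D, V = 3, 5, 20            # topics, documents, vocabulary size (arbitrary)
alpha, beta = 0.5, 0.1        # Dirichlet hyperparameters (arbitrary)
N_d = 10                      # words per document (fixed here for simplicity)

phi = rng.dirichlet([beta] * V, size=T)      # phi_t ~ Dirichlet(beta), one per topic
docs = []
for d in range(D):
    theta = rng.dirichlet([alpha] * T)       # theta ~ Dirichlet(alpha), one per document
    words = []
    for _ in range(N_d):
        z = rng.choice(T, p=theta)           # topic z ~ Multinomial(theta)
        w = rng.choice(V, p=phi[z])          # word  w ~ Multinomial(phi_z)
        words.append(w)
    docs.append(words)
print(docs[0])                               # word ids of the first generated document
```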
Some Topics by LDA on the Wish Corpus
(figure: top words under p(word | topic) for topics labeled "troops", "election", "love")
Conditional Independence
Two r.v.s A, B are independent if P(A, B) = P(A)P(B), or equivalently P(A | B) = P(A)
Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C)P(B | C), or equivalently P(A | B, C) = P(A | C)
This extends to groups of r.v.s
Conditional independence in a BN is precisely specified by d-separation ("directed separation")
d-Separation Case 1: Tail-to-Tail
(figure: A ← C → B; observed nodes are shaded)
A, B in general dependent
A, B conditionally independent given C (observed nodes are shaded)
An observed C is a tail-to-tail node; it blocks the undirected path A-B
d-Separation Case 2: Head-to-Tail
(figure: A → C → B)
A, B in general dependent
A, B conditionally independent given C
An observed C is a head-to-tail node; it blocks the path A-B
d-Separation Case 3: Head-to-Head
(figure: A → C ← B)
A, B in general independent
A, B conditionally dependent given C, or given any of C's descendants
An observed C is a head-to-head node; it unblocks the path A-B
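The head-to-head case ("explaining away") is easy to verify numerically. The sketch below is my own illustration on an arbitrary A → C ← B network with made-up probabilities:

```python
# Explaining away in a head-to-head network A -> C <- B (illustrative numbers).
from itertools import product

pA, pB = 0.1, 0.1
pC = {(1, 1): 0.99, (1, 0): 0.9, (0, 1): 0.9, (0, 0): 0.01}   # P(C=1 | A, B)

def joint(A, B, C):
    p = (pA if A else 1 - pA) * (pB if B else 1 - pB)
    return p * (pC[(A, B)] if C else 1 - pC[(A, B)])

def prob(pred):
    return sum(joint(*x) for x in product((0, 1), repeat=3) if pred(*x))

# Marginally independent: P(A=1 | B=1) equals P(A=1) = 0.1
print(prob(lambda A, B, C: A and B) / prob(lambda A, B, C: B))               # 0.1
# Once C is observed, A and B become dependent:
print(prob(lambda A, B, C: A and C) / prob(lambda A, B, C: C))               # ~0.51
print(prob(lambda A, B, C: A and B and C) / prob(lambda A, B, C: B and C))   # ~0.11: B=1 explains C away
```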
d-Separation
Any groups of nodes A and B are conditionally independent given another group C if all undirected paths from any node in A to any node in B are blocked
A path is blocked if it includes a node x such that either
  the path is head-to-tail or tail-to-tail at x and x ∈ C, or
  the path is head-to-head at x, and neither x nor any of its descendants is in C.
d-Separation Example 1
(figure: a DAG over nodes A, E, F, B, C; C is a descendant of E and is observed)
The undirected path from A to B is unblocked at E (because of the observed descendant C), and is not blocked at F
A, B dependent given C
d-Separation Example 2
(figure: the same DAG, now with F observed instead)
The path from A to B is blocked both at E and at F
A, B conditionally independent given F
Undirected Graphical Models
Markov Random Fields
Undirected graphical models are also called Markov Random Fields
The efficiency of directed graphical models (acyclic graph, locally normalized CPDs) also makes them restrictive
A clique C in an undirected graph is a fully connected set of nodes (note: full of loops!)
Define a nonnegative potential function ψ_C : X_C ↦ R^+
An undirected graphical model is a family of distributions satisfying
  {p | p(X) = (1/Z) ∏_C ψ_C(X_C)}
Z = ∫ ∏_C ψ_C(X_C) dX is the partition function
Example: A Tiny Markov Random Field
(figure: x_1 and x_2 joined by an edge, forming the single clique C)
x_1, x_2 ∈ {−1, 1}
A single clique ψ_C(x_1, x_2) = e^{a x_1 x_2}
p(x_1, x_2) = (1/Z) e^{a x_1 x_2}
Z = e^a + e^{−a} + e^{−a} + e^a = 2e^a + 2e^{−a}
p(1, 1) = p(−1, −1) = e^a / (2e^a + 2e^{−a})
p(−1, 1) = p(1, −1) = e^{−a} / (2e^a + 2e^{−a})
When the parameter a > 0, homogeneous configurations (x_1 = x_2) are favored
When the parameter a < 0, inhomogeneous configurations (x_1 ≠ x_2) are favored
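A few lines of Python (my own sketch, with an arbitrary value for a) reproduce these probabilities by brute-force normalization:

```python
# Brute-force normalization of the tiny two-node MRF (illustrative sketch).
import math

a = 1.0   # clique parameter (arbitrary)
psi = {(x1, x2): math.exp(a * x1 * x2) for x1 in (-1, 1) for x2 in (-1, 1)}
Z = sum(psi.values())                      # 2*e^a + 2*e^-a
p = {xy: v / Z for xy, v in psi.items()}
print(p[(1, 1)], p[(-1, 1)])               # e^a/Z vs e^-a/Z; with a > 0 agreement is favored
```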
Log Linear Models
Real-valued feature functions f_1(X), ..., f_k(X)
Real-valued weights w_1, ..., w_k
p(X) = (1/Z) exp( − Σ_{i=1}^k w_i f_i(X) )
Example: The Ising Model
(figure: grid nodes x_s, x_t with node parameters θ_s and edge parameters θ_st)
This is an undirected model with x ∈ {0, 1}:
  p_θ(x) = (1/Z) exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t )
As a log linear model: f_s(X) = x_s, f_st(X) = x_s x_t, with w_s = −θ_s, w_st = −θ_st
Example: Image Denoising
(figure: a noisy binary image Y; the goal is argmax_X P(X | Y))
p_θ(X | Y) = (1/Z) exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t )
where θ_s = c if y_s = 1 and θ_s = −c if y_s = 0
[From Bishop PRML]
Example: Gaussian Random Field
p(X) = N(X; µ, Σ) = 1 / ((2π)^{n/2} |Σ|^{1/2}) exp( −(1/2)(X − µ)^⊤ Σ^{−1} (X − µ) )
Multivariate Gaussian
The n × n covariance matrix Σ is positive semi-definite
Let Ω = Σ^{−1} be the precision matrix
x_i, x_j are conditionally independent given all other variables, if and only if Ω_ij = 0
When Ω_ij ≠ 0, there is an edge between x_i, x_j
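The precision-matrix reading of the graph is easy to see numerically. The sketch below is my own illustration using an arbitrary 3-node chain x_1 - x_2 - x_3, so Ω_13 = 0 while the marginal covariance of x_1 and x_3 is not zero:

```python
# Zeros in the precision matrix <=> conditional independence (Gaussian case).
# Illustrative sketch: a 3-variable chain x1 - x2 - x3, so Omega[0, 2] = 0.
import numpy as np

Omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])     # precision matrix (chain structure)
Sigma = np.linalg.inv(Omega)               # covariance matrix
print(Sigma[0, 2])                         # nonzero: x1 and x3 are marginally dependent

# Partial correlation of x1, x3 given the rest, read off the precision matrix:
rho_13_given_rest = -Omega[0, 2] / np.sqrt(Omega[0, 0] * Omega[2, 2])
print(rho_13_given_rest)                   # 0.0: conditionally independent given x2
```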
Conditional Independence in Markov Random Fields
Two groups of variables A, B are conditionally independent given another group C if A, B become disconnected by removing C and all edges involving C
(figure: a region C separating regions A and B in the graph)
Factor Graph
For both directed and undirected graphical models
Bipartite: edges between a variable node and a factor node
Factors represent computation
(figures: an undirected clique on A, B, C with potential ψ(A,B,C), and the directed model P(A)P(B)P(C|A,B), each redrawn as variable nodes A, B, C connected to a factor node f)
Inference
Exact Inference
Inference by Enumeration
Let X = (X_Q, X_E, X_O) for query, evidence, and other variables.
Infer P(X_Q | X_E)
By definition
  P(X_Q | X_E) = P(X_Q, X_E) / P(X_E) = Σ_{X_O} P(X_Q, X_E, X_O) / Σ_{X_Q, X_O} P(X_Q, X_E, X_O)
This sums an exponential number of terms: with k variables in X_O each taking r values, there are r^k terms
Details of the summing problem
There are a bunch of "other" variables x_1, ..., x_k
We sum over the r values each variable can take: Σ_{x_i = v_1}^{v_r}
This is exponential (r^k): Σ_{x_1} ... Σ_{x_k}
We want Σ_{x_1 ... x_k} p(X)
For a graphical model, the joint probability factors: p(X) = ∏_{j=1}^m f_j(X_(j))
Each factor f_j operates on a subset X_(j) ⊆ X
Idea to save computation: dynamic programming
Eliminating a Variable
Rearrange the factors in Σ_{x_1 ... x_k} f_1^− ... f_l^− f_{l+1}^+ ... f_m^+ by whether x_1 ∈ X_(j)
Obviously equivalent: Σ_{x_2 ... x_k} f_1^− ... f_l^− ( Σ_{x_1} f_{l+1}^+ ... f_m^+ )
Introduce a new factor f_{m+1}^− = Σ_{x_1} f_{l+1}^+ ... f_m^+
f_{m+1}^− contains the union of the variables in f_{l+1}^+ ... f_m^+, except x_1
In fact, x_1 disappears altogether in Σ_{x_2 ... x_k} f_1^− ... f_l^− f_{m+1}^−
Dynamic programming: compute f_{m+1}^− once, use it thereafter
Hope: f_{m+1}^− contains very few variables
Recursively eliminate the other variables in turn
Example: Chain Graph
A → B → C → D
Binary variables
Say we want P(D) = Σ_{A,B,C} P(A) P(B|A) P(C|B) P(D|C)
Let f_1(A) = P(A). Note f_1 is an array (from the CPD) of size two: P(A=0), P(A=1)
f_2(A, B) is a table of size four: P(B=0|A=0), P(B=0|A=1), P(B=1|A=0), P(B=1|A=1)
Σ_{A,B,C} f_1(A) f_2(A,B) f_3(B,C) f_4(C,D) = Σ_{B,C} f_3(B,C) f_4(C,D) ( Σ_A f_1(A) f_2(A,B) )
f_1(A) f_2(A,B) is an array of size four (match A values):
  P(A=0)P(B=0|A=0), P(A=1)P(B=0|A=1), P(A=0)P(B=1|A=0), P(A=1)P(B=1|A=1)
f_5(B) ≡ Σ_A f_1(A) f_2(A,B) reduces to an array of size two:
  P(A=0)P(B=0|A=0) + P(A=1)P(B=0|A=1), P(A=0)P(B=1|A=0) + P(A=1)P(B=1|A=1)
For this example, f_5(B) happens to be P(B)
Σ_{B,C} f_3(B,C) f_4(C,D) f_5(B) = Σ_C f_4(C,D) ( Σ_B f_3(B,C) f_5(B) ), and so on
In the end, f_7(D) is represented by P(D=0), P(D=1)
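The eliminations above translate directly into a few lines of numpy. The following is my own sketch with arbitrary CPT numbers; it computes P(D) by summing out A, then B, then C, and checks against enumeration:

```python
# Variable elimination on the chain A -> B -> C -> D with arbitrary CPTs (sketch).
import numpy as np

f1 = np.array([0.6, 0.4])                        # f1[A]    = P(A)
f2 = np.array([[0.7, 0.3], [0.2, 0.8]])          # f2[A, B] = P(B|A)
f3 = np.array([[0.9, 0.1], [0.5, 0.5]])          # f3[B, C] = P(C|B)
f4 = np.array([[0.6, 0.4], [0.3, 0.7]])          # f4[C, D] = P(D|C)

f5 = np.einsum('a,ab->b', f1, f2)                # sum out A: f5[B] = P(B)
f6 = np.einsum('b,bc->c', f5, f3)                # sum out B: f6[C] = P(C)
f7 = np.einsum('c,cd->d', f6, f4)                # sum out C: f7[D] = P(D)
print(f7, f7.sum())                              # P(D=0), P(D=1); sums to 1

# Brute-force enumeration agrees:
joint = np.einsum('a,ab,bc,cd->abcd', f1, f2, f3, f4)
print(joint.sum(axis=(0, 1, 2)))
```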
Computation for P(D) by variable elimination: 12 ×, 6 +
Enumeration: 48 ×, 14 +
The saving depends on the elimination order. Finding the optimal order is NP-hard; there are heuristic methods.
The saving depends more critically on the graph structure (tree width); inference may still be intractable
Handling Evidence
For evidence variables X_E, simply plug in their value e
Eliminate the variables X_O = X − X_E − X_Q
The final factor is the joint f(X_Q) = P(X_Q, X_E = e)
Normalize to answer the query:
  P(X_Q | X_E = e) = f(X_Q) / Σ_{X_Q} f(X_Q)
Summary: Exact Inference
Common methods:
  Enumeration
  Variable elimination
  Not covered: junction tree (aka clique tree)
Exact, but often intractable for large graphs
Markov Chain Monte Carlo
Inference by Monte Carlo
Consider the inference problem p(X_Q = c_Q | X_E) where X_Q ∪ X_E ⊆ {x_1, ..., x_n}:
  p(X_Q = c_Q | X_E) = ∫ 1_(x_Q = c_Q) p(x_Q | X_E) dx_Q
If we can draw samples x_Q^(1), ..., x_Q^(m) ∼ p(x_Q | X_E), an unbiased estimator is
  p(X_Q = c_Q | X_E) ≈ (1/m) Σ_{i=1}^m 1_(x_Q^(i) = c_Q)
The variance of the estimator decreases as O(1/m)
Inference reduces to sampling from p(x_Q | X_E)
We discuss two methods: forward sampling and Gibbs sampling
Forward Sampling: Example
(the Alarm network with the CPDs given earlier)
To generate a sample X = (B, E, A, J, M):
  1 Sample B ∼ Ber(0.001): draw r ∼ U(0, 1); if r < 0.001 then B = 1 else B = 0
  2 Sample E ∼ Ber(0.002)
  3 If B = 1 and E = 1, sample A ∼ Ber(0.95), and so on
  4 If A = 1 sample J ∼ Ber(0.9) else J ∼ Ber(0.05)
  5 If A = 1 sample M ∼ Ber(0.7) else M ∼ Ber(0.01)
Inference with Forward Sampling
Say the inference task is P(B = 1 | E = 1, M = 1)
Throw away all samples except those with (E = 1, M = 1):
  p(B = 1 | E = 1, M = 1) ≈ (1/m) Σ_{i=1}^m 1_(B^(i) = 1)
where m is the number of surviving samples
Can be highly inefficient (note P(E = 1) is tiny)
Does not work for Markov Random Fields
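Here is a short Python sketch of my own (not from the slides) of forward sampling plus rejection on the Alarm network; as noted, almost all samples are thrown away because E = 1 is rare:

```python
# Forward sampling with rejection for P(B=1 | E=1, M=1) on the Alarm network (sketch).
import random

def sample_alarm(rng):
    B = rng.random() < 0.001
    E = rng.random() < 0.002
    pA = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}[(B, E)]
    A = rng.random() < pA
    J = rng.random() < (0.9 if A else 0.05)
    M = rng.random() < (0.7 if A else 0.01)
    return B, E, A, J, M

rng = random.Random(0)
kept, hits = 0, 0
for _ in range(2_000_000):                 # many samples needed: P(E=1) is tiny
    B, E, A, J, M = sample_alarm(rng)
    if E and M:                            # keep only samples matching the evidence
        kept += 1
        hits += B
print(kept, hits / max(kept, 1))           # estimate of P(B=1 | E=1, M=1)
```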
Gibbs Sampler: Example P(B = 1 | E = 1, M = 1)
The Gibbs sampler is a Markov Chain Monte Carlo (MCMC) method.
Directly sample from p(x_Q | X_E)
Works for both directed and undirected graphical models
Initialization: fix the evidence; randomly set the other variables
  e.g. X^(0) = (B = 0, E = 1, A = 0, J = 0, M = 1)
(figure: the Alarm network with E = 1 and M = 1 clamped)
Gibbs Update
For each non-evidence variable x_i, fixing all other nodes X_{−i}, resample its value x_i ∼ P(x_i | X_{−i})
This is equivalent to x_i ∼ P(x_i | MarkovBlanket(x_i))
For a Bayesian network, MarkovBlanket(x_i) includes x_i's parents, spouses, and children:
  P(x_i | MarkovBlanket(x_i)) ∝ P(x_i | Pa(x_i)) ∏_{y ∈ C(x_i)} P(y | Pa(y))
where Pa(x) are the parents of x, and C(x) the children of x.
For many graphical models the Markov blanket is small.
For example, B ∼ P(B | E = 1, A = 0) ∝ P(B) P(A = 0 | B, E = 1)
Inference
Markov Chain Monte Carlo
Gibbs Update Say we sampled B = 1. Then X (1) = (B = 1, E = 1, A = 0, J = 0, M = 1)
Inference
Markov Chain Monte Carlo
Gibbs Update Say we sampled B = 1. Then X (1) = (B = 1, E = 1, A = 0, J = 0, M = 1) Starting from X (1) , sample A ∼ P (A | B = 1, E = 1, J = 0, M = 1) to get X (2)
Inference
Markov Chain Monte Carlo
Gibbs Update Say we sampled B = 1. Then X (1) = (B = 1, E = 1, A = 0, J = 0, M = 1) Starting from X (1) , sample A ∼ P (A | B = 1, E = 1, J = 0, M = 1) to get X (2) Move on to J, then repeat B, A, J, B, A, J . . .
Inference
Markov Chain Monte Carlo
Gibbs Update
Say we sampled B = 1. Then X^(1) = (B = 1, E = 1, A = 0, J = 0, M = 1).
Starting from X^(1), sample A ∼ P(A | B = 1, E = 1, J = 0, M = 1) to get X^(2).
Move on to J, then repeat B, A, J, B, A, J, ...
Keep all later samples. P(B = 1 | E = 1, M = 1) is the fraction of samples with B = 1.
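This update sequence is small enough to run end to end. The following Python sketch implements the Gibbs sampler for the burglary-alarm network using the CPTs from the figure; the function names, the burn-in length, and the number of sweeps are illustrative choices rather than anything prescribed by the tutorial.

import random

# CPTs from the burglary-alarm network (E and M are clamped as evidence).
P_B = 0.001
P_E = 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J = {1: 0.9, 0: 0.05}    # P(J=1 | A)
P_M = {1: 0.7, 0: 0.01}    # P(M=1 | A)

def bernoulli(p):
    return 1 if random.random() < p else 0

def resample_B(state):
    # B's Markov blanket: its child A and A's other parent E.
    w1 = P_B * (P_A[(1, state['E'])] if state['A'] else 1 - P_A[(1, state['E'])])
    w0 = (1 - P_B) * (P_A[(0, state['E'])] if state['A'] else 1 - P_A[(0, state['E'])])
    state['B'] = bernoulli(w1 / (w1 + w0))

def resample_A(state):
    # A's Markov blanket: parents B, E and children J, M.
    def weight(a):
        pa = P_A[(state['B'], state['E'])]
        w = pa if a else 1 - pa
        w *= P_J[a] if state['J'] else 1 - P_J[a]
        w *= P_M[a] if state['M'] else 1 - P_M[a]
        return w
    w1, w0 = weight(1), weight(0)
    state['A'] = bernoulli(w1 / (w1 + w0))

def resample_J(state):
    # J has no children, so its conditional is just P(J | A).
    state['J'] = bernoulli(P_J[state['A']])

def gibbs(n_sweeps=50000, burn_in=1000, seed=0):
    random.seed(seed)
    # Fix evidence E=1, M=1; initialize the rest arbitrarily, e.g. X^(0).
    state = {'B': 0, 'E': 1, 'A': 0, 'J': 0, 'M': 1}
    count_B1, kept = 0, 0
    for sweep in range(n_sweeps):
        resample_B(state)
        resample_A(state)
        resample_J(state)
        if sweep >= burn_in:
            count_B1 += state['B']
            kept += 1
    return count_B1 / kept

print("Estimated P(B=1 | E=1, M=1) ≈", gibbs())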
Markov Chain Monte Carlo
Inference
Gibbs Example 2: The Ising Model
[Figure: node x_s with its four grid neighbors A, B, C, D]
This is an undirected model with x ∈ {0, 1}:
  p_θ(x) = (1/Z) exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t )
Markov Chain Monte Carlo
Inference
Gibbs Example 2: The Ising Model
The Markov blanket of x_s is {A, B, C, D}.
Markov Chain Monte Carlo
Inference
Gibbs Example 2: The Ising Model
The Markov blanket of x_s is {A, B, C, D}.
In general, for undirected graphical models p(x_s | x_{-s}) = p(x_s | x_{N(s)}), where N(s) is the set of neighbors of s.
Markov Chain Monte Carlo
Inference
Gibbs Example 2: The Ising Model
The Markov blanket of x_s is {A, B, C, D}.
In general, for undirected graphical models p(x_s | x_{-s}) = p(x_s | x_{N(s)}), where N(s) is the set of neighbors of s.
The Gibbs update is
  p(x_s = 1 | x_{N(s)}) = 1 / ( exp(−(θ_s + ∑_{t∈N(s)} θ_st x_t)) + 1 )
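As a concrete companion to the update formula, here is a minimal Python sketch of Gibbs sampling on a grid Ising model; the grid size, the parameter values θ_s and θ_st, and the number of sweeps are made-up illustrative settings.

import math
import random

def gibbs_ising(n=10, theta_s=0.0, theta_st=0.5, sweeps=200, seed=0):
    """Gibbs sampling for an n-by-n grid Ising model with x_s in {0, 1}."""
    random.seed(seed)
    x = [[random.randint(0, 1) for _ in range(n)] for _ in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            for j in range(n):
                # Sum over the Markov blanket: the grid neighbors of (i, j).
                nb = 0
                if i > 0:     nb += x[i - 1][j]
                if i < n - 1: nb += x[i + 1][j]
                if j > 0:     nb += x[i][j - 1]
                if j < n - 1: nb += x[i][j + 1]
                # p(x_s = 1 | x_N(s)) = 1 / (exp(-(theta_s + sum_t theta_st x_t)) + 1)
                p1 = 1.0 / (math.exp(-(theta_s + theta_st * nb)) + 1.0)
                x[i][j] = 1 if random.random() < p1 else 0
    return x

sample = gibbs_ising()
print("fraction of 1s in the final sample:", sum(map(sum, sample)) / (10 * 10))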
Inference
Markov Chain Monte Carlo
Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T (X 0 | X)
(KDD 2012 Tutorial)
Graphical Models
53 / 131
Inference
Markov Chain Monte Carlo
Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T (X 0 | X) Certain Markov chains have a stationary distribution π such that π = Tπ
(KDD 2012 Tutorial)
Graphical Models
53 / 131
Inference
Markov Chain Monte Carlo
Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T (X 0 | X) Certain Markov chains have a stationary distribution π such that π = Tπ Gibbs sampler is such a Markov chain with Ti ((X−i , x0i ) | (X−i , xi )) = p(x0i | X−i ), and stationary distribution p(xQ | XE )
(KDD 2012 Tutorial)
Graphical Models
53 / 131
Inference
Markov Chain Monte Carlo
Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T (X 0 | X) Certain Markov chains have a stationary distribution π such that π = Tπ Gibbs sampler is such a Markov chain with Ti ((X−i , x0i ) | (X−i , xi )) = p(x0i | X−i ), and stationary distribution p(xQ | XE ) But it takes time for the chain to reach stationary distribution (mix)
(KDD 2012 Tutorial)
Graphical Models
53 / 131
Inference
Markov Chain Monte Carlo
Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T (X 0 | X) Certain Markov chains have a stationary distribution π such that π = Tπ Gibbs sampler is such a Markov chain with Ti ((X−i , x0i ) | (X−i , xi )) = p(x0i | X−i ), and stationary distribution p(xQ | XE ) But it takes time for the chain to reach stationary distribution (mix) Can be difficult to assert mixing
(KDD 2012 Tutorial)
Graphical Models
53 / 131
Inference
Markov Chain Monte Carlo
Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T (X 0 | X) Certain Markov chains have a stationary distribution π such that π = Tπ Gibbs sampler is such a Markov chain with Ti ((X−i , x0i ) | (X−i , xi )) = p(x0i | X−i ), and stationary distribution p(xQ | XE ) But it takes time for the chain to reach stationary distribution (mix) Can be difficult to assert mixing In practice “burn in”: discard X (0) , . . . , X (T )
(KDD 2012 Tutorial)
Graphical Models
53 / 131
Inference
Markov Chain Monte Carlo
Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T(X′ | X).
Certain Markov chains have a stationary distribution π such that π = Tπ.
The Gibbs sampler is such a Markov chain, with T_i((X_{-i}, x′_i) | (X_{-i}, x_i)) = p(x′_i | X_{-i}) and stationary distribution p(x_Q | X_E).
But it takes time for the chain to reach its stationary distribution (to mix).
It can be difficult to assess mixing.
In practice, "burn-in": discard X^(0), ..., X^(T).
Use all of X^(T+1), ... for inference (they are correlated); do not thin.
(KDD 2012 Tutorial)
Graphical Models
53 / 131
Inference
Markov Chain Monte Carlo
Collapsed Gibbs Sampling In general, Ep [f (X)] ≈
(KDD 2012 Tutorial)
1 m
Pm
i=1 f (X
(i) )
Graphical Models
if X (i) ∼ p
54 / 131
Inference
Markov Chain Monte Carlo
Collapsed Gibbs Sampling In general, Ep [f (X)] ≈
1 m
Pm
i=1 f (X
(i) )
if X (i) ∼ p
Sometimes X = (Y, Z) where Z has closed-form operations
(KDD 2012 Tutorial)
Graphical Models
54 / 131
Inference
Markov Chain Monte Carlo
Collapsed Gibbs Sampling In general, Ep [f (X)] ≈
1 m
Pm
i=1 f (X
(i) )
if X (i) ∼ p
Sometimes X = (Y, Z) where Z has closed-form operations If so, Ep [f (X)] = Ep(Y ) Ep(Z|Y ) [f (Y, Z)] m 1 X ≈ Ep(Z|Y (i) ) [f (Y (i) , Z)] m i=1
if Y (i) ∼ p(Y )
(KDD 2012 Tutorial)
Graphical Models
54 / 131
Inference
Markov Chain Monte Carlo
Collapsed Gibbs Sampling In general, Ep [f (X)] ≈
1 m
Pm
i=1 f (X
(i) )
if X (i) ∼ p
Sometimes X = (Y, Z) where Z has closed-form operations If so, Ep [f (X)] = Ep(Y ) Ep(Z|Y ) [f (Y, Z)] m 1 X ≈ Ep(Z|Y (i) ) [f (Y (i) , Z)] m i=1
if Y (i) ∼ p(Y ) No need to sample Z: it is collapsed
(KDD 2012 Tutorial)
Graphical Models
54 / 131
Inference
Markov Chain Monte Carlo
Collapsed Gibbs Sampling In general, Ep [f (X)] ≈
1 m
Pm
i=1 f (X
(i) )
if X (i) ∼ p
Sometimes X = (Y, Z) where Z has closed-form operations If so, Ep [f (X)] = Ep(Y ) Ep(Z|Y ) [f (Y, Z)] m 1 X ≈ Ep(Z|Y (i) ) [f (Y (i) , Z)] m i=1
if Y (i) ∼ p(Y ) No need to sample Z: it is collapsed Collapsed Gibbs sampler Ti ((Y−i , yi0 ) | (Y−i , yi )) = p(yi0 | Y−i )
(KDD 2012 Tutorial)
Graphical Models
54 / 131
Inference
Markov Chain Monte Carlo
Collapsed Gibbs Sampling
In general, E_p[f(X)] ≈ (1/m) ∑_{i=1}^m f(X^(i)) if X^(i) ∼ p.
Sometimes X = (Y, Z) where Z has closed-form operations.
If so,
  E_p[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] ≈ (1/m) ∑_{i=1}^m E_{p(Z|Y^(i))}[f(Y^(i), Z)]   if Y^(i) ∼ p(Y)
No need to sample Z: it is collapsed.
Collapsed Gibbs sampler: T_i((Y_{-i}, y′_i) | (Y_{-i}, y_i)) = p(y′_i | Y_{-i})
Note p(y′_i | Y_{-i}) = ∫ p(y′_i, Z | Y_{-i}) dZ
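A toy Python sketch (not from the tutorial) makes the collapsing idea concrete: take X = (Y, Z) with Y ∼ Bernoulli(0.3) and Z | Y ∼ Bernoulli(0.2 + 0.6Y), and estimate E[Z]. Since E[Z | Y] is available in closed form, Z never needs to be sampled; the collapsed estimator uses the closed-form conditional mean instead of a sampled Z and typically has lower variance.

import random

random.seed(0)
p_y = 0.3
def p_z_given_y(y):            # closed-form conditional, so Z can be collapsed
    return 0.2 + 0.6 * y

m = 100000
naive, collapsed = 0.0, 0.0
for _ in range(m):
    y = 1 if random.random() < p_y else 0
    z = 1 if random.random() < p_z_given_y(y) else 0
    naive += z                       # uses a sampled Z
    collapsed += p_z_given_y(y)      # uses E[Z | Y] instead of a sample of Z

print("naive MC estimate of E[Z]:    ", naive / m)
print("collapsed (Rao-Blackwellized):", collapsed / m)
print("exact value:                  ", p_y * 0.8 + (1 - p_y) * 0.2)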
Markov Chain Monte Carlo
Inference
Example: Collapsed Gibbs Sampling for LDA
[Figure: LDA plate diagram with topic-word distributions φ (T plates, prior β), document-topic distributions θ (D plates, prior α), topic assignments z, and words w (N_d plates)]
Collapse θ, φ; Gibbs update:
  P(z_i = j | z_{-i}, w) ∝ [ (n^(w_i)_{-i,j} + β) / (n^(·)_{-i,j} + Wβ) ] · [ (n^(d_i)_{-i,j} + α) / (n^(d_i)_{-i,·} + Tα) ]
n^(w_i)_{-i,j}: number of times word w_i has been assigned to topic j, excluding the current position
Inference
Markov Chain Monte Carlo
Example: Collapsed Gibbs Sampling for LDA Collapse θ, φ, Gibbs update: (w )
P (zi = j | z−i , w) ∝
(d )
i i n−i,j + βn−i,j +α
(·)
(d )
i n−i,j + W βn−i,· + Tα
(w )
i n−i,j : number of times word wi has been assigned to topic j, excluding the current position
(d )
i n−i,j : number of times a word from document di has been assigned to topic j, excluding the current position
(KDD 2012 Tutorial)
Graphical Models
55 / 131
Inference
Markov Chain Monte Carlo
Example: Collapsed Gibbs Sampling for LDA Collapse θ, φ, Gibbs update: (w )
P (zi = j | z−i , w) ∝
(d )
i i n−i,j + βn−i,j +α
(·)
(d )
i n−i,j + W βn−i,· + Tα
(w )
i n−i,j : number of times word wi has been assigned to topic j, excluding the current position
(d )
i n−i,j : number of times a word from document di has been assigned to topic j, excluding the current position
(·)
n−i,j : number of times any word has been assigned to topic j, excluding the current position
(KDD 2012 Tutorial)
Graphical Models
55 / 131
Inference
Markov Chain Monte Carlo
Example: Collapsed Gibbs Sampling for LDA
Collapse θ, φ; Gibbs update:
  P(z_i = j | z_{-i}, w) ∝ [ (n^(w_i)_{-i,j} + β) / (n^(·)_{-i,j} + Wβ) ] · [ (n^(d_i)_{-i,j} + α) / (n^(d_i)_{-i,·} + Tα) ]
n^(w_i)_{-i,j}: number of times word w_i has been assigned to topic j, excluding the current position
n^(d_i)_{-i,j}: number of times a word from document d_i has been assigned to topic j, excluding the current position
n^(·)_{-i,j}: number of times any word has been assigned to topic j, excluding the current position
n^(d_i)_{-i,·}: length of document d_i, excluding the current position
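A minimal Python sketch of this collapsed Gibbs sampler is given below. The count arrays mirror the n's defined above; the toy corpus, the hyperparameter values, and the number of iterations are illustrative assumptions, not part of the tutorial.

import random

def collapsed_gibbs_lda(docs, W, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of documents, each a list of word ids. Returns topic assignments z."""
    random.seed(seed)
    n_wt = [[0] * T for _ in range(W)]          # n^(w)_j : word-topic counts
    n_dt = [[0] * T for _ in range(len(docs))]  # n^(d)_j : document-topic counts
    n_t = [0] * T                               # n^(.)_j : topic totals
    z = [[0] * len(d) for d in docs]

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = random.randrange(T)
            z[d][i] = t
            n_wt[w][t] += 1; n_dt[d][t] += 1; n_t[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Exclude the current position (the "-i" in the update).
                n_wt[w][t] -= 1; n_dt[d][t] -= 1; n_t[t] -= 1
                # P(z_i = j | z_-i, w); the document-length denominator is
                # constant in j and can be dropped.
                weights = [(n_wt[w][j] + beta) / (n_t[j] + W * beta)
                           * (n_dt[d][j] + alpha) for j in range(T)]
                r = random.uniform(0, sum(weights))
                acc, t_new = 0.0, T - 1
                for j, wgt in enumerate(weights):
                    acc += wgt
                    if r <= acc:
                        t_new = j
                        break
                z[d][i] = t_new
                n_wt[w][t_new] += 1; n_dt[d][t_new] += 1; n_t[t_new] += 1
    return z

# Tiny toy corpus: word ids in a vocabulary of size W=4, with T=2 topics.
docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 3, 2]]
print(collapsed_gibbs_lda(docs, W=4, T=2))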
Inference
Markov Chain Monte Carlo
Summary: Markov Chain Monte Carlo
Forward sampling
(KDD 2012 Tutorial)
Graphical Models
56 / 131
Inference
Markov Chain Monte Carlo
Summary: Markov Chain Monte Carlo
Forward sampling Gibbs sampling
(KDD 2012 Tutorial)
Graphical Models
56 / 131
Inference
Markov Chain Monte Carlo
Summary: Markov Chain Monte Carlo
Forward sampling Gibbs sampling Collapsed Gibbs sampling
(KDD 2012 Tutorial)
Graphical Models
56 / 131
Inference
Markov Chain Monte Carlo
Summary: Markov Chain Monte Carlo
Forward sampling Gibbs sampling Collapsed Gibbs sampling Not covered: block Gibbs, Metropolis-Hastings, etc.
(KDD 2012 Tutorial)
Graphical Models
56 / 131
Inference
Markov Chain Monte Carlo
Summary: Markov Chain Monte Carlo
Forward sampling Gibbs sampling Collapsed Gibbs sampling Not covered: block Gibbs, Metropolis-Hastings, etc. Unbiased (after burn-in), but can have high variance
(KDD 2012 Tutorial)
Graphical Models
56 / 131
Inference
Belief Propagation
Outline 1
Overview
2
Definitions Directed Graphical Models Undirected Graphical Models Factor Graph
3
Inference Exact Inference Markov Chain Monte Carlo Belief Propagation Mean Field Algorithm Exponential Family and Variational Inference Maximizing Problems
4
Parameter Learning
5
Structure Learning (KDD 2012 Tutorial)
Graphical Models
57 / 131
Inference
Belief Propagation
The Sum-Product Algorithm
Also known as belief propagation (BP)
(KDD 2012 Tutorial)
Graphical Models
58 / 131
Inference
Belief Propagation
The Sum-Product Algorithm
Also known as belief propagation (BP) Exact if the graph is a tree; otherwise known as “loopy BP”, approximate
(KDD 2012 Tutorial)
Graphical Models
58 / 131
Inference
Belief Propagation
The Sum-Product Algorithm
Also known as belief propagation (BP) Exact if the graph is a tree; otherwise known as “loopy BP”, approximate The algorithm involves passing messages on the factor graph
(KDD 2012 Tutorial)
Graphical Models
58 / 131
Inference
Belief Propagation
The Sum-Product Algorithm
Also known as belief propagation (BP) Exact if the graph is a tree; otherwise known as “loopy BP”, approximate The algorithm involves passing messages on the factor graph Alternative view: variational approximation (more later)
(KDD 2012 Tutorial)
Graphical Models
58 / 131
Inference
Belief Propagation
Example: A Simple HMM
The Hidden Markov Model template (not a graphical model)
[Figure: two-state HMM with states 1 and 2; transition probabilities 1/4 and 1/2 are shown on the arcs]
P(x | z=1) = (1/2, 1/4, 1/4) over (R, G, B)
P(x | z=2) = (1/4, 1/2, 1/4) over (R, G, B)
π_1 = π_2 = 1/2
Inference
Belief Propagation
Example: A Simple HMM
Observing x1 = R, x2 = G, the directed graphical model
(KDD 2012 Tutorial)
z1
z2
x1=R
x2=G
Graphical Models
60 / 131
Inference
Belief Propagation
Example: A Simple HMM
Observing x1 = R, x2 = G, the directed graphical model is the chain z1 → z2 with observed children x1 = R and x2 = G.
Factor graph: f1 — z1 — f2 — z2, with f1(z1) = P(z1) P(x1 | z1) and f2(z1, z2) = P(z2 | z1) P(x2 | z2)
Inference
Belief Propagation
Messages
A message is a vector of length K, where K is the number of values x takes.
(KDD 2012 Tutorial)
Graphical Models
61 / 131
Inference
Belief Propagation
Messages
A message is a vector of length K, where K is the number of values x takes. There are two types of messages:
(KDD 2012 Tutorial)
Graphical Models
61 / 131
Inference
Belief Propagation
Messages
A message is a vector of length K, where K is the number of values x takes. There are two types of messages: 1
µf →x : message from a factor node f to a variable node x µf →x (i) is the ith element, i = 1 . . . K.
(KDD 2012 Tutorial)
Graphical Models
61 / 131
Inference
Belief Propagation
Messages
A message is a vector of length K, where K is the number of values x takes. There are two types of messages: 1
2
µf →x : message from a factor node f to a variable node x µf →x (i) is the ith element, i = 1 . . . K. µx→f : message from a variable node x to a factor node f
(KDD 2012 Tutorial)
Graphical Models
61 / 131
Inference
Belief Propagation
Leaf Messages Assume tree factor graph. Pick an arbitrary root, say z2
f1
z1
f2
P(z1)P(x1| z1)
z2
P(z2| z1)P(x2| z2) 1/4
1/2
1
2
P(x | z=1)=(1/2, 1/4, 1/4) R G B
P(x | z=2)=(1/4, 1/2, 1/4) R G B
π 1 = π 2 = 1/2 (KDD 2012 Tutorial)
Graphical Models
62 / 131
Inference
Belief Propagation
Leaf Messages Assume tree factor graph. Pick an arbitrary root, say z2 Start messages at leaves.
f1
z1
f2
P(z1)P(x1| z1)
z2
P(z2| z1)P(x2| z2) 1/4
1/2
1
2
P(x | z=1)=(1/2, 1/4, 1/4) R G B
P(x | z=2)=(1/4, 1/2, 1/4) R G B
π 1 = π 2 = 1/2 (KDD 2012 Tutorial)
Graphical Models
62 / 131
Inference
Belief Propagation
Leaf Messages Assume tree factor graph. Pick an arbitrary root, say z2 Start messages at leaves. If a leaf is a factor node f , µf →x (x) = f (x) µf1 →z1 (z1 = 1) = P (z1 = 1)P (R|z1 = 1) = 1/2 · 1/2 = 1/4 µf1 →z1 (z1 = 2) = P (z1 = 2)P (R|z1 = 2) = 1/2 · 1/4 = 1/8
f1
z1
f2
P(z1)P(x1| z1)
z2
P(z2| z1)P(x2| z2) 1/4
1/2
1
2
P(x | z=1)=(1/2, 1/4, 1/4) R G B
P(x | z=2)=(1/4, 1/2, 1/4) R G B
π 1 = π 2 = 1/2 (KDD 2012 Tutorial)
Graphical Models
62 / 131
Inference
Belief Propagation
Leaf Messages
Assume a tree factor graph. Pick an arbitrary root, say z2.
Start messages at the leaves.
If a leaf is a factor node f, µ_{f→x}(x) = f(x):
  µ_{f1→z1}(z1 = 1) = P(z1 = 1) P(R | z1 = 1) = 1/2 · 1/2 = 1/4
  µ_{f1→z1}(z1 = 2) = P(z1 = 2) P(R | z1 = 2) = 1/2 · 1/4 = 1/8
If a leaf is a variable node x, µ_{x→f}(x) = 1
Inference
Belief Propagation
Message from Variable to Factor A node (factor or variable) can send out a message if all other incoming messages have arrived
f1
z1
P(z1)P(x1| z1)
(KDD 2012 Tutorial)
f2
z2
P(z2| z1)P(x2| z2) 1/4
1/2
1
2
P(x | z=1)=(1/2, 1/4, 1/4) P(x | z=2)=(1/4, 1/2, 1/4) Models R G Graphical B R G B
63 / 131
Inference
Belief Propagation
Message from Variable to Factor
A node (factor or variable) can send out a message once all of its other incoming messages have arrived.
Let x be in factor f_s, and let ne(x)\f_s be the factors connected to x, excluding f_s:
  µ_{x→f_s}(x) = ∏_{f ∈ ne(x)\f_s} µ_{f→x}(x)
In the example: µ_{z1→f2}(z1 = 1) = 1/4, µ_{z1→f2}(z1 = 2) = 1/8
Inference
Belief Propagation
Message from Factor to Variable Let x be in factor fs . Let the other variables in fs be x1:M . µfs →x (x) =
X
...
x1
f1 P(z1)P(x1| z1) (KDD 2012 Tutorial)
X
fs (x, x1 , . . . , xM )
xM
z1
M Y
µxm →fs (xm )
m=1
f2
z2
P(z2| z1)P(x2| z2) Graphical Models
64 / 131
Inference
Belief Propagation
Message from Factor to Variable Let x be in factor fs . Let the other variables in fs be x1:M . µfs →x (x) =
X
...
x1
X
fs (x, x1 , . . . , xM )
xM
M Y
µxm →fs (xm )
m=1
In this example µf2 →z2 (s) =
2 X
µz1 →f2 (s0 )f2 (z1 = s0 , z2 = s)
s0 =1
= 1/4P (z2 = s|z1 = 1)P (x2 = G|z2 = s) +1/8P (z2 = s|z1 = 2)P (x2 = G|z2 = s)
f1 P(z1)P(x1| z1) (KDD 2012 Tutorial)
z1
f2
z2
P(z2| z1)P(x2| z2) Graphical Models
64 / 131
Inference
Belief Propagation
Message from Factor to Variable
Let x be in factor f_s, and let the other variables in f_s be x_{1:M}:
  µ_{f_s→x}(x) = ∑_{x_1} ... ∑_{x_M} f_s(x, x_1, ..., x_M) ∏_{m=1}^M µ_{x_m→f_s}(x_m)
In this example
  µ_{f2→z2}(s) = ∑_{s′=1}^{2} µ_{z1→f2}(s′) f2(z1 = s′, z2 = s)
              = 1/4 · P(z2 = s | z1 = 1) P(x2 = G | z2 = s) + 1/8 · P(z2 = s | z1 = 2) P(x2 = G | z2 = s)
We get µ_{f2→z2}(z2 = 1) = 1/32, µ_{f2→z2}(z2 = 2) = 1/8
Inference
Belief Propagation
Up to Root, Back Down
The message has reached the root; now pass messages back down:
  µ_{z2→f2}(z2 = 1) = 1,  µ_{z2→f2}(z2 = 2) = 1
Inference
Belief Propagation
Keep Passing Down P µf2 →z1 (s) = 2s0 =1 µz2 →f2 (s0 )f2 (z1 = s, z2 = s0 ) = 1P (z2 = 1|z1 = s)P (x2 = G|z2 = 1) + 1P (z2 = 2|z1 = s)P (x2 = G|z2 = 2).
f1
z1
f2
P(z1)P(x1| z1)
z2
P(z2| z1)P(x2| z2) 1/4
1/2
1
2
P(x | z=1)=(1/2, 1/4, 1/4) R G B
P(x | z=2)=(1/4, 1/2, 1/4) R G B
π 1 = π 2 = 1/2
(KDD 2012 Tutorial)
Graphical Models
66 / 131
Inference
Belief Propagation
Keep Passing Down
  µ_{f2→z1}(s) = ∑_{s′=1}^{2} µ_{z2→f2}(s′) f2(z1 = s, z2 = s′)
              = 1 · P(z2 = 1 | z1 = s) P(x2 = G | z2 = 1) + 1 · P(z2 = 2 | z1 = s) P(x2 = G | z2 = 2)
We get µ_{f2→z1}(z1 = 1) = 7/16, µ_{f2→z1}(z1 = 2) = 3/8
Inference
Belief Propagation
From Messages to Marginals Once a variable receives all incoming messages, we compute its marginal as Y p(x) ∝ µf →x (x) f ∈ne(x)
(KDD 2012 Tutorial)
Graphical Models
67 / 131
Inference
Belief Propagation
From Messages to Marginals Once a variable receives all incoming messages, we compute its marginal as Y p(x) ∝ µf →x (x) f ∈ne(x)
In this example 1/4 P (z1 |x1 , x2 ) ∝ µf1 →z1 · µf2 →z1 = 1/8 · 1/32 P (z2 |x1 , x2 ) ∝ µf2 →z2 = 1/8 ⇒ 0.2 0.8
(KDD 2012 Tutorial)
Graphical Models
7/16 3/8
=
7/64 3/64
⇒
0.7 0.3
67 / 131
Inference
Belief Propagation
From Messages to Marginals
Once a variable has received all of its incoming messages, we compute its marginal as
  p(x) ∝ ∏_{f ∈ ne(x)} µ_{f→x}(x)
In this example
  P(z1 | x1, x2) ∝ µ_{f1→z1} · µ_{f2→z1} = (1/4 · 7/16, 1/8 · 3/8) = (7/64, 3/64) ⇒ (0.7, 0.3)
  P(z2 | x1, x2) ∝ µ_{f2→z2} = (1/32, 1/8) ⇒ (0.2, 0.8)
One can also compute the marginal of the set of variables x_s involved in a factor f_s:
  p(x_s) ∝ f_s(x_s) ∏_{x ∈ ne(f_s)} µ_{x→f_s}(x)
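The entire message-passing run on this two-variable chain fits in a short Python sketch that reproduces the numbers above. The emission distributions and the prior come from the example; the transition probabilities (1/4 and 3/4 out of state 1, 1/2 and 1/2 out of state 2) are the values implied by the figure and by the message values computed on the slides.

# Sum-product on the factor graph f1 - z1 - f2 - z2 with evidence x1=R, x2=G.
prior = [0.5, 0.5]                       # pi_1 = pi_2 = 1/2
emit = {1: {'R': 0.5, 'G': 0.25, 'B': 0.25},
        2: {'R': 0.25, 'G': 0.5, 'B': 0.25}}
trans = {1: {1: 0.25, 2: 0.75}, 2: {1: 0.5, 2: 0.5}}   # P(z2 | z1), inferred from the example

x1, x2 = 'R', 'G'

# f1(z1) = P(z1) P(x1 | z1);  f2(z1, z2) = P(z2 | z1) P(x2 | z2)
f1 = {s: prior[s - 1] * emit[s][x1] for s in (1, 2)}
f2 = {(s, t): trans[s][t] * emit[t][x2] for s in (1, 2) for t in (1, 2)}

# Leaf-to-root messages (root = z2).
mu_f1_z1 = f1                                            # factor leaf message
mu_z1_f2 = mu_f1_z1                                      # variable with one other neighbor
mu_f2_z2 = {t: sum(mu_z1_f2[s] * f2[(s, t)] for s in (1, 2)) for t in (1, 2)}

# Root-to-leaf messages.
mu_z2_f2 = {t: 1.0 for t in (1, 2)}
mu_f2_z1 = {s: sum(mu_z2_f2[t] * f2[(s, t)] for t in (1, 2)) for s in (1, 2)}

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

p_z1 = normalize({s: mu_f1_z1[s] * mu_f2_z1[s] for s in (1, 2)})
p_z2 = normalize(mu_f2_z2)
print("P(z1 | x1, x2) =", p_z1)   # -> {1: 0.7, 2: 0.3}
print("P(z2 | x1, x2) =", p_z2)   # -> {1: 0.2, 2: 0.8}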
Inference
Belief Propagation
Handling Evidence Observing x = v,
(KDD 2012 Tutorial)
Graphical Models
68 / 131
Inference
Belief Propagation
Handling Evidence Observing x = v, we can absorb it in the factor (as we did); or
(KDD 2012 Tutorial)
Graphical Models
68 / 131
Inference
Belief Propagation
Handling Evidence Observing x = v, we can absorb it in the factor (as we did); or set messages µx→f (x) = 0 for all x 6= v
(KDD 2012 Tutorial)
Graphical Models
68 / 131
Inference
Belief Propagation
Handling Evidence Observing x = v, we can absorb it in the factor (as we did); or set messages µx→f (x) = 0 for all x 6= v
Observing XE ,
(KDD 2012 Tutorial)
Graphical Models
68 / 131
Inference
Belief Propagation
Handling Evidence Observing x = v, we can absorb it in the factor (as we did); or set messages µx→f (x) = 0 for all x 6= v
Observing XE , multiplying the incoming messages to x ∈ / XE gives the joint (not p(x|XE )): Y p(x, XE ) ∝ µf →x (x) f ∈ne(x)
(KDD 2012 Tutorial)
Graphical Models
68 / 131
Inference
Belief Propagation
Handling Evidence
Observing x = v, we can absorb it into the factor (as we did above), or set the messages µ_{x→f}(x) = 0 for all x ≠ v.
Observing X_E, multiplying the incoming messages at a node x ∉ X_E gives the joint (not p(x | X_E)):
  p(x, X_E) ∝ ∏_{f ∈ ne(x)} µ_{f→x}(x)
The conditional is easily obtained by normalization:
  p(x | X_E) = p(x, X_E) / ∑_{x′} p(x′, X_E)
Inference
Belief Propagation
Loopy Belief Propagation
So far, we assumed a tree graph
(KDD 2012 Tutorial)
Graphical Models
69 / 131
Inference
Belief Propagation
Loopy Belief Propagation
So far, we assumed a tree graph When the factor graph contains loops, pass messages indefinitely until convergence
(KDD 2012 Tutorial)
Graphical Models
69 / 131
Inference
Belief Propagation
Loopy Belief Propagation
So far, we assumed a tree graph When the factor graph contains loops, pass messages indefinitely until convergence But convergence may not happen
(KDD 2012 Tutorial)
Graphical Models
69 / 131
Inference
Belief Propagation
Loopy Belief Propagation
So far, we assumed a tree graph When the factor graph contains loops, pass messages indefinitely until convergence But convergence may not happen But in many cases loopy BP still works well, empirically
(KDD 2012 Tutorial)
Graphical Models
69 / 131
Inference
Mean Field Algorithm
Outline 1
Overview
2
Definitions Directed Graphical Models Undirected Graphical Models Factor Graph
3
Inference Exact Inference Markov Chain Monte Carlo Belief Propagation Mean Field Algorithm Exponential Family and Variational Inference Maximizing Problems
4
Parameter Learning
5
Structure Learning (KDD 2012 Tutorial)
Graphical Models
70 / 131
Inference
Mean Field Algorithm
Example: The Ising Model
[Figure: neighboring nodes x_s and x_t with node parameter θ_s and edge parameter θ_st]
The random variables x take values in {0, 1}:
  p_θ(x) = (1/Z) exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t )
Inference
Mean Field Algorithm
The Conditional θs xs
θst
xt
Markovian: the conditional distribution for xs is p(xs | x−s ) = p(xs | xN (s) ) N (s) is the neighbors of s.
(KDD 2012 Tutorial)
Graphical Models
72 / 131
Inference
Mean Field Algorithm
The Conditional
Markovian: the conditional distribution for x_s is p(x_s | x_{-s}) = p(x_s | x_{N(s)}), where N(s) is the set of neighbors of s.
This reduces to (recall Gibbs sampling)
  p(x_s = 1 | x_{N(s)}) = 1 / ( exp(−(θ_s + ∑_{t∈N(s)} θ_st x_t)) + 1 )
Inference
Mean Field Algorithm
The Mean Field Algorithm for Ising Model Gibbs sampling would draw xs from p(xs = 1 | xN (s) ) =
(KDD 2012 Tutorial)
1 exp(−(θs +
Graphical Models
P
t∈N (s) θst xt ))
+1
73 / 131
Inference
Mean Field Algorithm
The Mean Field Algorithm for Ising Model Gibbs sampling would draw xs from p(xs = 1 | xN (s) ) =
1 exp(−(θs +
P
t∈N (s) θst xt ))
+1
Instead, let µs be the estimated marginal p(xs = 1)
(KDD 2012 Tutorial)
Graphical Models
73 / 131
Inference
Mean Field Algorithm
The Mean Field Algorithm for Ising Model Gibbs sampling would draw xs from p(xs = 1 | xN (s) ) =
1 exp(−(θs +
P
t∈N (s) θst xt ))
+1
Instead, let µs be the estimated marginal p(xs = 1) Mean field algorithm: µs ←
(KDD 2012 Tutorial)
1 exp(−(θs +
P
t∈N (s) θst µt ))
Graphical Models
+1
73 / 131
Inference
Mean Field Algorithm
The Mean Field Algorithm for Ising Model Gibbs sampling would draw xs from p(xs = 1 | xN (s) ) =
1 exp(−(θs +
P
t∈N (s) θst xt ))
+1
Instead, let µs be the estimated marginal p(xs = 1) Mean field algorithm: µs ←
1 exp(−(θs +
P
t∈N (s) θst µt ))
+1
The µ’s are updated iteratively
(KDD 2012 Tutorial)
Graphical Models
73 / 131
Inference
Mean Field Algorithm
The Mean Field Algorithm for the Ising Model
Gibbs sampling would draw x_s from
  p(x_s = 1 | x_{N(s)}) = 1 / ( exp(−(θ_s + ∑_{t∈N(s)} θ_st x_t)) + 1 )
Instead, let µ_s be the estimated marginal p(x_s = 1).
Mean field update:
  µ_s ← 1 / ( exp(−(θ_s + ∑_{t∈N(s)} θ_st µ_t)) + 1 )
The µ's are updated iteratively.
The mean field algorithm is coordinate ascent, and is guaranteed to converge to a local optimum (more later).
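For comparison with the Gibbs sketch earlier, here is a small Python implementation of these coordinate-ascent updates on a grid Ising model; the grid size, parameter values, and iteration count are again illustrative assumptions.

import math

def mean_field_ising(n=10, theta_s=0.0, theta_st=0.5, iters=100):
    """Mean field updates mu_s <- sigmoid(theta_s + sum_t theta_st * mu_t) on an n-by-n grid."""
    mu = [[0.5] * n for _ in range(n)]
    for _ in range(iters):
        for i in range(n):
            for j in range(n):
                total = theta_s
                if i > 0:     total += theta_st * mu[i - 1][j]
                if i < n - 1: total += theta_st * mu[i + 1][j]
                if j > 0:     total += theta_st * mu[i][j - 1]
                if j < n - 1: total += theta_st * mu[i][j + 1]
                mu[i][j] = 1.0 / (math.exp(-total) + 1.0)   # the update above
    return mu

mu = mean_field_ising()
print("estimated marginal p(x_s = 1) at the center node:", mu[5][5])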
Inference
Exponential Family and Variational Inference
Outline 1
Overview
2
Definitions Directed Graphical Models Undirected Graphical Models Factor Graph
3
Inference Exact Inference Markov Chain Monte Carlo Belief Propagation Mean Field Algorithm Exponential Family and Variational Inference Maximizing Problems
4
Parameter Learning
5
Structure Learning (KDD 2012 Tutorial)
Graphical Models
74 / 131
Inference
Exponential Family and Variational Inference
Exponential Family
Let φ(X) = (φ1 (X), . . . , φd (X))> be d sufficient statistics, where φi : X 7→ R
(KDD 2012 Tutorial)
Graphical Models
75 / 131
Inference
Exponential Family and Variational Inference
Exponential Family
Let φ(X) = (φ1 (X), . . . , φd (X))> be d sufficient statistics, where φi : X 7→ R Note X is all the nodes in a Graphical model
(KDD 2012 Tutorial)
Graphical Models
75 / 131
Inference
Exponential Family and Variational Inference
Exponential Family
Let φ(X) = (φ1 (X), . . . , φd (X))> be d sufficient statistics, where φi : X 7→ R Note X is all the nodes in a Graphical model φi (X) sometimes called a feature function
(KDD 2012 Tutorial)
Graphical Models
75 / 131
Inference
Exponential Family and Variational Inference
Exponential Family
Let φ(X) = (φ1 (X), . . . , φd (X))> be d sufficient statistics, where φi : X 7→ R Note X is all the nodes in a Graphical model φi (X) sometimes called a feature function Let θ = (θ1 , . . . , θd )> ∈ Rd be canonical parameters.
(KDD 2012 Tutorial)
Graphical Models
75 / 131
Inference
Exponential Family and Variational Inference
Exponential Family
Let φ(X) = (φ_1(X), ..., φ_d(X))^⊤ be d sufficient statistics, where φ_i : X → R.
Note X is all the nodes in a graphical model.
φ_i(X) is sometimes called a feature function.
Let θ = (θ_1, ..., θ_d)^⊤ ∈ R^d be the canonical parameters.
The exponential family is the family of probability densities:
  p_θ(x) = exp( θ^⊤ φ(x) − A(θ) )
(KDD 2012 Tutorial)
Graphical Models
75 / 131
Inference
Exponential Family and Variational Inference
Exponential Family pθ (x) = exp θ> φ(x) − A(θ) Inner product between parameters θ and sufficient statistics φ. pθ (x) > pθ (y) > pθ (z) φ( ) x
θ
φ( y)
φ( z)
(KDD 2012 Tutorial)
Graphical Models
76 / 131
Inference
Exponential Family and Variational Inference
Exponential Family pθ (x) = exp θ> φ(x) − A(θ) Inner product between parameters θ and sufficient statistics φ. pθ (x) > pθ (y) > pθ (z) φ( ) x
θ
φ( y)
φ( z)
A is the log partition function, Z A(θ) = log exp θ> φ(x) ν(dx)
(KDD 2012 Tutorial)
Graphical Models
76 / 131
Inference
Exponential Family and Variational Inference
Exponential Family
  p_θ(x) = exp( θ^⊤ φ(x) − A(θ) )
Inner product between the parameters θ and the sufficient statistics φ.
[Figure: the direction θ with the projections of φ(x), φ(y), φ(z); here p_θ(x) > p_θ(y) > p_θ(z)]
A is the log partition function,
  A(θ) = log ∫ exp( θ^⊤ φ(x) ) ν(dx)
i.e. A = log Z.
Inference
Exponential Family and Variational Inference
Minimal vs. Overcomplete Models
Parameters for which the density is normalizable: Ω = {θ ∈ Rd | A(θ) < ∞}
(KDD 2012 Tutorial)
Graphical Models
77 / 131
Inference
Exponential Family and Variational Inference
Minimal vs. Overcomplete Models
Parameters for which the density is normalizable: Ω = {θ ∈ Rd | A(θ) < ∞} A minimal exponential family is where the φ’s are linearly independent.
(KDD 2012 Tutorial)
Graphical Models
77 / 131
Inference
Exponential Family and Variational Inference
Minimal vs. Overcomplete Models
Parameters for which the density is normalizable: Ω = {θ ∈ Rd | A(θ) < ∞} A minimal exponential family is where the φ’s are linearly independent. An overcomplete exponential family is where the φ’s are linearly dependent: ∃α ∈ Rd , α> φ(x) = constant ∀x
(KDD 2012 Tutorial)
Graphical Models
77 / 131
Inference
Exponential Family and Variational Inference
Minimal vs. Overcomplete Models
Parameters for which the density is normalizable: Ω = {θ ∈ Rd | A(θ) < ∞} A minimal exponential family is where the φ’s are linearly independent. An overcomplete exponential family is where the φ’s are linearly dependent: ∃α ∈ Rd , α> φ(x) = constant ∀x Both minimal and overcomplete representations are useful.
(KDD 2012 Tutorial)
Graphical Models
77 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = β x (1 − β)1−x for x ∈ {0, 1} and β ∈ (0, 1). Does not look like an exponential family!
(KDD 2012 Tutorial)
Graphical Models
78 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = β x (1 − β)1−x for x ∈ {0, 1} and β ∈ (0, 1). Does not look like an exponential family! Can be rewritten as p(x) = exp (x log β + (1 − x) log(1 − β))
(KDD 2012 Tutorial)
Graphical Models
78 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = β x (1 − β)1−x for x ∈ {0, 1} and β ∈ (0, 1). Does not look like an exponential family! Can be rewritten as p(x) = exp (x log β + (1 − x) log(1 − β)) Now in exponential family form with φ1 (x) = x, φ2 (x) = 1 − x, θ1 = log β, θ2 = log(1 − β), and A(θ) = 0.
(KDD 2012 Tutorial)
Graphical Models
78 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = β x (1 − β)1−x for x ∈ {0, 1} and β ∈ (0, 1). Does not look like an exponential family! Can be rewritten as p(x) = exp (x log β + (1 − x) log(1 − β)) Now in exponential family form with φ1 (x) = x, φ2 (x) = 1 − x, θ1 = log β, θ2 = log(1 − β), and A(θ) = 0. Overcomplete: α1 = α2 = 1 makes α> φ(x) = 1 for all x
(KDD 2012 Tutorial)
Graphical Models
78 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = exp (x log β + (1 − x) log(1 − β)) Can be further rewritten as p(x) = exp (xθ − log(1 + exp(θ)))
(KDD 2012 Tutorial)
Graphical Models
79 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = exp (x log β + (1 − x) log(1 − β)) Can be further rewritten as p(x) = exp (xθ − log(1 + exp(θ))) Minimal exponential family with β φ(x) = x, θ = log 1−β , A(θ) = log(1 + exp(θ)).
(KDD 2012 Tutorial)
Graphical Models
79 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = exp (x log β + (1 − x) log(1 − β)) Can be further rewritten as p(x) = exp (xθ − log(1 + exp(θ))) Minimal exponential family with β φ(x) = x, θ = log 1−β , A(θ) = log(1 + exp(θ)). Many distributions (e.g., Gaussian, exponential, Poisson, Beta) are in the exponential family
(KDD 2012 Tutorial)
Graphical Models
79 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = exp( x log β + (1 − x) log(1 − β) )
Can be further rewritten as
  p(x) = exp( xθ − log(1 + exp(θ)) )
Minimal exponential family with φ(x) = x, θ = log( β / (1 − β) ), A(θ) = log(1 + exp(θ)).
Many distributions (e.g., Gaussian, exponential, Poisson, Beta) are in the exponential family,
but not all (e.g., the Laplace distribution).
Exponential Family and Variational Inference
Inference
Exponential Family Example 2: Ising Model
θs xs
θst
xt
pθ (x) = exp
X s∈V
X
θs xs +
θst xs xt − A(θ)
(s,t)∈E
Binary random variable xs ∈ {0, 1}
(KDD 2012 Tutorial)
Graphical Models
80 / 131
Exponential Family and Variational Inference
Inference
Exponential Family Example 2: Ising Model
θs xs
θst
xt
pθ (x) = exp
X s∈V
X
θs xs +
θst xs xt − A(θ)
(s,t)∈E
Binary random variable xs ∈ {0, 1} d = |V | + |E| sufficient statistics: φ(x) = (. . . xs . . . xst . . .)>
(KDD 2012 Tutorial)
Graphical Models
80 / 131
Exponential Family and Variational Inference
Inference
Exponential Family Example 2: Ising Model
[Figure: neighboring nodes x_s and x_t with parameters θ_s and θ_st]
  p_θ(x) = exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t − A(θ) )
Binary random variables x_s ∈ {0, 1}.
d = |V| + |E| sufficient statistics: φ(x) = (..., x_s, ..., x_s x_t, ...)^⊤
This is a regular (Ω = R^d), minimal exponential family.
Inference
Exponential Family and Variational Inference
Exponential Family Example 3: Potts Model θs xs
θst
xt
Similar to Ising model but generalizing xs ∈ {0, . . . , r − 1}.
(KDD 2012 Tutorial)
Graphical Models
81 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 3: Potts Model
Similar to the Ising model, but generalizing to x_s ∈ {0, ..., r − 1}.
Indicator functions: f_sj(x) = 1 if x_s = j and 0 otherwise; f_stjk(x) = 1 if x_s = j ∧ x_t = k, and 0 otherwise.
  p_θ(x) = exp( ∑_{sj} θ_sj f_sj(x) + ∑_{stjk} θ_stjk f_stjk(x) − A(θ) )
Inference
Exponential Family and Variational Inference
Exponential Family Example 3: Potts Model θs xs
θst
xt
pθ (x) = exp
X sj
θsj fsj (x) +
X
θstjk fstjk (x) − A(θ)
stjk
d = r|V | + r2 |E|
(KDD 2012 Tutorial)
Graphical Models
82 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 3: Potts Model θs xs
θst
xt
pθ (x) = exp
X
θsj fsj (x) +
sj
X
θstjk fstjk (x) − A(θ)
stjk
d = r|V | + r2 |E| Regular but overcomplete, because and all x.
(KDD 2012 Tutorial)
Pr−1
Graphical Models
j=0 θsj (x)
= 1 for any s ∈ V
82 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 3: Potts Model θs xs
θst
xt
pθ (x) = exp
X
θsj fsj (x) +
sj
X
θstjk fstjk (x) − A(θ)
stjk
d = r|V | + r2 |E| Regular but overcomplete, because and all x.
Pr−1
j=0 θsj (x)
= 1 for any s ∈ V
The Potts model is a special case where the parameters are tied: θstkk = α, and θstjk = β for j 6= k. (KDD 2012 Tutorial)
Graphical Models
82 / 131
Inference
Exponential Family and Variational Inference
Important Relationship
For sufficient statistics defined by indicator functions, e.g., φsj (x) = fsj (x) = 1 if xs = j and 0 otherwise
(KDD 2012 Tutorial)
Graphical Models
83 / 131
Inference
Exponential Family and Variational Inference
Important Relationship
For sufficient statistics defined by indicator functions, e.g., φsj (x) = fsj (x) = 1 if xs = j and 0 otherwise The marginal can be obtained via the mean Eθ [φsj (x)] = P (xs = j)
(KDD 2012 Tutorial)
Graphical Models
83 / 131
Inference
Exponential Family and Variational Inference
Important Relationship
For sufficient statistics defined by indicator functions, e.g., φsj (x) = fsj (x) = 1 if xs = j and 0 otherwise The marginal can be obtained via the mean Eθ [φsj (x)] = P (xs = j) Since inference is often about computing the marginal, in this case it is equivalent to computing the mean.
(KDD 2012 Tutorial)
Graphical Models
83 / 131
Inference
Exponential Family and Variational Inference
Important Relationship
For sufficient statistics defined by indicator functions, e.g., φsj (x) = fsj (x) = 1 if xs = j and 0 otherwise The marginal can be obtained via the mean Eθ [φsj (x)] = P (xs = j) Since inference is often about computing the marginal, in this case it is equivalent to computing the mean. We will come back to this point when we discuss variational inference...
(KDD 2012 Tutorial)
Graphical Models
83 / 131
Inference
Exponential Family and Variational Inference
Mean Parameters Let p be any density (not necessarily in exponential family).
(KDD 2012 Tutorial)
Graphical Models
84 / 131
Inference
Exponential Family and Variational Inference
Mean Parameters Let p be any density (not necessarily in exponential family). Given sufficient statistics φ, the mean parameters µ = (µ1 , . . . , µd )> is Z µi = Ep [φi (x)] = φi (x)p(x)dx
(KDD 2012 Tutorial)
Graphical Models
84 / 131
Inference
Exponential Family and Variational Inference
Mean Parameters Let p be any density (not necessarily in exponential family). Given sufficient statistics φ, the mean parameters µ = (µ1 , . . . , µd )> is Z µi = Ep [φi (x)] = φi (x)p(x)dx The set of mean parameters define a convex set: M = {µ ∈ Rd | ∃p s.t. Ep [φ(x)] = µ}
(KDD 2012 Tutorial)
Graphical Models
84 / 131
Inference
Exponential Family and Variational Inference
Mean Parameters Let p be any density (not necessarily in exponential family). Given sufficient statistics φ, the mean parameters µ = (µ1 , . . . , µd )> is Z µi = Ep [φi (x)] = φi (x)p(x)dx The set of mean parameters define a convex set: M = {µ ∈ Rd | ∃p s.t. Ep [φ(x)] = µ} If µ(1) , µ(2) ∈ M, there must exist p(1) , p(2)
(KDD 2012 Tutorial)
Graphical Models
84 / 131
Inference
Exponential Family and Variational Inference
Mean Parameters Let p be any density (not necessarily in exponential family). Given sufficient statistics φ, the mean parameters µ = (µ1 , . . . , µd )> is Z µi = Ep [φi (x)] = φi (x)p(x)dx The set of mean parameters define a convex set: M = {µ ∈ Rd | ∃p s.t. Ep [φ(x)] = µ} If µ(1) , µ(2) ∈ M, there must exist p(1) , p(2) The convex combinations of p(1) , p(2) leads to another mean parameter in M
(KDD 2012 Tutorial)
Graphical Models
84 / 131
Inference
Exponential Family and Variational Inference
Mean Parameters
Let p be any density (not necessarily in the exponential family).
Given sufficient statistics φ, the mean parameters µ = (µ_1, ..., µ_d)^⊤ are
  µ_i = E_p[φ_i(x)] = ∫ φ_i(x) p(x) dx
The set of mean parameters defines a convex set:
  M = {µ ∈ R^d | ∃p s.t. E_p[φ(x)] = µ}
If µ^(1), µ^(2) ∈ M, there must exist p^(1), p^(2).
A convex combination of p^(1), p^(2) gives another mean parameter in M.
Therefore M is convex.
Inference
Exponential Family and Variational Inference
Example: The First Two Moments
Let φ1 (x) = x, φ2 (x) = x2
(KDD 2012 Tutorial)
Graphical Models
85 / 131
Inference
Exponential Family and Variational Inference
Example: The First Two Moments
Let φ1 (x) = x, φ2 (x) = x2 For any p (not necessarily Gaussian) on x, the mean parameters µ = (µ1 , µ2 ) = (E(x), E(x2 ))> .
(KDD 2012 Tutorial)
Graphical Models
85 / 131
Inference
Exponential Family and Variational Inference
Example: The First Two Moments
Let φ1 (x) = x, φ2 (x) = x2 For any p (not necessarily Gaussian) on x, the mean parameters µ = (µ1 , µ2 ) = (E(x), E(x2 ))> . Note V(x) = E(x2 ) − E2 (x) = µ2 − µ21 ≥ 0 for any p
(KDD 2012 Tutorial)
Graphical Models
85 / 131
Inference
Exponential Family and Variational Inference
Example: The First Two Moments
Let φ_1(x) = x, φ_2(x) = x².
For any p (not necessarily Gaussian) on x, the mean parameters are µ = (µ_1, µ_2)^⊤ = (E(x), E(x²))^⊤.
Note V(x) = E(x²) − E(x)² = µ_2 − µ_1² ≥ 0 for any p.
So M is not R × R_+ but rather the subset {(µ_1, µ_2) | µ_2 ≥ µ_1²}.
Inference
Exponential Family and Variational Inference
The Marginal Polytope
The marginal polytope is defined for discrete xs
(KDD 2012 Tutorial)
Graphical Models
86 / 131
Inference
Exponential Family and Variational Inference
The Marginal Polytope
The marginal polytope is defined for discrete xs P Recall M = {µ ∈ Rd | µ = x φ(x)p(x) for some p}
(KDD 2012 Tutorial)
Graphical Models
86 / 131
Inference
Exponential Family and Variational Inference
The Marginal Polytope
The marginal polytope is defined for discrete xs P Recall M = {µ ∈ Rd | µ = x φ(x)p(x) for some p} p can be a point mass function on a particular x.
(KDD 2012 Tutorial)
Graphical Models
86 / 131
Inference
Exponential Family and Variational Inference
The Marginal Polytope
The marginal polytope is defined for discrete xs P Recall M = {µ ∈ Rd | µ = x φ(x)p(x) for some p} p can be a point mass function on a particular x. In fact any p is a convex combination of such point mass functions.
(KDD 2012 Tutorial)
Graphical Models
86 / 131
Inference
Exponential Family and Variational Inference
The Marginal Polytope
The marginal polytope is defined for discrete x_s.
Recall M = {µ ∈ R^d | µ = ∑_x φ(x) p(x) for some p}.
p can be a point mass function on a particular x.
In fact any p is a convex combination of such point mass functions.
M = conv{φ(x), ∀x} is a convex hull, called the marginal polytope.
Inference
Exponential Family and Variational Inference
Marginal Polytope Example Tiny Ising model: two nodes x1 , x2 ∈ {0, 1} connected by an edge. minimal sufficient statistics φ(x1 , x2 ) = (x1 , x2 , x1 x2 )> .
(KDD 2012 Tutorial)
Graphical Models
87 / 131
Inference
Exponential Family and Variational Inference
Marginal Polytope Example Tiny Ising model: two nodes x1 , x2 ∈ {0, 1} connected by an edge. minimal sufficient statistics φ(x1 , x2 ) = (x1 , x2 , x1 x2 )> . only 4 different x = (x1 , x2 ).
(KDD 2012 Tutorial)
Graphical Models
87 / 131
Inference
Exponential Family and Variational Inference
Marginal Polytope Example Tiny Ising model: two nodes x1 , x2 ∈ {0, 1} connected by an edge. minimal sufficient statistics φ(x1 , x2 ) = (x1 , x2 , x1 x2 )> . only 4 different x = (x1 , x2 ). the marginal polytope is M = conv{(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)}
(KDD 2012 Tutorial)
Graphical Models
87 / 131
Inference
Exponential Family and Variational Inference
Marginal Polytope Example Tiny Ising model: two nodes x1 , x2 ∈ {0, 1} connected by an edge. minimal sufficient statistics φ(x1 , x2 ) = (x1 , x2 , x1 x2 )> . only 4 different x = (x1 , x2 ). the marginal polytope is M = conv{(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)}
the convex hull is a polytope inside the unit cube.
(KDD 2012 Tutorial)
Graphical Models
87 / 131
Inference
Exponential Family and Variational Inference
Marginal Polytope Example Tiny Ising model: two nodes x1 , x2 ∈ {0, 1} connected by an edge. minimal sufficient statistics φ(x1 , x2 ) = (x1 , x2 , x1 x2 )> . only 4 different x = (x1 , x2 ). the marginal polytope is M = conv{(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)}
the convex hull is a polytope inside the unit cube. the three coordinates are node marginals µ1 ≡ Ep [x1 = 1], µ2 ≡ Ep [x2 = 1] and edge marginal µ12 ≡ Ep [x1 = x2 = 1], hence the name. (KDD 2012 Tutorial)
Graphical Models
87 / 131
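The four vertices and the resulting mean parameters are easy to enumerate. The short Python check below (not from the tutorial; the example distribution p is arbitrary) lists φ(x) for the four configurations and shows how any distribution over them yields a mean parameter that is a convex combination of the vertices.

from itertools import product

def phi(x1, x2):
    # minimal sufficient statistics of the tiny Ising model
    return (x1, x2, x1 * x2)

configs = list(product([0, 1], repeat=2))
vertices = [phi(x1, x2) for x1, x2 in configs]
print("vertices of the marginal polytope:", vertices)
# -> [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]

# An arbitrary distribution p over the four configurations; its mean parameter
# is the corresponding convex combination of the vertices.
p = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
mu = [0.0, 0.0, 0.0]
for cfg, w in p.items():
    for k, f in enumerate(phi(*cfg)):
        mu[k] += w * f
print("mean parameters (mu1, mu2, mu12):", mu)
# mu1 = P(x1 = 1), mu2 = P(x2 = 1), mu12 = P(x1 = x2 = 1): the node and edge marginals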
Inference
Exponential Family and Variational Inference
The Log Partition Function A
For any regular exponential family, A(θ) is convex in θ.
(KDD 2012 Tutorial)
Graphical Models
88 / 131
Inference
Exponential Family and Variational Inference
The Log Partition Function A
For any regular exponential family, A(θ) is convex in θ. Strictly convex for minimal exponential family.
(KDD 2012 Tutorial)
Graphical Models
88 / 131
Inference
Exponential Family and Variational Inference
The Log Partition Function A
For any regular exponential family, A(θ) is convex in θ. Strictly convex for minimal exponential family. Nice property:
(KDD 2012 Tutorial)
∂A(θ) ∂θi
= Eθ [φi (x)]
Graphical Models
88 / 131
Inference
Exponential Family and Variational Inference
The Log Partition Function A
For any regular exponential family, A(θ) is convex in θ.
Strictly convex for a minimal exponential family.
Nice property: ∂A(θ)/∂θ_i = E_θ[φ_i(x)]
Therefore ∇A = µ, the mean parameters of p_θ.
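This identity can be checked numerically for the minimal Bernoulli family, where A(θ) = log(1 + exp(θ)). The finite-difference step below is only for illustration.

import math

def A(theta):                       # log partition of the minimal Bernoulli family
    return math.log(1.0 + math.exp(theta))

theta = 0.7
eps = 1e-6
grad_A = (A(theta + eps) - A(theta - eps)) / (2 * eps)   # numerical dA/dtheta
mean_phi = math.exp(theta) / (1.0 + math.exp(theta))     # E_theta[phi(x)] = P(x = 1)
print(grad_A, mean_phi)   # the two numbers agree: grad A equals the mean parameter mu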
Inference
Exponential Family and Variational Inference
Conjugate Duality The conjugate dual function A∗ to A is defined as A∗ (µ) = sup θ> µ − A(θ) θ∈Ω
(KDD 2012 Tutorial)
Graphical Models
89 / 131
Inference
Exponential Family and Variational Inference
Conjugate Duality The conjugate dual function A∗ to A is defined as A∗ (µ) = sup θ> µ − A(θ) θ∈Ω
Such definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition.
(KDD 2012 Tutorial)
Graphical Models
89 / 131
Inference
Exponential Family and Variational Inference
Conjugate Duality The conjugate dual function A∗ to A is defined as A∗ (µ) = sup θ> µ − A(θ) θ∈Ω
Such definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition. For any µ ∈ M’s interior, let θ(µ) satisfy Eθ(µ) [φ(x)] = ∇A(θ(µ)) = µ
(KDD 2012 Tutorial)
Graphical Models
89 / 131
Inference
Exponential Family and Variational Inference
Conjugate Duality The conjugate dual function A∗ to A is defined as A∗ (µ) = sup θ> µ − A(θ) θ∈Ω
Such definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition. For any µ ∈ M’s interior, let θ(µ) satisfy Eθ(µ) [φ(x)] = ∇A(θ(µ)) = µ Then A∗ (µ) = −H(pθ(µ) ), the negative entropy.
(KDD 2012 Tutorial)
Graphical Models
89 / 131
Inference
Exponential Family and Variational Inference
Conjugate Duality The conjugate dual function A∗ to A is defined as A∗ (µ) = sup θ> µ − A(θ) θ∈Ω
Such definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition. For any µ ∈ M’s interior, let θ(µ) satisfy Eθ(µ) [φ(x)] = ∇A(θ(µ)) = µ Then A∗ (µ) = −H(pθ(µ) ), the negative entropy. The dual of the dual gives back A: A(θ) = supµ∈M µ> θ − A∗ (µ)
(KDD 2012 Tutorial)
Graphical Models
89 / 131
Inference
Exponential Family and Variational Inference
Conjugate Duality
The conjugate dual function A* to A is defined as
  A*(µ) = sup_{θ∈Ω} θ^⊤ µ − A(θ)
Such a definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition.
For any µ in the interior of M, let θ(µ) satisfy E_{θ(µ)}[φ(x)] = ∇A(θ(µ)) = µ.
Then A*(µ) = −H(p_{θ(µ)}), the negative entropy.
The dual of the dual gives back A: A(θ) = sup_{µ∈M} µ^⊤ θ − A*(µ).
For all θ ∈ Ω, the supremum is attained uniquely at the µ ∈ M° given by the moment matching condition µ = E_θ[φ(x)].
Inference
Exponential Family and Variational Inference
Example: Conjugate Dual for Bernoulli
Recall the minimal exponential family for Bernoulli with φ(x) = x, A(θ) = log(1 + exp(θ)), Ω = R.
(KDD 2012 Tutorial)
Graphical Models
90 / 131
Inference
Exponential Family and Variational Inference
Example: Conjugate Dual for Bernoulli
Recall the minimal exponential family for Bernoulli with φ(x) = x, A(θ) = log(1 + exp(θ)), Ω = R. By definition A∗ (µ) = sup θµ − log(1 + exp(θ)) θ∈R
(KDD 2012 Tutorial)
Graphical Models
90 / 131
Inference
Exponential Family and Variational Inference
Example: Conjugate Dual for Bernoulli
Recall the minimal exponential family for Bernoulli with φ(x) = x, A(θ) = log(1 + exp(θ)), Ω = R.
By definition,
  A*(µ) = sup_{θ∈R} θµ − log(1 + exp(θ))
Taking the derivative and solving,
  A*(µ) = µ log µ + (1 − µ) log(1 − µ)
i.e., the negative entropy.
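The same dual can be checked numerically: a crude grid search for sup_θ {θµ − A(θ)} recovers the negative entropy. The grid range and resolution below are arbitrary illustrative choices.

import math

def A(theta):
    return math.log(1.0 + math.exp(theta))

def A_star_numeric(mu):
    # crude grid search for sup_theta { theta * mu - A(theta) } over theta in [-20, 20]
    return max(t * mu - A(t) for t in (i / 1000.0 for i in range(-20000, 20001)))

def A_star_exact(mu):
    return mu * math.log(mu) + (1 - mu) * math.log(1 - mu)

for mu in (0.1, 0.5, 0.9):
    print(mu, A_star_numeric(mu), A_star_exact(mu))   # the two columns match closely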
Inference
Exponential Family and Variational Inference
Inference with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
The supremum is attained by µ = Eθ [φ(x)].
(KDD 2012 Tutorial)
Graphical Models
91 / 131
Inference
Exponential Family and Variational Inference
Inference with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
The supremum is attained by µ = Eθ [φ(x)]. Want to compute the marginals P (xs = j)? They are the mean parameters µsj = Eθ [φij (x)] under standard overcomplete representation.
(KDD 2012 Tutorial)
Graphical Models
91 / 131
Inference
Exponential Family and Variational Inference
Inference with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
The supremum is attained by µ = Eθ [φ(x)]. Want to compute the marginals P (xs = j)? They are the mean parameters µsj = Eθ [φij (x)] under standard overcomplete representation. Want to compute the mean parameters µsj ? They are the arg sup to the optimization problem above.
(KDD 2012 Tutorial)
Graphical Models
91 / 131
Inference
Exponential Family and Variational Inference
Inference with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
The supremum is attained by µ = Eθ [φ(x)]. Want to compute the marginals P (xs = j)? They are the mean parameters µsj = Eθ [φij (x)] under standard overcomplete representation. Want to compute the mean parameters µsj ? They are the arg sup to the optimization problem above. This variational representation is exact, not approximate
(KDD 2012 Tutorial)
Graphical Models
91 / 131
Inference
Exponential Family and Variational Inference
Inference with Variational Representation
  A(θ) = sup_{µ∈M} µ^⊤ θ − A*(µ)
The supremum is attained by µ = E_θ[φ(x)].
Want to compute the marginals P(x_s = j)? They are the mean parameters µ_sj = E_θ[φ_sj(x)] under the standard overcomplete representation.
Want to compute the mean parameters µ_sj? They are the arg sup of the optimization problem above.
This variational representation is exact, not approximate.
We will relax it next to derive loopy BP and mean field.
(KDD 2012 Tutorial)
Graphical Models
91 / 131
Inference
Exponential Family and Variational Inference
The Difficulties with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
Difficult to solve even though it is a convex problem
(KDD 2012 Tutorial)
Graphical Models
92 / 131
Inference
Exponential Family and Variational Inference
The Difficulties with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
Difficult to solve even though it is a convex problem Two issues:
(KDD 2012 Tutorial)
Graphical Models
92 / 131
Inference
Exponential Family and Variational Inference
The Difficulties with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
Difficult to solve even though it is a convex problem Two issues: Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices)
(KDD 2012 Tutorial)
Graphical Models
92 / 131
Inference
Exponential Family and Variational Inference
The Difficulties with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
Difficult to solve even though it is a convex problem Two issues: Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices) The dual function A∗ (µ) usually does not admit an explicit form.
(KDD 2012 Tutorial)
Graphical Models
92 / 131
Inference
Exponential Family and Variational Inference
The Difficulties with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
Difficult to solve even though it is a convex problem Two issues: Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices) The dual function A∗ (µ) usually does not admit an explicit form.
Variational approximation modifies the optimization problem so that it is tractable, at the price of an approximate solution.
(KDD 2012 Tutorial)
Graphical Models
92 / 131
Inference
Exponential Family and Variational Inference
The Difficulties with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
Difficult to solve even though it is a convex problem Two issues: Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices) The dual function A∗ (µ) usually does not admit an explicit form.
Variational approximation modifies the optimization problem so that it is tractable, at the price of an approximate solution. Next, we cast mean field and sum-product algorithms as variational approximations.
(KDD 2012 Tutorial)
Graphical Models
92 / 131
Inference
Exponential Family and Variational Inference
The Mean Field Method as Variational Approximation
The mean field method replaces M with a simpler subset M(F ) on which A∗ (µ) has a closed form.
(KDD 2012 Tutorial)
Graphical Models
93 / 131
Inference
Exponential Family and Variational Inference
The Mean Field Method as Variational Approximation
The mean field method replaces M with a simpler subset M(F ) on which A∗ (µ) has a closed form. Consider the fully disconnected subgraph F = (V, ∅) of the original graph G = (V, E)
(KDD 2012 Tutorial)
Graphical Models
93 / 131
Inference
Exponential Family and Variational Inference
The Mean Field Method as Variational Approximation
The mean field method replaces M with a simpler subset M(F ) on which A∗ (µ) has a closed form. Consider the fully disconnected subgraph F = (V, ∅) of the original graph G = (V, E) Set all θi = 0 if φi involves edges
(KDD 2012 Tutorial)
Graphical Models
93 / 131
Exponential Family and Variational Inference
Inference
The Mean Field Method as Variational Approximation
The mean field method replaces M with a simpler subset M(F) on which A*(µ) has a closed form.
Consider the fully disconnected subgraph F = (V, ∅) of the original graph G = (V, E).
Set θ_i = 0 for all φ_i that involve edges.
The densities in this sub-family are all fully factorized:
  p_θ(x) = ∏_{s∈V} p(x_s ; θ_s)
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M Recall M is the convex hull of extreme points {φ(x)}.
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M Recall M is the convex hull of extreme points {φ(x)}. It turns out the extreme points {φ(x)} ∈ M(F ).
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M Recall M is the convex hull of extreme points {φ(x)}. It turns out the extreme points {φ(x)} ∈ M(F ). Example:
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M Recall M is the convex hull of extreme points {φ(x)}. It turns out the extreme points {φ(x)} ∈ M(F ). Example: The tiny Ising model x1 , x2 ∈ {0, 1} with φ = (x1 , x2 , x1 x2 )>
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M Recall M is the convex hull of extreme points {φ(x)}. It turns out the extreme points {φ(x)} ∈ M(F ). Example: The tiny Ising model x1 , x2 ∈ {0, 1} with φ = (x1 , x2 , x1 x2 )> The point mass distribution p(x = (0, 1)> ) = 1 is realized as a limit to the series p(x) = exp(θ1 x1 + θ2 x2 − A(θ)) where θ1 → −∞ and θ2 → ∞.
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M Recall M is the convex hull of extreme points {φ(x)}. It turns out the extreme points {φ(x)} ∈ M(F ). Example: The tiny Ising model x1 , x2 ∈ {0, 1} with φ = (x1 , x2 , x1 x2 )> The point mass distribution p(x = (0, 1)> ) = 1 is realized as a limit to the series p(x) = exp(θ1 x1 + θ2 x2 − A(θ)) where θ1 → −∞ and θ2 → ∞. This series is in F because θ12 = 0.
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M Recall M is the convex hull of extreme points {φ(x)}. It turns out the extreme points {φ(x)} ∈ M(F ). Example: The tiny Ising model x1 , x2 ∈ {0, 1} with φ = (x1 , x2 , x1 x2 )> The point mass distribution p(x = (0, 1)> ) = 1 is realized as a limit to the series p(x) = exp(θ1 x1 + θ2 x2 − A(θ)) where θ1 → −∞ and θ2 → ∞. This series is in F because θ12 = 0. Hence the extreme point φ(x) = (0, 1, 0) is in M(F ).
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F )
[courtesy Nick Bridle]
Because the extreme points of M are in M(F ), if M(F ) were convex, we would have M = M(F ).
(KDD 2012 Tutorial)
Graphical Models
95 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F )
[courtesy Nick Bridle]
Because the extreme points of M are in M(F ), if M(F ) were convex, we would have M = M(F ). But in general M(F ) is a true subset of M
(KDD 2012 Tutorial)
Graphical Models
95 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F )
[courtesy Nick Bridle]
Because the extreme points of M are in M(F ), if M(F ) were convex, we would have M = M(F ). But in general M(F ) is a true subset of M Therefore, M(F ) is a nonconvex inner set of M
The Mean Field Method as Variational Approximation
Recall the exact variational problem
    A(θ) = sup_{µ∈M} µᵀθ − A*(µ),
attained by the solution to the inference problem µ = E_θ[φ(x)].
The mean field method simply replaces M with M(F):
    L(θ) = sup_{µ∈M(F)} µᵀθ − A*(µ)
Obviously L(θ) ≤ A(θ).
The original solution µ* may not be in M(F).
Even if µ* ∈ M(F), the optimization may hit a local maximum and not find it.
Why bother? Because A*(µ) = −H(p_{θ(µ)}) has a very simple form on M(F).
Example: Mean Field for Ising Model
The mean parameters for the Ising model are the node and edge marginals:
    µ_s = p(x_s = 1),  µ_st = p(x_s = 1, x_t = 1)
Fully factorized M(F) means no edges: µ_st = µ_s µ_t.
On M(F), the dual function A*(µ) has the simple form
    A*(µ) = −Σ_{s∈V} H(µ_s) = Σ_{s∈V} [ µ_s log µ_s + (1 − µ_s) log(1 − µ_s) ]
Thus the mean field problem is
    L(θ) = sup_{µ∈M(F)} µᵀθ − Σ_{s∈V} ( µ_s log µ_s + (1 − µ_s) log(1 − µ_s) )
         = max_{(µ_1,...,µ_m)∈[0,1]^m} Σ_{s∈V} θ_s µ_s + Σ_{(s,t)∈E} θ_st µ_s µ_t + Σ_{s∈V} H(µ_s)
Example: Mean Field for Ising Model
    L(θ) = max_{(µ_1,...,µ_m)∈[0,1]^m} Σ_{s∈V} θ_s µ_s + Σ_{(s,t)∈E} θ_st µ_s µ_t + Σ_{s∈V} H(µ_s)
The objective is bilinear in µ, so it is not jointly concave.
But it is concave in a single coordinate µ_s with the others fixed.
Iterative coordinate-wise maximization: fix µ_t for all t ≠ s and optimize µ_s.
Setting the partial derivative w.r.t. µ_s to 0 yields
    µ_s = 1 / ( 1 + exp( −( θ_s + Σ_{t:(s,t)∈E} θ_st µ_t ) ) ),
as we have seen before.
Caution: mean field converges to a local maximum that depends on the initialization of µ_1, ..., µ_m.
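The coordinate-wise update translates directly into code. The following is a minimal sketch of naive mean field for a pairwise Ising model; the function name, data layout, and the chain graph and parameter values in the usage lines are illustrative choices, not taken from the tutorial. Note that the answer depends on the random initialization, as cautioned above.

import numpy as np

def mean_field_ising(theta_node, theta_edge, edges, n_iters=100, seed=0):
    """Naive mean field for a pairwise Ising model on x_s in {0,1}:
    coordinate-wise updates mu_s = sigmoid(theta_s + sum_t theta_st * mu_t)."""
    rng = np.random.default_rng(seed)
    m = len(theta_node)
    mu = rng.uniform(size=m)          # result depends on this initialization
    nbrs = {s: [] for s in range(m)}
    for (s, t) in edges:
        nbrs[s].append(t); nbrs[t].append(s)
    w = {}
    for (s, t), v in theta_edge.items():
        w[(s, t)] = v; w[(t, s)] = v
    for _ in range(n_iters):
        for s in range(m):            # coordinate-wise maximization
            field = theta_node[s] + sum(w[(s, t)] * mu[t] for t in nbrs[s])
            mu[s] = 1.0 / (1.0 + np.exp(-field))
    return mu

# A 3-node chain 0 - 1 - 2 (illustrative parameters)
edges = [(0, 1), (1, 2)]
mu = mean_field_ising(np.array([0.5, -0.2, 0.1]),
                      {(0, 1): 1.0, (1, 2): -0.5}, edges)
print(mu)   # approximate marginals p(x_s = 1)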
The Sum-Product Algorithm as Variational Approximation
    A(θ) = sup_{µ∈M} µᵀθ − A*(µ)
The sum-product algorithm makes two approximations:
it relaxes M to an outer set L;
it replaces the dual A* with an approximation Ã*:
    Ã(θ) = sup_{µ∈L} µᵀθ − Ã*(µ)
The Outer Relaxation
For overcomplete exponential families on discrete nodes, the mean parameters are node and edge marginals
    µ_sj = p(x_s = j),  µ_st,jk = p(x_s = j, x_t = k).
The marginal polytope is M = {µ | ∃ p with marginals µ}.
Now consider τ ∈ R^d_+ satisfying the "node normalization" and "edge-node marginal consistency" conditions:
    Σ_{j=0}^{r−1} τ_sj = 1            ∀ s ∈ V
    Σ_{k=0}^{r−1} τ_st,jk = τ_sj      ∀ s, t ∈ V, j = 0, ..., r − 1
    Σ_{j=0}^{r−1} τ_st,jk = τ_tk      ∀ s, t ∈ V, k = 0, ..., r − 1
Define L = {τ satisfying the above conditions}.
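As a sketch of what membership in L means computationally, the hypothetical helper below checks node normalization and edge-node consistency for given pseudomarginals; the data layout (dicts of node vectors and edge matrices) is an assumption made purely for illustration.

import numpy as np

def in_local_polytope(tau_node, tau_edge, tol=1e-8):
    """Check whether pseudomarginals lie in the local polytope L.
    tau_node[s] is a length-r vector; tau_edge[(s, t)] is an r x r matrix
    whose (j, k) entry plays the role of tau_{st,jk}."""
    for s, vec in tau_node.items():
        if np.any(vec < -tol) or abs(vec.sum() - 1.0) > tol:   # node normalization
            return False
    for (s, t), T in tau_edge.items():
        if np.any(T < -tol):
            return False
        if not np.allclose(T.sum(axis=1), tau_node[s], atol=tol):  # sum over k gives tau_sj
            return False
        if not np.allclose(T.sum(axis=0), tau_node[t], atol=tol):  # sum over j gives tau_tk
            return False
    return True

# A pseudomarginal on a single edge that is locally consistent
tau_node = {0: np.array([0.5, 0.5]), 1: np.array([0.5, 0.5])}
tau_edge = {(0, 1): np.array([[0.1, 0.4], [0.4, 0.1]])}
print(in_local_polytope(tau_node, tau_edge))   # True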
The Outer Relaxation
If the graph is a tree, then M = L.
If the graph has cycles, then M ⊂ L: L does not enforce other constraints that true marginals need to satisfy.
Nice property: L is still a polytope, but a much simpler one than M.
[Figure: the outer polytope L containing M, whose vertices are the φ(x); a point τ ∈ L may lie outside M]
The first approximation in sum-product is to replace M with L.
Approximating A*
Recall µ are node and edge marginals.
If the graph is a tree, one can exactly reconstruct the joint probability:
    p_µ(x) = Π_{s∈V} µ_s,x_s  ·  Π_{(s,t)∈E} µ_st,x_s x_t / ( µ_s,x_s µ_t,x_t )
If the graph is a tree, the entropy of the joint distribution is
    H(p_µ) = − Σ_{s∈V} Σ_{j=0}^{r−1} µ_sj log µ_sj − Σ_{(s,t)∈E} Σ_{j,k} µ_st,jk log ( µ_st,jk / (µ_sj µ_tk) )
Neither holds for graphs with cycles.
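Both facts are easy to verify numerically on a small tree. The sketch below builds a chain-structured distribution (the potentials are made up for illustration), reconstructs the joint from its node and edge marginals, and checks that the marginal-based entropy matches the true entropy.

import numpy as np
from itertools import product

# A 3-node chain x0 - x1 - x2 with binary variables (illustrative potentials).
psi01 = np.array([[2.0, 1.0], [1.0, 3.0]])
psi12 = np.array([[1.0, 2.0], [2.0, 1.0]])
p = np.zeros((2, 2, 2))
for x0, x1, x2 in product(range(2), repeat=3):
    p[x0, x1, x2] = psi01[x0, x1] * psi12[x1, x2]
p /= p.sum()

mu0, mu1, mu2 = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))   # node marginals
mu01, mu12 = p.sum(2), p.sum(0)                               # edge marginals

# Tree reconstruction: p(x) = prod_s mu_s(x_s) * prod_(s,t) mu_st(x_s,x_t)/(mu_s(x_s) mu_t(x_t))
recon = np.zeros_like(p)
for x0, x1, x2 in product(range(2), repeat=3):
    recon[x0, x1, x2] = (mu0[x0] * mu1[x1] * mu2[x2]
                         * mu01[x0, x1] / (mu0[x0] * mu1[x1])
                         * mu12[x1, x2] / (mu1[x1] * mu2[x2]))
print(np.allclose(recon, p))        # True on a tree

# Entropy computed from the marginals matches the true entropy on a tree
H_true = -(p * np.log(p)).sum()
H_node = -sum((m * np.log(m)).sum() for m in (mu0, mu1, mu2))
H_edge = ((mu01 * np.log(mu01 / np.outer(mu0, mu1))).sum()
          + (mu12 * np.log(mu12 / np.outer(mu1, mu2))).sum())
print(np.isclose(H_true, H_node - H_edge))   # True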
Approximating A*
Define the Bethe entropy for τ ∈ L on loopy graphs in the same way:
    H_Bethe(p_τ) = − Σ_{s∈V} Σ_{j=0}^{r−1} τ_sj log τ_sj − Σ_{(s,t)∈E} Σ_{j,k} τ_st,jk log ( τ_st,jk / (τ_sj τ_tk) )
Note τ is not a true marginal, and H_Bethe is not a true entropy.
The second approximation in sum-product is to replace A*(τ) with −H_Bethe(p_τ).
The Sum-Product Algorithm as Variational Approximation
With these two approximations, we arrive at the variational problem
    Ã(θ) = sup_{τ∈L} τᵀθ + H_Bethe(p_τ)
Optimality conditions require the gradients to vanish w.r.t. both τ and the Lagrange multipliers on the constraints τ ∈ L.
The sum-product algorithm can be derived as an iterative fixed-point procedure that satisfies these optimality conditions.
At the solution, Ã(θ) is not guaranteed to be either an upper or a lower bound on A(θ).
τ may not correspond to a true marginal distribution.
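For concreteness, here is a sketch that merely evaluates the Bethe objective τᵀθ + H_Bethe(p_τ) for given pseudomarginals in an overcomplete pairwise model; sum-product can be viewed as searching for a stationary point of this function over L. The data layout and the numbers in the usage lines are illustrative assumptions, not from the tutorial.

import numpy as np

def bethe_objective(theta_node, theta_edge, tau_node, tau_edge):
    """Evaluate tau^T theta + H_Bethe(tau) for an overcomplete pairwise model.
    theta_node[s]: vector theta_{sj}; theta_edge[(s,t)]: matrix theta_{st,jk};
    tau_node / tau_edge: pseudomarginals of the same shapes, assumed to lie in L."""
    lin = sum(tau_node[s] @ theta_node[s] for s in theta_node)
    lin += sum((tau_edge[e] * theta_edge[e]).sum() for e in theta_edge)
    h_node = -sum((tau_node[s] * np.log(tau_node[s])).sum() for s in tau_node)
    h_edge = sum((tau_edge[(s, t)]
                  * np.log(tau_edge[(s, t)]
                           / np.outer(tau_node[s], tau_node[t]))).sum()
                 for (s, t) in tau_edge)
    return lin + (h_node - h_edge)

# Toy single-edge example
theta_node = {0: np.array([0.0, 0.5]), 1: np.array([0.0, -0.2])}
theta_edge = {(0, 1): np.array([[0.3, 0.0], [0.0, 0.3]])}
tau_node = {0: np.array([0.5, 0.5]), 1: np.array([0.6, 0.4])}
tau_edge = {(0, 1): np.array([[0.35, 0.15], [0.25, 0.25]])}
print(bethe_objective(theta_node, theta_edge, tau_node, tau_edge))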
Summary: Variational Inference
The sum-product algorithm (loopy belief propagation)
The mean field method
Not covered: Expectation Propagation
Efficient computation, but often an unknown bias in the solution.
Maximizing Problems
Recall the HMM example:
[HMM diagram: two states 1 and 2, with transition labels 1/4 and 1/2; emissions P(x | z=1) = (1/2, 1/4, 1/4) and P(x | z=2) = (1/4, 1/2, 1/4) over R, G, B; initial probabilities π_1 = π_2 = 1/2]
There are two senses of "best states" z_{1:N} given x_{1:N}:
1. So far we computed the marginal p(z_n | x_{1:N}). We can define "best" as z_n* = argmax_k p(z_n = k | x_{1:N}). However z*_{1:N} as a whole may not be the best; in fact z*_{1:N} can even have zero probability!
2. An alternative is to find
       z*_{1:N} = argmax_{z_{1:N}} p(z_{1:N} | x_{1:N}),
   which is the most likely state configuration as a whole. The max-sum algorithm solves this; it generalizes the Viterbi algorithm for HMMs.
Intermediate: The Max-Product Algorithm
Simple modification to the sum-product algorithm: in the factor-to-variable messages, replace the sum over x_1, ..., x_M with a max.
    µ_{fs→x}(x) = max_{x_1, ..., x_M}  fs(x, x_1, ..., x_M) Π_{m=1}^{M} µ_{x_m→fs}(x_m)
    µ_{x_m→fs}(x_m) = Π_{f∈ne(x_m)\fs} µ_{f→x_m}(x_m)
    µ_{x_leaf→f}(x) = 1
    µ_{f_leaf→x}(x) = f(x)
Intermediate: The Max-Product Algorithm
As in sum-product, pick an arbitrary variable node x as the root.
Pass messages up from the leaves until they reach the root.
Unlike sum-product, do not pass messages back from the root to the leaves.
At the root, multiply the incoming messages:
    p_max = max_x Π_{f∈ne(x)} µ_{f→x}(x)
This is the probability of the most likely state configuration.
Intermediate: The Max-Product Algorithm
To identify the configuration itself, keep back pointers:
When creating the message
    µ_{fs→x}(x) = max_{x_1, ..., x_M}  fs(x, x_1, ..., x_M) Π_{m=1}^{M} µ_{x_m→fs}(x_m),
for each value of x we separately create M pointers back to the values of x_1, ..., x_M that achieve the maximum.
At the root, backtrack the pointers.
Intermediate: The Max-Product Algorithm
[Factor graph for the first two steps of the HMM: f1 — z1 — f2 — z2, with f1(z1) = P(z1)P(x1 | z1) and f2(z1, z2) = P(z2 | z1)P(x2 | z2); emissions P(x | z=1) = (1/2, 1/4, 1/4), P(x | z=2) = (1/4, 1/2, 1/4) over R, G, B; π_1 = π_2 = 1/2; observations x1 = R, x2 = G]
Message from leaf f1:
    µ_{f1→z1}(z1 = 1) = P(z1 = 1) P(R | z1 = 1) = 1/2 · 1/2 = 1/4
    µ_{f1→z1}(z1 = 2) = P(z1 = 2) P(R | z1 = 2) = 1/2 · 1/4 = 1/8
The second message:
    µ_{z1→f2}(z1 = 1) = 1/4
    µ_{z1→f2}(z1 = 2) = 1/8
The first element of the message from f2 to z2:
    µ_{f2→z2}(z2 = 1) = max_{z1} f2(z1, z2) µ_{z1→f2}(z1)
                      = max_{z1} P(z2 = 1 | z1) P(x2 = G | z2 = 1) µ_{z1→f2}(z1)
                      = max(1/4 · 1/4 · 1/4, 1/2 · 1/4 · 1/8) = 1/64
Back pointer for z2 = 1: either z1 = 1 or z1 = 2 (a tie).
The other element of the same message:
    µ_{f2→z2}(z2 = 2) = max_{z1} f2(z1, z2) µ_{z1→f2}(z1)
                      = max_{z1} P(z2 = 2 | z1) P(x2 = G | z2 = 2) µ_{z1→f2}(z1)
                      = max(3/4 · 1/2 · 1/4, 1/2 · 1/2 · 1/8) = 3/32
Back pointer for z2 = 2: z1 = 1
Summarizing the message and its back pointers:
    µ_{f2→z2}(z2 = 1) = 1/64   (back pointer: z1 = 1 or 2)
    µ_{f2→z2}(z2 = 2) = 3/32   (back pointer: z1 = 1)
At the root z2:
    max_{s=1,2} µ_{f2→z2}(s) = 3/32, attained at z2 = 2, which points back to z1 = 1
    z*_{1:2} = argmax_{z_{1:2}} p(z_{1:2} | x_{1:2}) = (1, 2)
In this example, sum-product and max-product produce the same best sequence; in general they differ.
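The whole worked example fits in a few lines of code. The sketch below assumes the transition probabilities implicit in the computations above (P(z2=1 | z1=1) = 1/4, P(z2=2 | z1=1) = 3/4, and 1/2, 1/2 from state 2) and reproduces the message values 1/4, 1/8, 1/64, 3/32 and the best sequence (1, 2).

import numpy as np

pi = np.array([0.5, 0.5])                       # initial distribution over states 1, 2
A = np.array([[0.25, 0.75],                     # A[i, j] = P(z' = j+1 | z = i+1)
              [0.50, 0.50]])
B = np.array([[0.50, 0.25, 0.25],               # emission probs over R, G, B
              [0.25, 0.50, 0.25]])
obs = [0, 1]                                    # x1 = R, x2 = G

# Message from leaf factor f1 = P(z1) P(x1 | z1)
msg1 = pi * B[:, obs[0]]
print(msg1)                                     # [0.25, 0.125] = (1/4, 1/8)

# Factor-to-variable message from f2 = P(z2 | z1) P(x2 | z2), with back pointers
vals = A * msg1[:, None] * B[:, obs[1]][None, :]   # vals[i, j] = f2 * incoming message
msg2 = vals.max(axis=0)
back = vals.argmax(axis=0)
print(msg2)                                     # [0.015625, 0.09375] = (1/64, 3/32)

# At the root z2: best value, then backtrack the pointer
z2 = int(msg2.argmax())
z1 = int(back[z2])
print(msg2[z2], (z1 + 1, z2 + 1))               # 3/32 and the sequence (1, 2)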
From Max-Product to Max-Sum
The max-sum algorithm is equivalent to the max-product algorithm, but works in log space to avoid underflow:
    µ_{fs→x}(x) = max_{x_1, ..., x_M} [ log fs(x, x_1, ..., x_M) + Σ_{m=1}^{M} µ_{x_m→fs}(x_m) ]
    µ_{x_m→fs}(x_m) = Σ_{f∈ne(x_m)\fs} µ_{f→x_m}(x_m)
    µ_{x_leaf→f}(x) = 0
    µ_{f_leaf→x}(x) = log f(x)
At the root,
    log p_max = max_x Σ_{f∈ne(x)} µ_{f→x}(x)
The back pointers are the same.
Parameter Learning
Assume the graph structure is given.
Learning in the exponential family: estimate θ from iid data x_1, ..., x_n.
Principle: maximum likelihood.
Distinguish two cases:
fully observed data: all dimensions of x are observed
partially observed data: some dimensions of x are unobserved
Fully Observed Data
    p_θ(x) = exp( θᵀφ(x) − A(θ) )
Given iid data x_1, ..., x_n, the log likelihood is
    ℓ(θ) = (1/n) Σ_{i=1}^{n} log p_θ(x_i) = θᵀ( (1/n) Σ_{i=1}^{n} φ(x_i) ) − A(θ) = θᵀµ̂ − A(θ)
µ̂ ≡ (1/n) Σ_{i=1}^{n} φ(x_i) is the mean parameter of the empirical distribution on x_1, ..., x_n. Clearly µ̂ ∈ M.
Maximum likelihood: θ_ML = arg sup_{θ∈Ω} θᵀµ̂ − A(θ)
The solution is θ_ML = θ(µ̂), the exponential family density whose mean parameter matches µ̂.
When µ̂ ∈ M° (the interior of M) and φ is minimal, there is a unique maximum likelihood solution θ_ML.
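As a concrete special case, for a fully factorized binary model with φ(x) = (x_1, ..., x_m) the moment-matching condition E_θ[φ(x)] = µ̂ has the closed form θ_s = logit(µ̂_s). The sketch below, with synthetic data purely for illustration, shows this pattern; for general graphs θ(µ̂) has no closed form and must be found numerically.

import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 3))           # pretend data x_1..x_n, each in {0,1}^3

mu_hat = X.mean(axis=0)                          # empirical mean parameters (1/n) sum phi(x_i)
theta_ml = np.log(mu_hat / (1.0 - mu_hat))       # moment matching: sigmoid(theta) = mu_hat

# Sanity check: the fitted model's mean parameters match mu_hat
print(mu_hat, 1.0 / (1.0 + np.exp(-theta_ml)))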
Partially Observed Data
Each item is a pair (x, z), where x is observed and z is unobserved.
The full data is (x_1, z_1), ..., (x_n, z_n), but we only observe x_1, ..., x_n.
The incomplete likelihood is ℓ(θ) = (1/n) Σ_{i=1}^{n} log p_θ(x_i), where p_θ(x_i) = ∫ p_θ(x_i, z) dz.
It can be written as ℓ(θ) = (1/n) Σ_{i=1}^{n} A_{x_i}(θ) − A(θ), with a new log partition function of p_θ(z | x_i), one per item:
    A_{x_i}(θ) = log ∫ exp( θᵀφ(x_i, z′) ) dz′
The Expectation-Maximization (EM) algorithm: lower-bound A_{x_i}.
EM as Variational Lower Bound
Mean parameters realizable by any distribution on z while holding x_i fixed:
    M_{x_i} = { µ ∈ R^d | µ = E_p[φ(x_i, z)] for some p }
The variational definition: A_{x_i}(θ) = sup_{µ∈M_{x_i}} θᵀµ − A*_{x_i}(µ)
Trivial variational lower bound: A_{x_i}(θ) ≥ θᵀµ^i − A*_{x_i}(µ^i), ∀ µ^i ∈ M_{x_i}
Lower bound L on the incomplete log likelihood:
    ℓ(θ) = (1/n) Σ_{i=1}^{n} A_{x_i}(θ) − A(θ)
         ≥ (1/n) Σ_{i=1}^{n} [ θᵀµ^i − A*_{x_i}(µ^i) ] − A(θ)
         ≡ L(µ^1, ..., µ^n, θ)
Exact EM: The E-Step
The EM algorithm is coordinate ascent on L(µ^1, ..., µ^n, θ).
In the E-step, maximize over each µ^i:
    µ^i ← arg max_{µ^i∈M_{x_i}} L(µ^1, ..., µ^n, θ)
Equivalently, arg max_{µ^i∈M_{x_i}} θᵀµ^i − A*_{x_i}(µ^i).
This is the variational representation of the mean parameter µ^i(θ) = E_θ[φ(x_i, z)].
The E-step is named after this expectation E_θ[·] under the current parameters θ.
Exact EM: The M-Step
In the M-step, maximize over θ holding the µ's fixed:
    θ ← arg max_{θ∈Ω} L(µ^1, ..., µ^n, θ) = arg max_{θ∈Ω} θᵀµ̂ − A(θ),  where µ̂ = (1/n) Σ_{i=1}^{n} µ^i
The solution θ(µ̂) satisfies E_{θ(µ̂)}[φ(x)] = µ̂.
This is the standard fully observed maximum likelihood problem, hence the name M-step.
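To make the E-step/M-step pattern concrete, here is a generic EM sketch for a two-component Gaussian mixture with unit variances; this particular model and its numbers are an illustration chosen for brevity, not a model from the tutorial. The E-step computes expected sufficient statistics (responsibilities) under the current θ, and the M-step solves the resulting fully observed maximum likelihood (moment-matching) problem.

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

w, m = np.array([0.5, 0.5]), np.array([-1.0, 1.0])      # initial parameters theta
for _ in range(50):
    # E-step: expected sufficient statistics under p_theta(z | x_i),
    # i.e. responsibilities r[i, k] = p(z_i = k | x_i)
    logp = np.log(w) - 0.5 * (x[:, None] - m[None, :]) ** 2
    logp -= logp.max(axis=1, keepdims=True)
    r = np.exp(logp)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: moment matching, a fully observed ML problem with the
    # expected statistics standing in for the unobserved z
    w = r.mean(axis=0)
    m = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

print(w, m)     # roughly (0.4, 0.6) and (-2, 3)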
Variational EM
For loopy graphs the E-step is often intractable: we cannot maximize
    max_{µ^i∈M_{x_i}} θᵀµ^i − A*_{x_i}(µ^i)
Improve but not necessarily maximize: "generalized EM".
The mean field method maximizes
    max_{µ^i∈M_{x_i}(F)} θᵀµ^i − A*_{x_i}(µ^i)
up to a local maximum; recall M_{x_i}(F) is an inner approximation to M_{x_i}.
The mean field E-step leads to generalized EM.
The sum-product algorithm does not lead to generalized EM.
Score-Based Structure Learning
Let ℳ be the set of all allowed candidate features.
Let M ⊆ ℳ be a log-linear model structure:
    P(X | M, θ) = (1/Z) exp( Σ_{i∈M} θ_i f_i(X) )
A score for the model M can be max_θ ln P(Data | M, θ).
The score always improves for larger M, so it needs regularization or a Bayesian treatment.
M and θ are treated separately.
Structure Learning for Gaussian Random Fields
Consider a p-dimensional multivariate Gaussian N(µ, Σ).
The graphical model has p nodes x_1, ..., x_p.
The edge between x_i and x_j is absent if and only if Ω_ij = 0, where Ω = Σ^{-1}.
Equivalently, x_i and x_j are conditionally independent given the other variables.
Example
If we know
    Σ = [  14  −16    4   −2
          −16   32   −8    4
            4   −8    8   −4
           −2    4   −4    5 ]
then
    Ω = Σ^{-1} = [ 0.1667  0.0833  0.0000  0
                   0.0833  0.0833  0.0417  0
                   0.0000  0.0417  0.2500  0.1667
                   0       0       0.1667  0.3333 ]
The corresponding graphical model structure is the chain
    x1 — x2 — x3 — x4
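The example can be reproduced in a few lines: invert Σ and read the edges off the nonzero pattern of Ω (a small sketch; the 1e-8 threshold is just an arbitrary numerical tolerance).

import numpy as np

Sigma = np.array([[14., -16.,  4., -2.],
                  [-16., 32., -8.,  4.],
                  [  4., -8.,  8., -4.],
                  [ -2.,  4., -4.,  5.]])
Omega = np.linalg.inv(Sigma)
print(np.round(Omega, 4))

# An edge (x_i, x_j) is present iff Omega_ij != 0 (up to numerical noise)
p = Sigma.shape[0]
edges = [(i + 1, j + 1) for i in range(p) for j in range(i + 1, p)
         if abs(Omega[i, j]) > 1e-8]
print(edges)    # [(1, 2), (2, 3), (3, 4)]: the chain x1 - x2 - x3 - x4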
Structure Learning for Gaussian Random Fields
Let the data be X^(1), ..., X^(n) ∼ N(µ, Σ).
The log likelihood is (n/2) log |Ω| − (1/2) Σ_{i=1}^{n} (X^(i) − µ)ᵀ Ω (X^(i) − µ).
The maximum likelihood estimate of Σ is the sample covariance
    S = (1/n) Σ_i (X^(i) − X̄)(X^(i) − X̄)ᵀ,
where X̄ is the sample mean.
S^{-1} is not a good estimate of Ω when n is small.
Structure Learning for Gaussian Random Fields
For centered data, minimize a regularized problem instead:
    − log |Ω| + (1/n) Σ_{i=1}^{n} X^(i)ᵀ Ω X^(i) + λ Σ_{i≠j} |Ω_ij|
Known as glasso (the graphical lasso).
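As a usage sketch, scikit-learn ships a graphical lasso solver (assuming scikit-learn is available; any glasso implementation would do). Here we sample from the chain-structured Gaussian of the earlier example and fit a sparse Ω; the regularization weight alpha plays the role of λ and would need tuning in practice.

import numpy as np
from sklearn.covariance import GraphicalLasso   # assumes scikit-learn is installed

rng = np.random.default_rng(0)
Sigma = np.array([[14., -16.,  4., -2.],
                  [-16., 32., -8.,  4.],
                  [  4., -8.,  8., -4.],
                  [ -2.,  4., -4.,  5.]])
X = rng.multivariate_normal(np.zeros(4), Sigma, size=500)

model = GraphicalLasso(alpha=0.05, max_iter=200).fit(X)
# Estimated Omega; entries corresponding to absent edges shrink toward zero
# (how close to zero depends on alpha and the sample size)
print(np.round(model.precision_, 3))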
Summary
Graphical model = joint distribution p(x1, ..., xn)
    Bayesian network or Markov random field
    conditional independence
Inference = p(XQ | XE), in general XQ ∪ XE ⊂ {x1, ..., xn}
    exact, MCMC, variational
If p(x1, ..., xn) is not given, estimate it from data
    parameter and structure learning
References
Koller & Friedman, Probabilistic Graphical Models. MIT Press, 2009.
Wainwright & Jordan, Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 2008.
Bishop, Pattern Recognition and Machine Learning. Springer, 2006.