Graphical Models
Xiaojin Zhu
Department of Computer Sciences, University of Wisconsin–Madison, USA
KDD 2012 Tutorial
Outline
1 Overview
2 Definitions
    Directed Graphical Models
    Undirected Graphical Models
    Factor Graph
3 Inference
    Exact Inference
    Markov Chain Monte Carlo
    Belief Propagation
    Mean Field Algorithm
    Exponential Family and Variational Inference
    Maximizing Problems
4 Parameter Learning
5 Structure Learning
Overview

The envelope quiz
There are two envelopes: one contains a red ball ($100) and a black ball, the other contains two black balls.
You randomly picked an envelope and randomly took out a ball; it was black.
At this point, you are given the option to switch envelopes. Should you?
Random variables E ∈ {1, 0}, B ∈ {r, b}
P(E = 1) = P(E = 0) = 1/2
P(B = r | E = 1) = 1/2, P(B = r | E = 0) = 0
We ask: is P(E = 1 | B = b) ≥ 1/2?
P(E = 1 | B = b) = P(B = b | E = 1) P(E = 1) / P(B = b) = (1/2 × 1/2) / (3/4) = 1/3
Switch.
The graphical model: E → B
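As a quick sanity check, here is a tiny Python sketch I added (not from the slides) that enumerates the four (envelope, ball) outcomes and recovers P(E = 1 | B = b) = 1/3:

```python
# Minimal enumeration check of the envelope quiz (illustrative sketch).
# Envelope E=1 holds {red, black}; envelope E=0 holds {black, black}.

def joint():
    """Yield (E, ball, probability) for every (envelope, ball) outcome."""
    for e, balls in [(1, ["r", "b"]), (0, ["b", "b"])]:
        for ball in balls:
            # P(envelope) * P(ball | envelope) = 1/2 * 1/2
            yield e, ball, 0.5 * 0.5

p_b = sum(p for e, ball, p in joint() if ball == "b")                    # P(B = b) = 3/4
p_e1_and_b = sum(p for e, ball, p in joint() if e == 1 and ball == "b")  # = 1/4
print(p_e1_and_b / p_b)   # P(E = 1 | B = b) = 1/3, so switching is better
```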
Reasoning with uncertainty
The world is reduced to a set of random variables x_1, ..., x_n
  e.g. (x_1, ..., x_{n-1}) a feature vector, x_n ≡ y the class label
Inference: given the joint distribution p(x_1, ..., x_n), compute p(X_Q | X_E) where X_Q ∪ X_E ⊆ {x_1, ..., x_n}
  e.g. Q = {n}, E = {1, ..., n-1}; by the definition of conditional probability,
  p(x_n | x_1, ..., x_{n-1}) = p(x_1, ..., x_{n-1}, x_n) / Σ_v p(x_1, ..., x_{n-1}, x_n = v)
Learning: estimate p(x_1, ..., x_n) from training data X^(1), ..., X^(N), where X^(i) = (x_1^(i), ..., x_n^(i))
It is difficult to reason with uncertainty
The joint distribution p(x_1, ..., x_n):
  exponential naïve storage (2^n for binary r.v.)
  hard to interpret (conditional independence)
Inference p(X_Q | X_E):
  often can't afford to do it by brute force
If p(x_1, ..., x_n) is not given, estimate it from data:
  often can't afford to do it by brute force
Graphical models: efficient representation, inference, and learning on p(x_1, ..., x_n), exactly or approximately
Definitions
Graphical-Model-Nots
Graphical modeling is the study of probabilistic models
Just because there are nodes and edges doesn't mean it's a graphical model
These are not graphical models: neural network, decision tree, network flow, HMM template
Directed Graphical Models
Bayesian Network
Directed graphical models are also called Bayesian networks
A directed graph has nodes X = (x_1, ..., x_n), some of them connected by directed edges x_i → x_j
A cycle is a directed path x_1 → ... → x_k where x_1 = x_k
A directed acyclic graph (DAG) contains no cycles
A Bayesian network on the DAG is a family of distributions satisfying
  {p | p(X) = ∏_i p(x_i | Pa(x_i))}
where Pa(x_i) is the set of parents of x_i.
p(x_i | Pa(x_i)) is the conditional probability distribution (CPD) at x_i
By specifying the CPDs for all i, we specify a particular distribution p(X)
Example: Alarm
Binary variables; the DAG is B → A ← E with A → J and A → M, and CPDs
  P(B) = 0.001, P(E) = 0.002
  P(A | B, E) = 0.95, P(A | B, ~E) = 0.94, P(A | ~B, E) = 0.29, P(A | ~B, ~E) = 0.001
  P(J | A) = 0.9, P(J | ~A) = 0.05
  P(M | A) = 0.7, P(M | ~A) = 0.01
P(B, ~E, A, J, ~M) = P(B) P(~E) P(A | B, ~E) P(J | A) P(~M | A)
                   = 0.001 × (1 − 0.002) × 0.94 × 0.9 × (1 − 0.7) ≈ 0.000253
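To make the factorization concrete, here is a small Python sketch I added (not from the slides) that multiplies the CPD entries to reproduce the joint probability above:

```python
# Joint probability of one assignment in the Alarm network, using the BN factorization
# p(B, E, A, J, M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A).  Illustrative sketch.

P_B = {1: 0.001, 0: 0.999}
P_E = {1: 0.002, 0: 0.998}
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J = {1: 0.9, 0: 0.05}                                          # P(J=1 | A)
P_M = {1: 0.7, 0: 0.01}                                          # P(M=1 | A)

def bernoulli(p_one, value):
    """P(value) for a binary variable with P(value=1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(B, E, A, J, M):
    return (P_B[B] * P_E[E]
            * bernoulli(P_A[(B, E)], A)
            * bernoulli(P_J[A], J)
            * bernoulli(P_M[A], M))

print(joint(B=1, E=0, A=1, J=1, M=0))   # ~0.000253
```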
Example: Naive Bayes
(figure: y with children x_1, ..., x_d, and the equivalent plate representation)
p(y, x_1, ..., x_d) = p(y) ∏_{i=1}^d p(x_i | y)
Used extensively in natural language processing
Plate representation on the right
No Causality Whatsoever
Network 1: A → B with P(A) = a, P(B|A) = b, P(B|~A) = c
Network 2: B → A with P(B) = ab + (1−a)c, P(A|B) = ab / (ab + (1−a)c), P(A|~B) = a(1−b) / (1 − ab − (1−a)c)
The two BNs are equivalent in all respects
Bayesian networks imply no causality at all
They only encode the joint probability distribution (hence correlation)
However, people tend to design BNs based on causal relations
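A quick numeric check (my own sketch, with an arbitrary choice of a, b, c) confirms that the two parameterizations define the same joint distribution:

```python
# Check that the A->B and B->A parameterizations give the same joint P(A, B).
# Illustrative sketch with arbitrary numbers a, b, c.

a, b, c = 0.3, 0.8, 0.1   # P(A)=a, P(B|A)=b, P(B|~A)=c

# Reversed-network parameters, as on the slide
pB    = a * b + (1 - a) * c
pA_B  = a * b / pB
pA_nB = a * (1 - b) / (1 - pB)

for A in (0, 1):
    for B in (0, 1):
        pB_given_A = b if A else c
        # forward network: P(A) P(B|A)
        p1 = (a if A else 1 - a) * (pB_given_A if B else 1 - pB_given_A)
        # reversed network: P(B) P(A|B)
        pA_given_B = pA_B if B else pA_nB
        p2 = (pB if B else 1 - pB) * (pA_given_B if A else 1 - pA_given_B)
        print(A, B, round(p1, 6), round(p2, 6))   # the two columns match
```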
Example: Latent Dirichlet Allocation (LDA)
(plate diagram: hyperparameters α, β; topic distributions φ over T topics; per-document θ over D documents; topic z and word w for each of the N_d word positions)
A generative model for p(φ, θ, z, w | α, β):
  For each topic t: φ_t ∼ Dirichlet(β)
  For each document d: θ ∼ Dirichlet(α)
  For each word position in d: topic z ∼ Multinomial(θ), word w ∼ Multinomial(φ_z)
Inference goals: p(z | w, α, β), argmax_{φ,θ} p(φ, θ | w, α, β)
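The generative process can be written down directly; below is a minimal numpy sketch of it that I added (with made-up sizes T, D, N_d, V and hyperparameters, not code from the tutorial):

```python
# Minimal sketch of the LDA generative process with made-up sizes.
import numpy as np

rng = np.random.default_rng(0)
T, D, V = 3, 5, 20            # topics, documents, vocabulary size (arbitrary)
alpha, beta = 0.5, 0.1        # Dirichlet hyperparameters (arbitrary)
N_d = 10                      # words per document (fixed here for simplicity)

phi = rng.dirichlet([beta] * V, size=T)      # phi_t ~ Dirichlet(beta), one per topic
docs = []
for d in range(D):
    theta = rng.dirichlet([alpha] * T)       # theta ~ Dirichlet(alpha), one per document
    words = []
    for _ in range(N_d):
        z = rng.choice(T, p=theta)           # topic z ~ Multinomial(theta)
        w = rng.choice(V, p=phi[z])          # word  w ~ Multinomial(phi_z)
        words.append(w)
    docs.append(words)
print(docs[0])                               # word ids of the first generated document
```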
Some Topics by LDA on the Wish Corpus
(figure: top words under p(word | topic) for topics labeled "troops", "election", "love")
Conditional Independence
Two r.v.s A, B are independent if P(A, B) = P(A)P(B), or equivalently P(A | B) = P(A)
Two r.v.s A, B are conditionally independent given C if P(A, B | C) = P(A | C)P(B | C), or equivalently P(A | B, C) = P(A | C)
This extends to groups of r.v.s
Conditional independence in a BN is precisely specified by d-separation ("directed separation")
d-Separation Case 1: Tail-to-Tail
(figure: A ← C → B; observed nodes are shaded)
A, B in general dependent
A, B conditionally independent given C (observed nodes are shaded)
An observed C is a tail-to-tail node; it blocks the undirected path A-B
d-Separation Case 2: Head-to-Tail
(figure: A → C → B)
A, B in general dependent
A, B conditionally independent given C
An observed C is a head-to-tail node; it blocks the path A-B
d-Separation Case 3: Head-to-Head
(figure: A → C ← B)
A, B in general independent
A, B conditionally dependent given C, or given any of C's descendants
An observed C is a head-to-head node; it unblocks the path A-B
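The head-to-head case ("explaining away") is easy to verify numerically. The sketch below is my own illustration on an arbitrary A → C ← B network with made-up probabilities:

```python
# Explaining away in a head-to-head network A -> C <- B (illustrative numbers).
from itertools import product

pA, pB = 0.1, 0.1
pC = {(1, 1): 0.99, (1, 0): 0.9, (0, 1): 0.9, (0, 0): 0.01}   # P(C=1 | A, B)

def joint(A, B, C):
    p = (pA if A else 1 - pA) * (pB if B else 1 - pB)
    return p * (pC[(A, B)] if C else 1 - pC[(A, B)])

def prob(pred):
    return sum(joint(*x) for x in product((0, 1), repeat=3) if pred(*x))

# Marginally independent: P(A=1 | B=1) equals P(A=1) = 0.1
print(prob(lambda A, B, C: A and B) / prob(lambda A, B, C: B))               # 0.1
# Once C is observed, A and B become dependent:
print(prob(lambda A, B, C: A and C) / prob(lambda A, B, C: C))               # ~0.51
print(prob(lambda A, B, C: A and B and C) / prob(lambda A, B, C: B and C))   # ~0.11: B=1 explains C away
```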
d-Separation
Any groups of nodes A and B are conditionally independent given another group C if all undirected paths from any node in A to any node in B are blocked
A path is blocked if it includes a node x such that either
  the path is head-to-tail or tail-to-tail at x and x ∈ C, or
  the path is head-to-head at x, and neither x nor any of its descendants is in C.
d-Separation Example 1
(figure: a DAG over nodes A, E, F, B, C; C is a descendant of E and is observed)
The undirected path from A to B is unblocked at E (because of the observed descendant C), and is not blocked at F
A, B dependent given C
d-Separation Example 2
(figure: the same DAG, now with F observed instead)
The path from A to B is blocked both at E and at F
A, B conditionally independent given F
Undirected Graphical Models
Markov Random Fields
Undirected graphical models are also called Markov Random Fields
The efficiency of directed graphical models (acyclic graph, locally normalized CPDs) also makes them restrictive
A clique C in an undirected graph is a fully connected set of nodes (note: full of loops!)
Define a nonnegative potential function ψ_C : X_C ↦ R^+
An undirected graphical model is a family of distributions satisfying
  {p | p(X) = (1/Z) ∏_C ψ_C(X_C)}
Z = ∫ ∏_C ψ_C(X_C) dX is the partition function
Example: A Tiny Markov Random Field
(figure: x_1 and x_2 joined by an edge, forming the single clique C)
x_1, x_2 ∈ {−1, 1}
A single clique ψ_C(x_1, x_2) = e^{a x_1 x_2}
p(x_1, x_2) = (1/Z) e^{a x_1 x_2}
Z = e^a + e^{−a} + e^{−a} + e^a = 2e^a + 2e^{−a}
p(1, 1) = p(−1, −1) = e^a / (2e^a + 2e^{−a})
p(−1, 1) = p(1, −1) = e^{−a} / (2e^a + 2e^{−a})
When the parameter a > 0, homogeneous configurations (x_1 = x_2) are favored
When the parameter a < 0, inhomogeneous configurations (x_1 ≠ x_2) are favored
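A few lines of Python (my own sketch, with an arbitrary value for a) reproduce these probabilities by brute-force normalization:

```python
# Brute-force normalization of the tiny two-node MRF (illustrative sketch).
import math

a = 1.0   # clique parameter (arbitrary)
psi = {(x1, x2): math.exp(a * x1 * x2) for x1 in (-1, 1) for x2 in (-1, 1)}
Z = sum(psi.values())                      # 2*e^a + 2*e^-a
p = {xy: v / Z for xy, v in psi.items()}
print(p[(1, 1)], p[(-1, 1)])               # e^a/Z vs e^-a/Z; with a > 0 agreement is favored
```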
Log Linear Models
Real-valued feature functions f_1(X), ..., f_k(X)
Real-valued weights w_1, ..., w_k
p(X) = (1/Z) exp( − Σ_{i=1}^k w_i f_i(X) )
Example: The Ising Model
(figure: grid nodes x_s, x_t with node parameters θ_s and edge parameters θ_st)
This is an undirected model with x ∈ {0, 1}:
  p_θ(x) = (1/Z) exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t )
As a log linear model: f_s(X) = x_s, f_st(X) = x_s x_t, with w_s = −θ_s, w_st = −θ_st
Example: Image Denoising
(figure: a noisy binary image Y; the goal is argmax_X P(X | Y))
p_θ(X | Y) = (1/Z) exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t )
where θ_s = c if y_s = 1 and θ_s = −c if y_s = 0
[From Bishop PRML]
Example: Gaussian Random Field
p(X) = N(X; µ, Σ) = 1 / ((2π)^{n/2} |Σ|^{1/2}) exp( −(1/2)(X − µ)^⊤ Σ^{−1} (X − µ) )
Multivariate Gaussian
The n × n covariance matrix Σ is positive semi-definite
Let Ω = Σ^{−1} be the precision matrix
x_i, x_j are conditionally independent given all other variables, if and only if Ω_ij = 0
When Ω_ij ≠ 0, there is an edge between x_i, x_j
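The precision-matrix reading of the graph is easy to see numerically. The sketch below is my own illustration using an arbitrary 3-node chain x_1 - x_2 - x_3, so Ω_13 = 0 while the marginal covariance of x_1 and x_3 is not zero:

```python
# Zeros in the precision matrix <=> conditional independence (Gaussian case).
# Illustrative sketch: a 3-variable chain x1 - x2 - x3, so Omega[0, 2] = 0.
import numpy as np

Omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])     # precision matrix (chain structure)
Sigma = np.linalg.inv(Omega)               # covariance matrix
print(Sigma[0, 2])                         # nonzero: x1 and x3 are marginally dependent

# Partial correlation of x1, x3 given the rest, read off the precision matrix:
rho_13_given_rest = -Omega[0, 2] / np.sqrt(Omega[0, 0] * Omega[2, 2])
print(rho_13_given_rest)                   # 0.0: conditionally independent given x2
```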
Conditional Independence in Markov Random Fields
Two groups of variables A, B are conditionally independent given another group C if A, B become disconnected by removing C and all edges involving C
(figure: a region C separating regions A and B in the graph)
Factor Graph
For both directed and undirected graphical models
Bipartite: edges between a variable node and a factor node
Factors represent computation
(figures: an undirected clique on A, B, C with potential ψ(A,B,C), and the directed model P(A)P(B)P(C|A,B), each redrawn as variable nodes A, B, C connected to a factor node f)
Inference
Exact Inference
Inference by Enumeration
Let X = (X_Q, X_E, X_O) for query, evidence, and other variables.
Infer P(X_Q | X_E)
By definition
  P(X_Q | X_E) = P(X_Q, X_E) / P(X_E) = Σ_{X_O} P(X_Q, X_E, X_O) / Σ_{X_Q, X_O} P(X_Q, X_E, X_O)
This sums an exponential number of terms: with k variables in X_O each taking r values, there are r^k terms
Details of the summing problem
There are a bunch of "other" variables x_1, ..., x_k
We sum over the r values each variable can take: Σ_{x_i = v_1}^{v_r}
This is exponential (r^k): Σ_{x_1} ... Σ_{x_k}
We want Σ_{x_1 ... x_k} p(X)
For a graphical model, the joint probability factors: p(X) = ∏_{j=1}^m f_j(X_(j))
Each factor f_j operates on a subset X_(j) ⊆ X
Idea to save computation: dynamic programming
Eliminating a Variable
Rearrange the factors in Σ_{x_1 ... x_k} f_1^− ... f_l^− f_{l+1}^+ ... f_m^+ by whether x_1 ∈ X_(j)
Obviously equivalent: Σ_{x_2 ... x_k} f_1^− ... f_l^− ( Σ_{x_1} f_{l+1}^+ ... f_m^+ )
Introduce a new factor f_{m+1}^− = Σ_{x_1} f_{l+1}^+ ... f_m^+
f_{m+1}^− contains the union of the variables in f_{l+1}^+ ... f_m^+, except x_1
In fact, x_1 disappears altogether in Σ_{x_2 ... x_k} f_1^− ... f_l^− f_{m+1}^−
Dynamic programming: compute f_{m+1}^− once, use it thereafter
Hope: f_{m+1}^− contains very few variables
Recursively eliminate the other variables in turn
Example: Chain Graph
A → B → C → D
Binary variables
Say we want P(D) = Σ_{A,B,C} P(A) P(B|A) P(C|B) P(D|C)
Let f_1(A) = P(A). Note f_1 is an array (from the CPD) of size two: P(A=0), P(A=1)
f_2(A, B) is a table of size four: P(B=0|A=0), P(B=0|A=1), P(B=1|A=0), P(B=1|A=1)
Σ_{A,B,C} f_1(A) f_2(A,B) f_3(B,C) f_4(C,D) = Σ_{B,C} f_3(B,C) f_4(C,D) ( Σ_A f_1(A) f_2(A,B) )
f_1(A) f_2(A,B) is an array of size four (match A values):
  P(A=0)P(B=0|A=0), P(A=1)P(B=0|A=1), P(A=0)P(B=1|A=0), P(A=1)P(B=1|A=1)
f_5(B) ≡ Σ_A f_1(A) f_2(A,B) reduces to an array of size two:
  P(A=0)P(B=0|A=0) + P(A=1)P(B=0|A=1), P(A=0)P(B=1|A=0) + P(A=1)P(B=1|A=1)
For this example, f_5(B) happens to be P(B)
Σ_{B,C} f_3(B,C) f_4(C,D) f_5(B) = Σ_C f_4(C,D) ( Σ_B f_3(B,C) f_5(B) ), and so on
In the end, f_7(D) is represented by P(D=0), P(D=1)
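The eliminations above translate directly into a few lines of numpy. The following is my own sketch with arbitrary CPT numbers; it computes P(D) by summing out A, then B, then C, and checks against enumeration:

```python
# Variable elimination on the chain A -> B -> C -> D with arbitrary CPTs (sketch).
import numpy as np

f1 = np.array([0.6, 0.4])                        # f1[A]    = P(A)
f2 = np.array([[0.7, 0.3], [0.2, 0.8]])          # f2[A, B] = P(B|A)
f3 = np.array([[0.9, 0.1], [0.5, 0.5]])          # f3[B, C] = P(C|B)
f4 = np.array([[0.6, 0.4], [0.3, 0.7]])          # f4[C, D] = P(D|C)

f5 = np.einsum('a,ab->b', f1, f2)                # sum out A: f5[B] = P(B)
f6 = np.einsum('b,bc->c', f5, f3)                # sum out B: f6[C] = P(C)
f7 = np.einsum('c,cd->d', f6, f4)                # sum out C: f7[D] = P(D)
print(f7, f7.sum())                              # P(D=0), P(D=1); sums to 1

# Brute-force enumeration agrees:
joint = np.einsum('a,ab,bc,cd->abcd', f1, f2, f3, f4)
print(joint.sum(axis=(0, 1, 2)))
```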
Computation for P(D) by variable elimination: 12 ×, 6 +
Enumeration: 48 ×, 14 +
The saving depends on the elimination order. Finding the optimal order is NP-hard; there are heuristic methods.
The saving depends more critically on the graph structure (tree width); inference may still be intractable
Handling Evidence
For evidence variables X_E, simply plug in their value e
Eliminate the variables X_O = X − X_E − X_Q
The final factor is the joint f(X_Q) = P(X_Q, X_E = e)
Normalize to answer the query:
  P(X_Q | X_E = e) = f(X_Q) / Σ_{X_Q} f(X_Q)
Summary: Exact Inference
Common methods:
  Enumeration
  Variable elimination
  Not covered: junction tree (aka clique tree)
Exact, but often intractable for large graphs
Markov Chain Monte Carlo
Inference by Monte Carlo
Consider the inference problem p(X_Q = c_Q | X_E) where X_Q ∪ X_E ⊆ {x_1, ..., x_n}:
  p(X_Q = c_Q | X_E) = ∫ 1_(x_Q = c_Q) p(x_Q | X_E) dx_Q
If we can draw samples x_Q^(1), ..., x_Q^(m) ∼ p(x_Q | X_E), an unbiased estimator is
  p(X_Q = c_Q | X_E) ≈ (1/m) Σ_{i=1}^m 1_(x_Q^(i) = c_Q)
The variance of the estimator decreases as O(1/m)
Inference reduces to sampling from p(x_Q | X_E)
We discuss two methods: forward sampling and Gibbs sampling
Forward Sampling: Example
(the Alarm network with the CPDs given earlier)
To generate a sample X = (B, E, A, J, M):
  1 Sample B ∼ Ber(0.001): draw r ∼ U(0, 1); if r < 0.001 then B = 1 else B = 0
  2 Sample E ∼ Ber(0.002)
  3 If B = 1 and E = 1, sample A ∼ Ber(0.95), and so on
  4 If A = 1 sample J ∼ Ber(0.9) else J ∼ Ber(0.05)
  5 If A = 1 sample M ∼ Ber(0.7) else M ∼ Ber(0.01)
Inference with Forward Sampling
Say the inference task is P(B = 1 | E = 1, M = 1)
Throw away all samples except those with (E = 1, M = 1):
  p(B = 1 | E = 1, M = 1) ≈ (1/m) Σ_{i=1}^m 1_(B^(i) = 1)
where m is the number of surviving samples
Can be highly inefficient (note P(E = 1) is tiny)
Does not work for Markov Random Fields
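Here is a short Python sketch of my own (not from the slides) of forward sampling plus rejection on the Alarm network; as noted, almost all samples are thrown away because E = 1 is rare:

```python
# Forward sampling with rejection for P(B=1 | E=1, M=1) on the Alarm network (sketch).
import random

def sample_alarm(rng):
    B = rng.random() < 0.001
    E = rng.random() < 0.002
    pA = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}[(B, E)]
    A = rng.random() < pA
    J = rng.random() < (0.9 if A else 0.05)
    M = rng.random() < (0.7 if A else 0.01)
    return B, E, A, J, M

rng = random.Random(0)
kept, hits = 0, 0
for _ in range(2_000_000):                 # many samples needed: P(E=1) is tiny
    B, E, A, J, M = sample_alarm(rng)
    if E and M:                            # keep only samples matching the evidence
        kept += 1
        hits += B
print(kept, hits / max(kept, 1))           # estimate of P(B=1 | E=1, M=1)
```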
Gibbs Sampler: Example P(B = 1 | E = 1, M = 1)
The Gibbs sampler is a Markov Chain Monte Carlo (MCMC) method.
Directly sample from p(x_Q | X_E)
Works for both directed and undirected graphical models
Initialization: fix the evidence; randomly set the other variables
  e.g. X^(0) = (B = 0, E = 1, A = 0, J = 0, M = 1)
(figure: the Alarm network with E = 1 and M = 1 clamped)
Gibbs Update
For each non-evidence variable x_i, fixing all other nodes X_{−i}, resample its value x_i ∼ P(x_i | X_{−i})
This is equivalent to x_i ∼ P(x_i | MarkovBlanket(x_i))
For a Bayesian network, MarkovBlanket(x_i) includes x_i's parents, spouses, and children:
  P(x_i | MarkovBlanket(x_i)) ∝ P(x_i | Pa(x_i)) ∏_{y ∈ C(x_i)} P(y | Pa(y))
where Pa(x) are the parents of x, and C(x) the children of x.
For many graphical models the Markov blanket is small.
For example, B ∼ P(B | E = 1, A = 0) ∝ P(B) P(A = 0 | B, E = 1)
Inference
Markov Chain Monte Carlo
Gibbs Update Say we sampled B = 1. Then X (1) = (B = 1, E = 1, A = 0, J = 0, M = 1)
Inference
Markov Chain Monte Carlo
Gibbs Update Say we sampled B = 1. Then X (1) = (B = 1, E = 1, A = 0, J = 0, M = 1) Starting from X (1) , sample A ∼ P (A | B = 1, E = 1, J = 0, M = 1) to get X (2)
Inference
Markov Chain Monte Carlo
Gibbs Update Say we sampled B = 1. Then X (1) = (B = 1, E = 1, A = 0, J = 0, M = 1) Starting from X (1) , sample A ∼ P (A | B = 1, E = 1, J = 0, M = 1) to get X (2) Move on to J, then repeat B, A, J, B, A, J . . .
Inference
Markov Chain Monte Carlo
Gibbs Update
Say we sampled B = 1. Then X^(1) = (B = 1, E = 1, A = 0, J = 0, M = 1).
Starting from X^(1), sample A ∼ P(A | B = 1, E = 1, J = 0, M = 1) to get X^(2).
Move on to J, then repeat B, A, J, B, A, J, ...
Keep all later samples. P(B = 1 | E = 1, M = 1) is the fraction of samples with B = 1.
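This update sequence is small enough to run end to end. The following Python sketch implements the Gibbs sampler for the burglary-alarm network using the CPTs from the figure; the function names, the burn-in length, and the number of sweeps are illustrative choices rather than anything prescribed by the tutorial.

import random

# CPTs from the burglary-alarm network (E and M are clamped as evidence).
P_B = 0.001
P_E = 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
P_J = {1: 0.9, 0: 0.05}    # P(J=1 | A)
P_M = {1: 0.7, 0: 0.01}    # P(M=1 | A)

def bernoulli(p):
    return 1 if random.random() < p else 0

def resample_B(state):
    # B's Markov blanket: its child A and A's other parent E.
    w1 = P_B * (P_A[(1, state['E'])] if state['A'] else 1 - P_A[(1, state['E'])])
    w0 = (1 - P_B) * (P_A[(0, state['E'])] if state['A'] else 1 - P_A[(0, state['E'])])
    state['B'] = bernoulli(w1 / (w1 + w0))

def resample_A(state):
    # A's Markov blanket: parents B, E and children J, M.
    def weight(a):
        pa = P_A[(state['B'], state['E'])]
        w = pa if a else 1 - pa
        w *= P_J[a] if state['J'] else 1 - P_J[a]
        w *= P_M[a] if state['M'] else 1 - P_M[a]
        return w
    w1, w0 = weight(1), weight(0)
    state['A'] = bernoulli(w1 / (w1 + w0))

def resample_J(state):
    # J has no children, so its conditional is just P(J | A).
    state['J'] = bernoulli(P_J[state['A']])

def gibbs(n_sweeps=50000, burn_in=1000, seed=0):
    random.seed(seed)
    # Fix evidence E=1, M=1; initialize the rest arbitrarily, e.g. X^(0).
    state = {'B': 0, 'E': 1, 'A': 0, 'J': 0, 'M': 1}
    count_B1, kept = 0, 0
    for sweep in range(n_sweeps):
        resample_B(state)
        resample_A(state)
        resample_J(state)
        if sweep >= burn_in:
            count_B1 += state['B']
            kept += 1
    return count_B1 / kept

print("Estimated P(B=1 | E=1, M=1) ≈", gibbs())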
Markov Chain Monte Carlo
Inference
Gibbs Example 2: The Ising Model
[Figure: node x_s with its four grid neighbors A, B, C, D]
This is an undirected model with x ∈ {0, 1}:
  p_θ(x) = (1/Z) exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t )
Markov Chain Monte Carlo
Inference
Gibbs Example 2: The Ising Model
The Markov blanket of x_s is {A, B, C, D}.
Markov Chain Monte Carlo
Inference
Gibbs Example 2: The Ising Model
The Markov blanket of x_s is {A, B, C, D}.
In general, for undirected graphical models p(x_s | x_{-s}) = p(x_s | x_{N(s)}), where N(s) is the set of neighbors of s.
Markov Chain Monte Carlo
Inference
Gibbs Example 2: The Ising Model
The Markov blanket of x_s is {A, B, C, D}.
In general, for undirected graphical models p(x_s | x_{-s}) = p(x_s | x_{N(s)}), where N(s) is the set of neighbors of s.
The Gibbs update is
  p(x_s = 1 | x_{N(s)}) = 1 / ( exp(−(θ_s + ∑_{t∈N(s)} θ_st x_t)) + 1 )
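As a concrete companion to the update formula, here is a minimal Python sketch of Gibbs sampling on a grid Ising model; the grid size, the parameter values θ_s and θ_st, and the number of sweeps are made-up illustrative settings.

import math
import random

def gibbs_ising(n=10, theta_s=0.0, theta_st=0.5, sweeps=200, seed=0):
    """Gibbs sampling for an n-by-n grid Ising model with x_s in {0, 1}."""
    random.seed(seed)
    x = [[random.randint(0, 1) for _ in range(n)] for _ in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            for j in range(n):
                # Sum over the Markov blanket: the grid neighbors of (i, j).
                nb = 0
                if i > 0:     nb += x[i - 1][j]
                if i < n - 1: nb += x[i + 1][j]
                if j > 0:     nb += x[i][j - 1]
                if j < n - 1: nb += x[i][j + 1]
                # p(x_s = 1 | x_N(s)) = 1 / (exp(-(theta_s + sum_t theta_st x_t)) + 1)
                p1 = 1.0 / (math.exp(-(theta_s + theta_st * nb)) + 1.0)
                x[i][j] = 1 if random.random() < p1 else 0
    return x

sample = gibbs_ising()
print("fraction of 1s in the final sample:", sum(map(sum, sample)) / (10 * 10))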
Inference
Markov Chain Monte Carlo
Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T (X 0 | X)
(KDD 2012 Tutorial)
Graphical Models
53 / 131
Inference
Markov Chain Monte Carlo
Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T (X 0 | X) Certain Markov chains have a stationary distribution π such that π = Tπ
(KDD 2012 Tutorial)
Graphical Models
53 / 131
Inference
Markov Chain Monte Carlo
Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T (X 0 | X) Certain Markov chains have a stationary distribution π such that π = Tπ Gibbs sampler is such a Markov chain with Ti ((X−i , x0i ) | (X−i , xi )) = p(x0i | X−i ), and stationary distribution p(xQ | XE )
(KDD 2012 Tutorial)
Graphical Models
53 / 131
Inference
Markov Chain Monte Carlo
Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T (X 0 | X) Certain Markov chains have a stationary distribution π such that π = Tπ Gibbs sampler is such a Markov chain with Ti ((X−i , x0i ) | (X−i , xi )) = p(x0i | X−i ), and stationary distribution p(xQ | XE ) But it takes time for the chain to reach stationary distribution (mix)
(KDD 2012 Tutorial)
Graphical Models
53 / 131
Inference
Markov Chain Monte Carlo
Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T (X 0 | X) Certain Markov chains have a stationary distribution π such that π = Tπ Gibbs sampler is such a Markov chain with Ti ((X−i , x0i ) | (X−i , xi )) = p(x0i | X−i ), and stationary distribution p(xQ | XE ) But it takes time for the chain to reach stationary distribution (mix) Can be difficult to assert mixing
(KDD 2012 Tutorial)
Graphical Models
53 / 131
Inference
Markov Chain Monte Carlo
Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T (X 0 | X) Certain Markov chains have a stationary distribution π such that π = Tπ Gibbs sampler is such a Markov chain with Ti ((X−i , x0i ) | (X−i , xi )) = p(x0i | X−i ), and stationary distribution p(xQ | XE ) But it takes time for the chain to reach stationary distribution (mix) Can be difficult to assert mixing In practice “burn in”: discard X (0) , . . . , X (T )
(KDD 2012 Tutorial)
Graphical Models
53 / 131
Inference
Markov Chain Monte Carlo
Gibbs Sampling as a Markov Chain
A Markov chain is defined by a transition matrix T(X′ | X).
Certain Markov chains have a stationary distribution π such that π = Tπ.
The Gibbs sampler is such a Markov chain, with T_i((X_{-i}, x′_i) | (X_{-i}, x_i)) = p(x′_i | X_{-i}) and stationary distribution p(x_Q | X_E).
But it takes time for the chain to reach its stationary distribution (to mix).
It can be difficult to assess mixing.
In practice, "burn-in": discard X^(0), ..., X^(T).
Use all of X^(T+1), ... for inference (they are correlated); do not thin.
(KDD 2012 Tutorial)
Graphical Models
53 / 131
Inference
Markov Chain Monte Carlo
Collapsed Gibbs Sampling In general, Ep [f (X)] ≈
(KDD 2012 Tutorial)
1 m
Pm
i=1 f (X
(i) )
Graphical Models
if X (i) ∼ p
54 / 131
Inference
Markov Chain Monte Carlo
Collapsed Gibbs Sampling In general, Ep [f (X)] ≈
1 m
Pm
i=1 f (X
(i) )
if X (i) ∼ p
Sometimes X = (Y, Z) where Z has closed-form operations
(KDD 2012 Tutorial)
Graphical Models
54 / 131
Inference
Markov Chain Monte Carlo
Collapsed Gibbs Sampling In general, Ep [f (X)] ≈
1 m
Pm
i=1 f (X
(i) )
if X (i) ∼ p
Sometimes X = (Y, Z) where Z has closed-form operations If so, Ep [f (X)] = Ep(Y ) Ep(Z|Y ) [f (Y, Z)] m 1 X ≈ Ep(Z|Y (i) ) [f (Y (i) , Z)] m i=1
if Y (i) ∼ p(Y )
(KDD 2012 Tutorial)
Graphical Models
54 / 131
Inference
Markov Chain Monte Carlo
Collapsed Gibbs Sampling In general, Ep [f (X)] ≈
1 m
Pm
i=1 f (X
(i) )
if X (i) ∼ p
Sometimes X = (Y, Z) where Z has closed-form operations If so, Ep [f (X)] = Ep(Y ) Ep(Z|Y ) [f (Y, Z)] m 1 X ≈ Ep(Z|Y (i) ) [f (Y (i) , Z)] m i=1
if Y (i) ∼ p(Y ) No need to sample Z: it is collapsed
(KDD 2012 Tutorial)
Graphical Models
54 / 131
Inference
Markov Chain Monte Carlo
Collapsed Gibbs Sampling In general, Ep [f (X)] ≈
1 m
Pm
i=1 f (X
(i) )
if X (i) ∼ p
Sometimes X = (Y, Z) where Z has closed-form operations If so, Ep [f (X)] = Ep(Y ) Ep(Z|Y ) [f (Y, Z)] m 1 X ≈ Ep(Z|Y (i) ) [f (Y (i) , Z)] m i=1
if Y (i) ∼ p(Y ) No need to sample Z: it is collapsed Collapsed Gibbs sampler Ti ((Y−i , yi0 ) | (Y−i , yi )) = p(yi0 | Y−i )
(KDD 2012 Tutorial)
Graphical Models
54 / 131
Inference
Markov Chain Monte Carlo
Collapsed Gibbs Sampling
In general, E_p[f(X)] ≈ (1/m) ∑_{i=1}^m f(X^(i)) if X^(i) ∼ p.
Sometimes X = (Y, Z) where Z has closed-form operations.
If so,
  E_p[f(X)] = E_{p(Y)} E_{p(Z|Y)}[f(Y, Z)] ≈ (1/m) ∑_{i=1}^m E_{p(Z|Y^(i))}[f(Y^(i), Z)]   if Y^(i) ∼ p(Y)
No need to sample Z: it is collapsed.
Collapsed Gibbs sampler: T_i((Y_{-i}, y′_i) | (Y_{-i}, y_i)) = p(y′_i | Y_{-i})
Note p(y′_i | Y_{-i}) = ∫ p(y′_i, Z | Y_{-i}) dZ
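A toy Python sketch (not from the tutorial) makes the collapsing idea concrete: take X = (Y, Z) with Y ∼ Bernoulli(0.3) and Z | Y ∼ Bernoulli(0.2 + 0.6Y), and estimate E[Z]. Since E[Z | Y] is available in closed form, Z never needs to be sampled; the collapsed estimator uses the closed-form conditional mean instead of a sampled Z and typically has lower variance.

import random

random.seed(0)
p_y = 0.3
def p_z_given_y(y):            # closed-form conditional, so Z can be collapsed
    return 0.2 + 0.6 * y

m = 100000
naive, collapsed = 0.0, 0.0
for _ in range(m):
    y = 1 if random.random() < p_y else 0
    z = 1 if random.random() < p_z_given_y(y) else 0
    naive += z                       # uses a sampled Z
    collapsed += p_z_given_y(y)      # uses E[Z | Y] instead of a sample of Z

print("naive MC estimate of E[Z]:    ", naive / m)
print("collapsed (Rao-Blackwellized):", collapsed / m)
print("exact value:                  ", p_y * 0.8 + (1 - p_y) * 0.2)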
Markov Chain Monte Carlo
Inference
Example: Collapsed Gibbs Sampling for LDA
[Figure: LDA plate diagram with topic-word distributions φ (T plates, prior β), document-topic distributions θ (D plates, prior α), topic assignments z, and words w (N_d plates)]
Collapse θ, φ; Gibbs update:
  P(z_i = j | z_{-i}, w) ∝ [ (n^(w_i)_{-i,j} + β) / (n^(·)_{-i,j} + Wβ) ] · [ (n^(d_i)_{-i,j} + α) / (n^(d_i)_{-i,·} + Tα) ]
n^(w_i)_{-i,j}: number of times word w_i has been assigned to topic j, excluding the current position
Inference
Markov Chain Monte Carlo
Example: Collapsed Gibbs Sampling for LDA Collapse θ, φ, Gibbs update: (w )
P (zi = j | z−i , w) ∝
(d )
i i n−i,j + βn−i,j +α
(·)
(d )
i n−i,j + W βn−i,· + Tα
(w )
i n−i,j : number of times word wi has been assigned to topic j, excluding the current position
(d )
i n−i,j : number of times a word from document di has been assigned to topic j, excluding the current position
(KDD 2012 Tutorial)
Graphical Models
55 / 131
Inference
Markov Chain Monte Carlo
Example: Collapsed Gibbs Sampling for LDA Collapse θ, φ, Gibbs update: (w )
P (zi = j | z−i , w) ∝
(d )
i i n−i,j + βn−i,j +α
(·)
(d )
i n−i,j + W βn−i,· + Tα
(w )
i n−i,j : number of times word wi has been assigned to topic j, excluding the current position
(d )
i n−i,j : number of times a word from document di has been assigned to topic j, excluding the current position
(·)
n−i,j : number of times any word has been assigned to topic j, excluding the current position
(KDD 2012 Tutorial)
Graphical Models
55 / 131
Inference
Markov Chain Monte Carlo
Example: Collapsed Gibbs Sampling for LDA
Collapse θ, φ; Gibbs update:
  P(z_i = j | z_{-i}, w) ∝ [ (n^(w_i)_{-i,j} + β) / (n^(·)_{-i,j} + Wβ) ] · [ (n^(d_i)_{-i,j} + α) / (n^(d_i)_{-i,·} + Tα) ]
n^(w_i)_{-i,j}: number of times word w_i has been assigned to topic j, excluding the current position
n^(d_i)_{-i,j}: number of times a word from document d_i has been assigned to topic j, excluding the current position
n^(·)_{-i,j}: number of times any word has been assigned to topic j, excluding the current position
n^(d_i)_{-i,·}: length of document d_i, excluding the current position
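A minimal Python sketch of this collapsed Gibbs sampler is given below. The count arrays mirror the n's defined above; the toy corpus, the hyperparameter values, and the number of iterations are illustrative assumptions, not part of the tutorial.

import random

def collapsed_gibbs_lda(docs, W, T, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of documents, each a list of word ids. Returns topic assignments z."""
    random.seed(seed)
    n_wt = [[0] * T for _ in range(W)]          # n^(w)_j : word-topic counts
    n_dt = [[0] * T for _ in range(len(docs))]  # n^(d)_j : document-topic counts
    n_t = [0] * T                               # n^(.)_j : topic totals
    z = [[0] * len(d) for d in docs]

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = random.randrange(T)
            z[d][i] = t
            n_wt[w][t] += 1; n_dt[d][t] += 1; n_t[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Exclude the current position (the "-i" in the update).
                n_wt[w][t] -= 1; n_dt[d][t] -= 1; n_t[t] -= 1
                # P(z_i = j | z_-i, w); the document-length denominator is
                # constant in j and can be dropped.
                weights = [(n_wt[w][j] + beta) / (n_t[j] + W * beta)
                           * (n_dt[d][j] + alpha) for j in range(T)]
                r = random.uniform(0, sum(weights))
                acc, t_new = 0.0, T - 1
                for j, wgt in enumerate(weights):
                    acc += wgt
                    if r <= acc:
                        t_new = j
                        break
                z[d][i] = t_new
                n_wt[w][t_new] += 1; n_dt[d][t_new] += 1; n_t[t_new] += 1
    return z

# Tiny toy corpus: word ids in a vocabulary of size W=4, with T=2 topics.
docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 3, 2]]
print(collapsed_gibbs_lda(docs, W=4, T=2))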
Inference
Markov Chain Monte Carlo
Summary: Markov Chain Monte Carlo
Forward sampling
(KDD 2012 Tutorial)
Graphical Models
56 / 131
Inference
Markov Chain Monte Carlo
Summary: Markov Chain Monte Carlo
Forward sampling Gibbs sampling
(KDD 2012 Tutorial)
Graphical Models
56 / 131
Inference
Markov Chain Monte Carlo
Summary: Markov Chain Monte Carlo
Forward sampling Gibbs sampling Collapsed Gibbs sampling
(KDD 2012 Tutorial)
Graphical Models
56 / 131
Inference
Markov Chain Monte Carlo
Summary: Markov Chain Monte Carlo
Forward sampling Gibbs sampling Collapsed Gibbs sampling Not covered: block Gibbs, Metropolis-Hastings, etc.
(KDD 2012 Tutorial)
Graphical Models
56 / 131
Inference
Markov Chain Monte Carlo
Summary: Markov Chain Monte Carlo
Forward sampling Gibbs sampling Collapsed Gibbs sampling Not covered: block Gibbs, Metropolis-Hastings, etc. Unbiased (after burn-in), but can have high variance
(KDD 2012 Tutorial)
Graphical Models
56 / 131
Inference
Belief Propagation
Outline 1
Overview
2
Definitions Directed Graphical Models Undirected Graphical Models Factor Graph
3
Inference Exact Inference Markov Chain Monte Carlo Belief Propagation Mean Field Algorithm Exponential Family and Variational Inference Maximizing Problems
4
Parameter Learning
5
Structure Learning (KDD 2012 Tutorial)
Graphical Models
57 / 131
Inference
Belief Propagation
The Sum-Product Algorithm
Also known as belief propagation (BP)
(KDD 2012 Tutorial)
Graphical Models
58 / 131
Inference
Belief Propagation
The Sum-Product Algorithm
Also known as belief propagation (BP) Exact if the graph is a tree; otherwise known as “loopy BP”, approximate
(KDD 2012 Tutorial)
Graphical Models
58 / 131
Inference
Belief Propagation
The Sum-Product Algorithm
Also known as belief propagation (BP) Exact if the graph is a tree; otherwise known as “loopy BP”, approximate The algorithm involves passing messages on the factor graph
(KDD 2012 Tutorial)
Graphical Models
58 / 131
Inference
Belief Propagation
The Sum-Product Algorithm
Also known as belief propagation (BP) Exact if the graph is a tree; otherwise known as “loopy BP”, approximate The algorithm involves passing messages on the factor graph Alternative view: variational approximation (more later)
(KDD 2012 Tutorial)
Graphical Models
58 / 131
Inference
Belief Propagation
Example: A Simple HMM
The Hidden Markov Model template (not a graphical model)
[Figure: two-state HMM with states 1 and 2; transition probabilities 1/4 and 1/2 are shown on the arcs]
P(x | z=1) = (1/2, 1/4, 1/4) over (R, G, B)
P(x | z=2) = (1/4, 1/2, 1/4) over (R, G, B)
π_1 = π_2 = 1/2
Inference
Belief Propagation
Example: A Simple HMM
Observing x1 = R, x2 = G, the directed graphical model
(KDD 2012 Tutorial)
z1
z2
x1=R
x2=G
Graphical Models
60 / 131
Inference
Belief Propagation
Example: A Simple HMM
Observing x1 = R, x2 = G, the directed graphical model is the chain z1 → z2 with observed children x1 = R and x2 = G.
Factor graph: f1 — z1 — f2 — z2, with f1(z1) = P(z1) P(x1 | z1) and f2(z1, z2) = P(z2 | z1) P(x2 | z2)
Inference
Belief Propagation
Messages
A message is a vector of length K, where K is the number of values x takes.
(KDD 2012 Tutorial)
Graphical Models
61 / 131
Inference
Belief Propagation
Messages
A message is a vector of length K, where K is the number of values x takes. There are two types of messages:
(KDD 2012 Tutorial)
Graphical Models
61 / 131
Inference
Belief Propagation
Messages
A message is a vector of length K, where K is the number of values x takes. There are two types of messages: 1
µf →x : message from a factor node f to a variable node x µf →x (i) is the ith element, i = 1 . . . K.
(KDD 2012 Tutorial)
Graphical Models
61 / 131
Inference
Belief Propagation
Messages
A message is a vector of length K, where K is the number of values x takes. There are two types of messages: 1
2
µf →x : message from a factor node f to a variable node x µf →x (i) is the ith element, i = 1 . . . K. µx→f : message from a variable node x to a factor node f
(KDD 2012 Tutorial)
Graphical Models
61 / 131
Inference
Belief Propagation
Leaf Messages Assume tree factor graph. Pick an arbitrary root, say z2
f1
z1
f2
P(z1)P(x1| z1)
z2
P(z2| z1)P(x2| z2) 1/4
1/2
1
2
P(x | z=1)=(1/2, 1/4, 1/4) R G B
P(x | z=2)=(1/4, 1/2, 1/4) R G B
π 1 = π 2 = 1/2 (KDD 2012 Tutorial)
Graphical Models
62 / 131
Inference
Belief Propagation
Leaf Messages Assume tree factor graph. Pick an arbitrary root, say z2 Start messages at leaves.
f1
z1
f2
P(z1)P(x1| z1)
z2
P(z2| z1)P(x2| z2) 1/4
1/2
1
2
P(x | z=1)=(1/2, 1/4, 1/4) R G B
P(x | z=2)=(1/4, 1/2, 1/4) R G B
π 1 = π 2 = 1/2 (KDD 2012 Tutorial)
Graphical Models
62 / 131
Inference
Belief Propagation
Leaf Messages Assume tree factor graph. Pick an arbitrary root, say z2 Start messages at leaves. If a leaf is a factor node f , µf →x (x) = f (x) µf1 →z1 (z1 = 1) = P (z1 = 1)P (R|z1 = 1) = 1/2 · 1/2 = 1/4 µf1 →z1 (z1 = 2) = P (z1 = 2)P (R|z1 = 2) = 1/2 · 1/4 = 1/8
f1
z1
f2
P(z1)P(x1| z1)
z2
P(z2| z1)P(x2| z2) 1/4
1/2
1
2
P(x | z=1)=(1/2, 1/4, 1/4) R G B
P(x | z=2)=(1/4, 1/2, 1/4) R G B
π 1 = π 2 = 1/2 (KDD 2012 Tutorial)
Graphical Models
62 / 131
Inference
Belief Propagation
Leaf Messages
Assume a tree factor graph. Pick an arbitrary root, say z2.
Start messages at the leaves.
If a leaf is a factor node f, µ_{f→x}(x) = f(x):
  µ_{f1→z1}(z1 = 1) = P(z1 = 1) P(R | z1 = 1) = 1/2 · 1/2 = 1/4
  µ_{f1→z1}(z1 = 2) = P(z1 = 2) P(R | z1 = 2) = 1/2 · 1/4 = 1/8
If a leaf is a variable node x, µ_{x→f}(x) = 1
Inference
Belief Propagation
Message from Variable to Factor A node (factor or variable) can send out a message if all other incoming messages have arrived
f1
z1
P(z1)P(x1| z1)
(KDD 2012 Tutorial)
f2
z2
P(z2| z1)P(x2| z2) 1/4
1/2
1
2
P(x | z=1)=(1/2, 1/4, 1/4) P(x | z=2)=(1/4, 1/2, 1/4) Models R G Graphical B R G B
63 / 131
Inference
Belief Propagation
Message from Variable to Factor
A node (factor or variable) can send out a message once all of its other incoming messages have arrived.
Let x be in factor f_s, and let ne(x)\f_s be the factors connected to x, excluding f_s:
  µ_{x→f_s}(x) = ∏_{f ∈ ne(x)\f_s} µ_{f→x}(x)
In the example: µ_{z1→f2}(z1 = 1) = 1/4, µ_{z1→f2}(z1 = 2) = 1/8
Inference
Belief Propagation
Message from Factor to Variable Let x be in factor fs . Let the other variables in fs be x1:M . µfs →x (x) =
X
...
x1
f1 P(z1)P(x1| z1) (KDD 2012 Tutorial)
X
fs (x, x1 , . . . , xM )
xM
z1
M Y
µxm →fs (xm )
m=1
f2
z2
P(z2| z1)P(x2| z2) Graphical Models
64 / 131
Inference
Belief Propagation
Message from Factor to Variable Let x be in factor fs . Let the other variables in fs be x1:M . µfs →x (x) =
X
...
x1
X
fs (x, x1 , . . . , xM )
xM
M Y
µxm →fs (xm )
m=1
In this example µf2 →z2 (s) =
2 X
µz1 →f2 (s0 )f2 (z1 = s0 , z2 = s)
s0 =1
= 1/4P (z2 = s|z1 = 1)P (x2 = G|z2 = s) +1/8P (z2 = s|z1 = 2)P (x2 = G|z2 = s)
f1 P(z1)P(x1| z1) (KDD 2012 Tutorial)
z1
f2
z2
P(z2| z1)P(x2| z2) Graphical Models
64 / 131
Inference
Belief Propagation
Message from Factor to Variable
Let x be in factor f_s, and let the other variables in f_s be x_{1:M}:
  µ_{f_s→x}(x) = ∑_{x_1} ... ∑_{x_M} f_s(x, x_1, ..., x_M) ∏_{m=1}^M µ_{x_m→f_s}(x_m)
In this example
  µ_{f2→z2}(s) = ∑_{s′=1}^{2} µ_{z1→f2}(s′) f2(z1 = s′, z2 = s)
              = 1/4 · P(z2 = s | z1 = 1) P(x2 = G | z2 = s) + 1/8 · P(z2 = s | z1 = 2) P(x2 = G | z2 = s)
We get µ_{f2→z2}(z2 = 1) = 1/32, µ_{f2→z2}(z2 = 2) = 1/8
Inference
Belief Propagation
Up to Root, Back Down
The message has reached the root; now pass messages back down:
  µ_{z2→f2}(z2 = 1) = 1,  µ_{z2→f2}(z2 = 2) = 1
Inference
Belief Propagation
Keep Passing Down P µf2 →z1 (s) = 2s0 =1 µz2 →f2 (s0 )f2 (z1 = s, z2 = s0 ) = 1P (z2 = 1|z1 = s)P (x2 = G|z2 = 1) + 1P (z2 = 2|z1 = s)P (x2 = G|z2 = 2).
f1
z1
f2
P(z1)P(x1| z1)
z2
P(z2| z1)P(x2| z2) 1/4
1/2
1
2
P(x | z=1)=(1/2, 1/4, 1/4) R G B
P(x | z=2)=(1/4, 1/2, 1/4) R G B
π 1 = π 2 = 1/2
(KDD 2012 Tutorial)
Graphical Models
66 / 131
Inference
Belief Propagation
Keep Passing Down
  µ_{f2→z1}(s) = ∑_{s′=1}^{2} µ_{z2→f2}(s′) f2(z1 = s, z2 = s′)
              = 1 · P(z2 = 1 | z1 = s) P(x2 = G | z2 = 1) + 1 · P(z2 = 2 | z1 = s) P(x2 = G | z2 = 2)
We get µ_{f2→z1}(z1 = 1) = 7/16, µ_{f2→z1}(z1 = 2) = 3/8
Inference
Belief Propagation
From Messages to Marginals Once a variable receives all incoming messages, we compute its marginal as Y p(x) ∝ µf →x (x) f ∈ne(x)
(KDD 2012 Tutorial)
Graphical Models
67 / 131
Inference
Belief Propagation
From Messages to Marginals Once a variable receives all incoming messages, we compute its marginal as Y p(x) ∝ µf →x (x) f ∈ne(x)
In this example 1/4 P (z1 |x1 , x2 ) ∝ µf1 →z1 · µf2 →z1 = 1/8 · 1/32 P (z2 |x1 , x2 ) ∝ µf2 →z2 = 1/8 ⇒ 0.2 0.8
(KDD 2012 Tutorial)
Graphical Models
7/16 3/8
=
7/64 3/64
⇒
0.7 0.3
67 / 131
Inference
Belief Propagation
From Messages to Marginals
Once a variable has received all of its incoming messages, we compute its marginal as
  p(x) ∝ ∏_{f ∈ ne(x)} µ_{f→x}(x)
In this example
  P(z1 | x1, x2) ∝ µ_{f1→z1} · µ_{f2→z1} = (1/4 · 7/16, 1/8 · 3/8) = (7/64, 3/64) ⇒ (0.7, 0.3)
  P(z2 | x1, x2) ∝ µ_{f2→z2} = (1/32, 1/8) ⇒ (0.2, 0.8)
One can also compute the marginal of the set of variables x_s involved in a factor f_s:
  p(x_s) ∝ f_s(x_s) ∏_{x ∈ ne(f_s)} µ_{x→f_s}(x)
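The entire message-passing run on this two-variable chain fits in a short Python sketch that reproduces the numbers above. The emission distributions and the prior come from the example; the transition probabilities (1/4 and 3/4 out of state 1, 1/2 and 1/2 out of state 2) are the values implied by the figure and by the message values computed on the slides.

# Sum-product on the factor graph f1 - z1 - f2 - z2 with evidence x1=R, x2=G.
prior = [0.5, 0.5]                       # pi_1 = pi_2 = 1/2
emit = {1: {'R': 0.5, 'G': 0.25, 'B': 0.25},
        2: {'R': 0.25, 'G': 0.5, 'B': 0.25}}
trans = {1: {1: 0.25, 2: 0.75}, 2: {1: 0.5, 2: 0.5}}   # P(z2 | z1), inferred from the example

x1, x2 = 'R', 'G'

# f1(z1) = P(z1) P(x1 | z1);  f2(z1, z2) = P(z2 | z1) P(x2 | z2)
f1 = {s: prior[s - 1] * emit[s][x1] for s in (1, 2)}
f2 = {(s, t): trans[s][t] * emit[t][x2] for s in (1, 2) for t in (1, 2)}

# Leaf-to-root messages (root = z2).
mu_f1_z1 = f1                                            # factor leaf message
mu_z1_f2 = mu_f1_z1                                      # variable with one other neighbor
mu_f2_z2 = {t: sum(mu_z1_f2[s] * f2[(s, t)] for s in (1, 2)) for t in (1, 2)}

# Root-to-leaf messages.
mu_z2_f2 = {t: 1.0 for t in (1, 2)}
mu_f2_z1 = {s: sum(mu_z2_f2[t] * f2[(s, t)] for t in (1, 2)) for s in (1, 2)}

def normalize(d):
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

p_z1 = normalize({s: mu_f1_z1[s] * mu_f2_z1[s] for s in (1, 2)})
p_z2 = normalize(mu_f2_z2)
print("P(z1 | x1, x2) =", p_z1)   # -> {1: 0.7, 2: 0.3}
print("P(z2 | x1, x2) =", p_z2)   # -> {1: 0.2, 2: 0.8}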
Inference
Belief Propagation
Handling Evidence Observing x = v,
(KDD 2012 Tutorial)
Graphical Models
68 / 131
Inference
Belief Propagation
Handling Evidence Observing x = v, we can absorb it in the factor (as we did); or
(KDD 2012 Tutorial)
Graphical Models
68 / 131
Inference
Belief Propagation
Handling Evidence Observing x = v, we can absorb it in the factor (as we did); or set messages µx→f (x) = 0 for all x 6= v
(KDD 2012 Tutorial)
Graphical Models
68 / 131
Inference
Belief Propagation
Handling Evidence Observing x = v, we can absorb it in the factor (as we did); or set messages µx→f (x) = 0 for all x 6= v
Observing XE ,
(KDD 2012 Tutorial)
Graphical Models
68 / 131
Inference
Belief Propagation
Handling Evidence Observing x = v, we can absorb it in the factor (as we did); or set messages µx→f (x) = 0 for all x 6= v
Observing XE , multiplying the incoming messages to x ∈ / XE gives the joint (not p(x|XE )): Y p(x, XE ) ∝ µf →x (x) f ∈ne(x)
(KDD 2012 Tutorial)
Graphical Models
68 / 131
Inference
Belief Propagation
Handling Evidence
Observing x = v, we can absorb it into the factor (as we did above), or set the messages µ_{x→f}(x) = 0 for all x ≠ v.
Observing X_E, multiplying the incoming messages at a node x ∉ X_E gives the joint (not p(x | X_E)):
  p(x, X_E) ∝ ∏_{f ∈ ne(x)} µ_{f→x}(x)
The conditional is easily obtained by normalization:
  p(x | X_E) = p(x, X_E) / ∑_{x′} p(x′, X_E)
Inference
Belief Propagation
Loopy Belief Propagation
So far, we assumed a tree graph
(KDD 2012 Tutorial)
Graphical Models
69 / 131
Inference
Belief Propagation
Loopy Belief Propagation
So far, we assumed a tree graph When the factor graph contains loops, pass messages indefinitely until convergence
(KDD 2012 Tutorial)
Graphical Models
69 / 131
Inference
Belief Propagation
Loopy Belief Propagation
So far, we assumed a tree graph When the factor graph contains loops, pass messages indefinitely until convergence But convergence may not happen
(KDD 2012 Tutorial)
Graphical Models
69 / 131
Inference
Belief Propagation
Loopy Belief Propagation
So far, we assumed a tree graph When the factor graph contains loops, pass messages indefinitely until convergence But convergence may not happen But in many cases loopy BP still works well, empirically
(KDD 2012 Tutorial)
Graphical Models
69 / 131
Inference
Mean Field Algorithm
Outline 1
Overview
2
Definitions Directed Graphical Models Undirected Graphical Models Factor Graph
3
Inference Exact Inference Markov Chain Monte Carlo Belief Propagation Mean Field Algorithm Exponential Family and Variational Inference Maximizing Problems
4
Parameter Learning
5
Structure Learning (KDD 2012 Tutorial)
Graphical Models
70 / 131
Inference
Mean Field Algorithm
Example: The Ising Model
[Figure: neighboring nodes x_s and x_t with node parameter θ_s and edge parameter θ_st]
The random variables x take values in {0, 1}:
  p_θ(x) = (1/Z) exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t )
Inference
Mean Field Algorithm
The Conditional θs xs
θst
xt
Markovian: the conditional distribution for xs is p(xs | x−s ) = p(xs | xN (s) ) N (s) is the neighbors of s.
(KDD 2012 Tutorial)
Graphical Models
72 / 131
Inference
Mean Field Algorithm
The Conditional
Markovian: the conditional distribution for x_s is p(x_s | x_{-s}) = p(x_s | x_{N(s)}), where N(s) is the set of neighbors of s.
This reduces to (recall Gibbs sampling)
  p(x_s = 1 | x_{N(s)}) = 1 / ( exp(−(θ_s + ∑_{t∈N(s)} θ_st x_t)) + 1 )
Inference
Mean Field Algorithm
The Mean Field Algorithm for Ising Model Gibbs sampling would draw xs from p(xs = 1 | xN (s) ) =
(KDD 2012 Tutorial)
1 exp(−(θs +
Graphical Models
P
t∈N (s) θst xt ))
+1
73 / 131
Inference
Mean Field Algorithm
The Mean Field Algorithm for Ising Model Gibbs sampling would draw xs from p(xs = 1 | xN (s) ) =
1 exp(−(θs +
P
t∈N (s) θst xt ))
+1
Instead, let µs be the estimated marginal p(xs = 1)
(KDD 2012 Tutorial)
Graphical Models
73 / 131
Inference
Mean Field Algorithm
The Mean Field Algorithm for Ising Model Gibbs sampling would draw xs from p(xs = 1 | xN (s) ) =
1 exp(−(θs +
P
t∈N (s) θst xt ))
+1
Instead, let µs be the estimated marginal p(xs = 1) Mean field algorithm: µs ←
(KDD 2012 Tutorial)
1 exp(−(θs +
P
t∈N (s) θst µt ))
Graphical Models
+1
73 / 131
Inference
Mean Field Algorithm
The Mean Field Algorithm for Ising Model Gibbs sampling would draw xs from p(xs = 1 | xN (s) ) =
1 exp(−(θs +
P
t∈N (s) θst xt ))
+1
Instead, let µs be the estimated marginal p(xs = 1) Mean field algorithm: µs ←
1 exp(−(θs +
P
t∈N (s) θst µt ))
+1
The µ’s are updated iteratively
(KDD 2012 Tutorial)
Graphical Models
73 / 131
Inference
Mean Field Algorithm
The Mean Field Algorithm for the Ising Model
Gibbs sampling would draw x_s from
  p(x_s = 1 | x_{N(s)}) = 1 / ( exp(−(θ_s + ∑_{t∈N(s)} θ_st x_t)) + 1 )
Instead, let µ_s be the estimated marginal p(x_s = 1).
Mean field update:
  µ_s ← 1 / ( exp(−(θ_s + ∑_{t∈N(s)} θ_st µ_t)) + 1 )
The µ's are updated iteratively.
The mean field algorithm is coordinate ascent, and is guaranteed to converge to a local optimum (more later).
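For comparison with the Gibbs sketch earlier, here is a small Python implementation of these coordinate-ascent updates on a grid Ising model; the grid size, parameter values, and iteration count are again illustrative assumptions.

import math

def mean_field_ising(n=10, theta_s=0.0, theta_st=0.5, iters=100):
    """Mean field updates mu_s <- sigmoid(theta_s + sum_t theta_st * mu_t) on an n-by-n grid."""
    mu = [[0.5] * n for _ in range(n)]
    for _ in range(iters):
        for i in range(n):
            for j in range(n):
                total = theta_s
                if i > 0:     total += theta_st * mu[i - 1][j]
                if i < n - 1: total += theta_st * mu[i + 1][j]
                if j > 0:     total += theta_st * mu[i][j - 1]
                if j < n - 1: total += theta_st * mu[i][j + 1]
                mu[i][j] = 1.0 / (math.exp(-total) + 1.0)   # the update above
    return mu

mu = mean_field_ising()
print("estimated marginal p(x_s = 1) at the center node:", mu[5][5])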
Inference
Exponential Family and Variational Inference
Outline 1
Overview
2
Definitions Directed Graphical Models Undirected Graphical Models Factor Graph
3
Inference Exact Inference Markov Chain Monte Carlo Belief Propagation Mean Field Algorithm Exponential Family and Variational Inference Maximizing Problems
4
Parameter Learning
5
Structure Learning (KDD 2012 Tutorial)
Graphical Models
74 / 131
Inference
Exponential Family and Variational Inference
Exponential Family
Let φ(X) = (φ1 (X), . . . , φd (X))> be d sufficient statistics, where φi : X 7→ R
(KDD 2012 Tutorial)
Graphical Models
75 / 131
Inference
Exponential Family and Variational Inference
Exponential Family
Let φ(X) = (φ1 (X), . . . , φd (X))> be d sufficient statistics, where φi : X 7→ R Note X is all the nodes in a Graphical model
(KDD 2012 Tutorial)
Graphical Models
75 / 131
Inference
Exponential Family and Variational Inference
Exponential Family
Let φ(X) = (φ1 (X), . . . , φd (X))> be d sufficient statistics, where φi : X 7→ R Note X is all the nodes in a Graphical model φi (X) sometimes called a feature function
(KDD 2012 Tutorial)
Graphical Models
75 / 131
Inference
Exponential Family and Variational Inference
Exponential Family
Let φ(X) = (φ1 (X), . . . , φd (X))> be d sufficient statistics, where φi : X 7→ R Note X is all the nodes in a Graphical model φi (X) sometimes called a feature function Let θ = (θ1 , . . . , θd )> ∈ Rd be canonical parameters.
(KDD 2012 Tutorial)
Graphical Models
75 / 131
Inference
Exponential Family and Variational Inference
Exponential Family
Let φ(X) = (φ_1(X), ..., φ_d(X))^⊤ be d sufficient statistics, where φ_i : X → R.
Note X is all the nodes in a graphical model.
φ_i(X) is sometimes called a feature function.
Let θ = (θ_1, ..., θ_d)^⊤ ∈ R^d be the canonical parameters.
The exponential family is the family of probability densities:
  p_θ(x) = exp( θ^⊤ φ(x) − A(θ) )
(KDD 2012 Tutorial)
Graphical Models
75 / 131
Inference
Exponential Family and Variational Inference
Exponential Family pθ (x) = exp θ> φ(x) − A(θ) Inner product between parameters θ and sufficient statistics φ. pθ (x) > pθ (y) > pθ (z) φ( ) x
θ
φ( y)
φ( z)
(KDD 2012 Tutorial)
Graphical Models
76 / 131
Inference
Exponential Family and Variational Inference
Exponential Family pθ (x) = exp θ> φ(x) − A(θ) Inner product between parameters θ and sufficient statistics φ. pθ (x) > pθ (y) > pθ (z) φ( ) x
θ
φ( y)
φ( z)
A is the log partition function, Z A(θ) = log exp θ> φ(x) ν(dx)
(KDD 2012 Tutorial)
Graphical Models
76 / 131
Inference
Exponential Family and Variational Inference
Exponential Family
  p_θ(x) = exp( θ^⊤ φ(x) − A(θ) )
Inner product between the parameters θ and the sufficient statistics φ.
[Figure: the direction θ with the projections of φ(x), φ(y), φ(z); here p_θ(x) > p_θ(y) > p_θ(z)]
A is the log partition function,
  A(θ) = log ∫ exp( θ^⊤ φ(x) ) ν(dx)
i.e. A = log Z.
Inference
Exponential Family and Variational Inference
Minimal vs. Overcomplete Models
Parameters for which the density is normalizable: Ω = {θ ∈ Rd | A(θ) < ∞}
(KDD 2012 Tutorial)
Graphical Models
77 / 131
Inference
Exponential Family and Variational Inference
Minimal vs. Overcomplete Models
Parameters for which the density is normalizable: Ω = {θ ∈ Rd | A(θ) < ∞} A minimal exponential family is where the φ’s are linearly independent.
(KDD 2012 Tutorial)
Graphical Models
77 / 131
Inference
Exponential Family and Variational Inference
Minimal vs. Overcomplete Models
Parameters for which the density is normalizable: Ω = {θ ∈ Rd | A(θ) < ∞} A minimal exponential family is where the φ’s are linearly independent. An overcomplete exponential family is where the φ’s are linearly dependent: ∃α ∈ Rd , α> φ(x) = constant ∀x
(KDD 2012 Tutorial)
Graphical Models
77 / 131
Inference
Exponential Family and Variational Inference
Minimal vs. Overcomplete Models
Parameters for which the density is normalizable: Ω = {θ ∈ Rd | A(θ) < ∞} A minimal exponential family is where the φ’s are linearly independent. An overcomplete exponential family is where the φ’s are linearly dependent: ∃α ∈ Rd , α> φ(x) = constant ∀x Both minimal and overcomplete representations are useful.
(KDD 2012 Tutorial)
Graphical Models
77 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = β x (1 − β)1−x for x ∈ {0, 1} and β ∈ (0, 1). Does not look like an exponential family!
(KDD 2012 Tutorial)
Graphical Models
78 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = β x (1 − β)1−x for x ∈ {0, 1} and β ∈ (0, 1). Does not look like an exponential family! Can be rewritten as p(x) = exp (x log β + (1 − x) log(1 − β))
(KDD 2012 Tutorial)
Graphical Models
78 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = β x (1 − β)1−x for x ∈ {0, 1} and β ∈ (0, 1). Does not look like an exponential family! Can be rewritten as p(x) = exp (x log β + (1 − x) log(1 − β)) Now in exponential family form with φ1 (x) = x, φ2 (x) = 1 − x, θ1 = log β, θ2 = log(1 − β), and A(θ) = 0.
(KDD 2012 Tutorial)
Graphical Models
78 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = β x (1 − β)1−x for x ∈ {0, 1} and β ∈ (0, 1). Does not look like an exponential family! Can be rewritten as p(x) = exp (x log β + (1 − x) log(1 − β)) Now in exponential family form with φ1 (x) = x, φ2 (x) = 1 − x, θ1 = log β, θ2 = log(1 − β), and A(θ) = 0. Overcomplete: α1 = α2 = 1 makes α> φ(x) = 1 for all x
(KDD 2012 Tutorial)
Graphical Models
78 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = exp (x log β + (1 − x) log(1 − β)) Can be further rewritten as p(x) = exp (xθ − log(1 + exp(θ)))
(KDD 2012 Tutorial)
Graphical Models
79 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = exp (x log β + (1 − x) log(1 − β)) Can be further rewritten as p(x) = exp (xθ − log(1 + exp(θ))) Minimal exponential family with β φ(x) = x, θ = log 1−β , A(θ) = log(1 + exp(θ)).
(KDD 2012 Tutorial)
Graphical Models
79 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = exp (x log β + (1 − x) log(1 − β)) Can be further rewritten as p(x) = exp (xθ − log(1 + exp(θ))) Minimal exponential family with β φ(x) = x, θ = log 1−β , A(θ) = log(1 + exp(θ)). Many distributions (e.g., Gaussian, exponential, Poisson, Beta) are in the exponential family
(KDD 2012 Tutorial)
Graphical Models
79 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 1: Bernoulli
p(x) = exp( x log β + (1 − x) log(1 − β) )
Can be further rewritten as
  p(x) = exp( xθ − log(1 + exp(θ)) )
Minimal exponential family with φ(x) = x, θ = log( β / (1 − β) ), A(θ) = log(1 + exp(θ)).
Many distributions (e.g., Gaussian, exponential, Poisson, Beta) are in the exponential family,
but not all (e.g., the Laplace distribution).
Exponential Family and Variational Inference
Inference
Exponential Family Example 2: Ising Model
θs xs
θst
xt
pθ (x) = exp
X s∈V
X
θs xs +
θst xs xt − A(θ)
(s,t)∈E
Binary random variable xs ∈ {0, 1}
(KDD 2012 Tutorial)
Graphical Models
80 / 131
Exponential Family and Variational Inference
Inference
Exponential Family Example 2: Ising Model
θs xs
θst
xt
pθ (x) = exp
X s∈V
X
θs xs +
θst xs xt − A(θ)
(s,t)∈E
Binary random variable xs ∈ {0, 1} d = |V | + |E| sufficient statistics: φ(x) = (. . . xs . . . xst . . .)>
(KDD 2012 Tutorial)
Graphical Models
80 / 131
Exponential Family and Variational Inference
Inference
Exponential Family Example 2: Ising Model
[Figure: neighboring nodes x_s and x_t with parameters θ_s and θ_st]
  p_θ(x) = exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t − A(θ) )
Binary random variables x_s ∈ {0, 1}.
d = |V| + |E| sufficient statistics: φ(x) = (..., x_s, ..., x_s x_t, ...)^⊤
This is a regular (Ω = R^d), minimal exponential family.
Inference
Exponential Family and Variational Inference
Exponential Family Example 3: Potts Model θs xs
θst
xt
Similar to Ising model but generalizing xs ∈ {0, . . . , r − 1}.
(KDD 2012 Tutorial)
Graphical Models
81 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 3: Potts Model
Similar to the Ising model, but generalizing to x_s ∈ {0, ..., r − 1}.
Indicator functions: f_sj(x) = 1 if x_s = j and 0 otherwise; f_stjk(x) = 1 if x_s = j ∧ x_t = k, and 0 otherwise.
  p_θ(x) = exp( ∑_{sj} θ_sj f_sj(x) + ∑_{stjk} θ_stjk f_stjk(x) − A(θ) )
Inference
Exponential Family and Variational Inference
Exponential Family Example 3: Potts Model θs xs
θst
xt
pθ (x) = exp
X sj
θsj fsj (x) +
X
θstjk fstjk (x) − A(θ)
stjk
d = r|V | + r2 |E|
(KDD 2012 Tutorial)
Graphical Models
82 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 3: Potts Model θs xs
θst
xt
pθ (x) = exp
X
θsj fsj (x) +
sj
X
θstjk fstjk (x) − A(θ)
stjk
d = r|V | + r2 |E| Regular but overcomplete, because and all x.
(KDD 2012 Tutorial)
Pr−1
Graphical Models
j=0 θsj (x)
= 1 for any s ∈ V
82 / 131
Inference
Exponential Family and Variational Inference
Exponential Family Example 3: Potts Model θs xs
θst
xt
pθ (x) = exp
X
θsj fsj (x) +
sj
X
θstjk fstjk (x) − A(θ)
stjk
d = r|V | + r2 |E| Regular but overcomplete, because and all x.
Pr−1
j=0 θsj (x)
= 1 for any s ∈ V
The Potts model is a special case where the parameters are tied: θstkk = α, and θstjk = β for j 6= k. (KDD 2012 Tutorial)
Graphical Models
82 / 131
Inference
Exponential Family and Variational Inference
Important Relationship
For sufficient statistics defined by indicator functions, e.g., φsj (x) = fsj (x) = 1 if xs = j and 0 otherwise
(KDD 2012 Tutorial)
Graphical Models
83 / 131
Inference
Exponential Family and Variational Inference
Important Relationship
For sufficient statistics defined by indicator functions, e.g., φsj (x) = fsj (x) = 1 if xs = j and 0 otherwise The marginal can be obtained via the mean Eθ [φsj (x)] = P (xs = j)
(KDD 2012 Tutorial)
Graphical Models
83 / 131
Inference
Exponential Family and Variational Inference
Important Relationship
For sufficient statistics defined by indicator functions, e.g., φsj (x) = fsj (x) = 1 if xs = j and 0 otherwise The marginal can be obtained via the mean Eθ [φsj (x)] = P (xs = j) Since inference is often about computing the marginal, in this case it is equivalent to computing the mean.
(KDD 2012 Tutorial)
Graphical Models
83 / 131
Inference
Exponential Family and Variational Inference
Important Relationship
For sufficient statistics defined by indicator functions, e.g., φsj (x) = fsj (x) = 1 if xs = j and 0 otherwise The marginal can be obtained via the mean Eθ [φsj (x)] = P (xs = j) Since inference is often about computing the marginal, in this case it is equivalent to computing the mean. We will come back to this point when we discuss variational inference...
(KDD 2012 Tutorial)
Graphical Models
83 / 131
Inference
Exponential Family and Variational Inference
Mean Parameters Let p be any density (not necessarily in exponential family).
(KDD 2012 Tutorial)
Graphical Models
84 / 131
Inference
Exponential Family and Variational Inference
Mean Parameters Let p be any density (not necessarily in exponential family). Given sufficient statistics φ, the mean parameters µ = (µ1 , . . . , µd )> is Z µi = Ep [φi (x)] = φi (x)p(x)dx
(KDD 2012 Tutorial)
Graphical Models
84 / 131
Inference
Exponential Family and Variational Inference
Mean Parameters Let p be any density (not necessarily in exponential family). Given sufficient statistics φ, the mean parameters µ = (µ1 , . . . , µd )> is Z µi = Ep [φi (x)] = φi (x)p(x)dx The set of mean parameters define a convex set: M = {µ ∈ Rd | ∃p s.t. Ep [φ(x)] = µ}
(KDD 2012 Tutorial)
Graphical Models
84 / 131
Inference
Exponential Family and Variational Inference
Mean Parameters Let p be any density (not necessarily in exponential family). Given sufficient statistics φ, the mean parameters µ = (µ1 , . . . , µd )> is Z µi = Ep [φi (x)] = φi (x)p(x)dx The set of mean parameters define a convex set: M = {µ ∈ Rd | ∃p s.t. Ep [φ(x)] = µ} If µ(1) , µ(2) ∈ M, there must exist p(1) , p(2)
(KDD 2012 Tutorial)
Graphical Models
84 / 131
Inference
Exponential Family and Variational Inference
Mean Parameters Let p be any density (not necessarily in exponential family). Given sufficient statistics φ, the mean parameters µ = (µ1 , . . . , µd )> is Z µi = Ep [φi (x)] = φi (x)p(x)dx The set of mean parameters define a convex set: M = {µ ∈ Rd | ∃p s.t. Ep [φ(x)] = µ} If µ(1) , µ(2) ∈ M, there must exist p(1) , p(2) The convex combinations of p(1) , p(2) leads to another mean parameter in M
(KDD 2012 Tutorial)
Graphical Models
84 / 131
Inference
Exponential Family and Variational Inference
Mean Parameters
Let p be any density (not necessarily in the exponential family).
Given sufficient statistics φ, the mean parameters µ = (µ_1, ..., µ_d)^⊤ are
  µ_i = E_p[φ_i(x)] = ∫ φ_i(x) p(x) dx
The set of mean parameters defines a convex set:
  M = {µ ∈ R^d | ∃p s.t. E_p[φ(x)] = µ}
If µ^(1), µ^(2) ∈ M, there must exist p^(1), p^(2).
A convex combination of p^(1), p^(2) gives another mean parameter in M.
Therefore M is convex.
Inference
Exponential Family and Variational Inference
Example: The First Two Moments
Let φ1 (x) = x, φ2 (x) = x2
(KDD 2012 Tutorial)
Graphical Models
85 / 131
Inference
Exponential Family and Variational Inference
Example: The First Two Moments
Let φ1 (x) = x, φ2 (x) = x2 For any p (not necessarily Gaussian) on x, the mean parameters µ = (µ1 , µ2 ) = (E(x), E(x2 ))> .
(KDD 2012 Tutorial)
Graphical Models
85 / 131
Inference
Exponential Family and Variational Inference
Example: The First Two Moments
Let φ1 (x) = x, φ2 (x) = x2 For any p (not necessarily Gaussian) on x, the mean parameters µ = (µ1 , µ2 ) = (E(x), E(x2 ))> . Note V(x) = E(x2 ) − E2 (x) = µ2 − µ21 ≥ 0 for any p
(KDD 2012 Tutorial)
Graphical Models
85 / 131
Inference
Exponential Family and Variational Inference
Example: The First Two Moments
Let φ_1(x) = x, φ_2(x) = x².
For any p (not necessarily Gaussian) on x, the mean parameters are µ = (µ_1, µ_2)^⊤ = (E(x), E(x²))^⊤.
Note V(x) = E(x²) − E(x)² = µ_2 − µ_1² ≥ 0 for any p.
So M is not R × R_+ but rather the subset {(µ_1, µ_2) | µ_2 ≥ µ_1²}.
Inference
Exponential Family and Variational Inference
The Marginal Polytope
The marginal polytope is defined for discrete xs
(KDD 2012 Tutorial)
Graphical Models
86 / 131
Inference
Exponential Family and Variational Inference
The Marginal Polytope
The marginal polytope is defined for discrete xs P Recall M = {µ ∈ Rd | µ = x φ(x)p(x) for some p}
(KDD 2012 Tutorial)
Graphical Models
86 / 131
Inference
Exponential Family and Variational Inference
The Marginal Polytope
The marginal polytope is defined for discrete xs P Recall M = {µ ∈ Rd | µ = x φ(x)p(x) for some p} p can be a point mass function on a particular x.
(KDD 2012 Tutorial)
Graphical Models
86 / 131
Inference
Exponential Family and Variational Inference
The Marginal Polytope
The marginal polytope is defined for discrete xs P Recall M = {µ ∈ Rd | µ = x φ(x)p(x) for some p} p can be a point mass function on a particular x. In fact any p is a convex combination of such point mass functions.
(KDD 2012 Tutorial)
Graphical Models
86 / 131
Inference
Exponential Family and Variational Inference
The Marginal Polytope
The marginal polytope is defined for discrete x_s.
Recall M = {µ ∈ R^d | µ = ∑_x φ(x) p(x) for some p}.
p can be a point mass function on a particular x.
In fact any p is a convex combination of such point mass functions.
M = conv{φ(x), ∀x} is a convex hull, called the marginal polytope.
Inference
Exponential Family and Variational Inference
Marginal Polytope Example Tiny Ising model: two nodes x1 , x2 ∈ {0, 1} connected by an edge. minimal sufficient statistics φ(x1 , x2 ) = (x1 , x2 , x1 x2 )> .
(KDD 2012 Tutorial)
Graphical Models
87 / 131
Inference
Exponential Family and Variational Inference
Marginal Polytope Example Tiny Ising model: two nodes x1 , x2 ∈ {0, 1} connected by an edge. minimal sufficient statistics φ(x1 , x2 ) = (x1 , x2 , x1 x2 )> . only 4 different x = (x1 , x2 ).
(KDD 2012 Tutorial)
Graphical Models
87 / 131
Inference
Exponential Family and Variational Inference
Marginal Polytope Example Tiny Ising model: two nodes x1 , x2 ∈ {0, 1} connected by an edge. minimal sufficient statistics φ(x1 , x2 ) = (x1 , x2 , x1 x2 )> . only 4 different x = (x1 , x2 ). the marginal polytope is M = conv{(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)}
(KDD 2012 Tutorial)
Graphical Models
87 / 131
Inference
Exponential Family and Variational Inference
Marginal Polytope Example Tiny Ising model: two nodes x1 , x2 ∈ {0, 1} connected by an edge. minimal sufficient statistics φ(x1 , x2 ) = (x1 , x2 , x1 x2 )> . only 4 different x = (x1 , x2 ). the marginal polytope is M = conv{(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)}
the convex hull is a polytope inside the unit cube.
(KDD 2012 Tutorial)
Graphical Models
87 / 131
Inference
Exponential Family and Variational Inference
Marginal Polytope Example Tiny Ising model: two nodes x1 , x2 ∈ {0, 1} connected by an edge. minimal sufficient statistics φ(x1 , x2 ) = (x1 , x2 , x1 x2 )> . only 4 different x = (x1 , x2 ). the marginal polytope is M = conv{(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)}
the convex hull is a polytope inside the unit cube. the three coordinates are node marginals µ1 ≡ Ep [x1 = 1], µ2 ≡ Ep [x2 = 1] and edge marginal µ12 ≡ Ep [x1 = x2 = 1], hence the name. (KDD 2012 Tutorial)
Graphical Models
87 / 131
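The four vertices and the resulting mean parameters are easy to enumerate. The short Python check below (not from the tutorial; the example distribution p is arbitrary) lists φ(x) for the four configurations and shows how any distribution over them yields a mean parameter that is a convex combination of the vertices.

from itertools import product

def phi(x1, x2):
    # minimal sufficient statistics of the tiny Ising model
    return (x1, x2, x1 * x2)

configs = list(product([0, 1], repeat=2))
vertices = [phi(x1, x2) for x1, x2 in configs]
print("vertices of the marginal polytope:", vertices)
# -> [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]

# An arbitrary distribution p over the four configurations; its mean parameter
# is the corresponding convex combination of the vertices.
p = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
mu = [0.0, 0.0, 0.0]
for cfg, w in p.items():
    for k, f in enumerate(phi(*cfg)):
        mu[k] += w * f
print("mean parameters (mu1, mu2, mu12):", mu)
# mu1 = P(x1 = 1), mu2 = P(x2 = 1), mu12 = P(x1 = x2 = 1): the node and edge marginals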
Inference
Exponential Family and Variational Inference
The Log Partition Function A
For any regular exponential family, A(θ) is convex in θ.
(KDD 2012 Tutorial)
Graphical Models
88 / 131
Inference
Exponential Family and Variational Inference
The Log Partition Function A
For any regular exponential family, A(θ) is convex in θ. Strictly convex for minimal exponential family.
(KDD 2012 Tutorial)
Graphical Models
88 / 131
Inference
Exponential Family and Variational Inference
The Log Partition Function A
For any regular exponential family, A(θ) is convex in θ. Strictly convex for minimal exponential family. Nice property:
(KDD 2012 Tutorial)
∂A(θ) ∂θi
= Eθ [φi (x)]
Graphical Models
88 / 131
Inference
Exponential Family and Variational Inference
The Log Partition Function A
For any regular exponential family, A(θ) is convex in θ.
Strictly convex for a minimal exponential family.
Nice property: ∂A(θ)/∂θ_i = E_θ[φ_i(x)]
Therefore ∇A = µ, the mean parameters of p_θ.
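This identity can be checked numerically for the minimal Bernoulli family, where A(θ) = log(1 + exp(θ)). The finite-difference step below is only for illustration.

import math

def A(theta):                       # log partition of the minimal Bernoulli family
    return math.log(1.0 + math.exp(theta))

theta = 0.7
eps = 1e-6
grad_A = (A(theta + eps) - A(theta - eps)) / (2 * eps)   # numerical dA/dtheta
mean_phi = math.exp(theta) / (1.0 + math.exp(theta))     # E_theta[phi(x)] = P(x = 1)
print(grad_A, mean_phi)   # the two numbers agree: grad A equals the mean parameter mu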
Inference
Exponential Family and Variational Inference
Conjugate Duality The conjugate dual function A∗ to A is defined as A∗ (µ) = sup θ> µ − A(θ) θ∈Ω
(KDD 2012 Tutorial)
Graphical Models
89 / 131
Inference
Exponential Family and Variational Inference
Conjugate Duality The conjugate dual function A∗ to A is defined as A∗ (µ) = sup θ> µ − A(θ) θ∈Ω
Such definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition.
(KDD 2012 Tutorial)
Graphical Models
89 / 131
Inference
Exponential Family and Variational Inference
Conjugate Duality The conjugate dual function A∗ to A is defined as A∗ (µ) = sup θ> µ − A(θ) θ∈Ω
Such definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition. For any µ ∈ M’s interior, let θ(µ) satisfy Eθ(µ) [φ(x)] = ∇A(θ(µ)) = µ
(KDD 2012 Tutorial)
Graphical Models
89 / 131
Inference
Exponential Family and Variational Inference
Conjugate Duality The conjugate dual function A∗ to A is defined as A∗ (µ) = sup θ> µ − A(θ) θ∈Ω
Such definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition. For any µ ∈ M’s interior, let θ(µ) satisfy Eθ(µ) [φ(x)] = ∇A(θ(µ)) = µ Then A∗ (µ) = −H(pθ(µ) ), the negative entropy.
(KDD 2012 Tutorial)
Graphical Models
89 / 131
Inference
Exponential Family and Variational Inference
Conjugate Duality The conjugate dual function A∗ to A is defined as A∗ (µ) = sup θ> µ − A(θ) θ∈Ω
Such definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition. For any µ ∈ M’s interior, let θ(µ) satisfy Eθ(µ) [φ(x)] = ∇A(θ(µ)) = µ Then A∗ (µ) = −H(pθ(µ) ), the negative entropy. The dual of the dual gives back A: A(θ) = supµ∈M µ> θ − A∗ (µ)
(KDD 2012 Tutorial)
Graphical Models
89 / 131
Inference
Exponential Family and Variational Inference
Conjugate Duality
The conjugate dual function A* to A is defined as
  A*(µ) = sup_{θ∈Ω} θ^⊤ µ − A(θ)
Such a definition, where a quantity is expressed as the solution to an optimization problem, is called a variational definition.
For any µ in the interior of M, let θ(µ) satisfy E_{θ(µ)}[φ(x)] = ∇A(θ(µ)) = µ.
Then A*(µ) = −H(p_{θ(µ)}), the negative entropy.
The dual of the dual gives back A: A(θ) = sup_{µ∈M} µ^⊤ θ − A*(µ).
For all θ ∈ Ω, the supremum is attained uniquely at the µ ∈ M° given by the moment matching condition µ = E_θ[φ(x)].
Inference
Exponential Family and Variational Inference
Example: Conjugate Dual for Bernoulli
Recall the minimal exponential family for Bernoulli with φ(x) = x, A(θ) = log(1 + exp(θ)), Ω = R.
(KDD 2012 Tutorial)
Graphical Models
90 / 131
Inference
Exponential Family and Variational Inference
Example: Conjugate Dual for Bernoulli
Recall the minimal exponential family for Bernoulli with φ(x) = x, A(θ) = log(1 + exp(θ)), Ω = R. By definition A∗ (µ) = sup θµ − log(1 + exp(θ)) θ∈R
(KDD 2012 Tutorial)
Graphical Models
90 / 131
Inference
Exponential Family and Variational Inference
Example: Conjugate Dual for Bernoulli
Recall the minimal exponential family for Bernoulli with φ(x) = x, A(θ) = log(1 + exp(θ)), Ω = R.
By definition,
  A*(µ) = sup_{θ∈R} θµ − log(1 + exp(θ))
Taking the derivative and solving,
  A*(µ) = µ log µ + (1 − µ) log(1 − µ)
i.e., the negative entropy.
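The same dual can be checked numerically: a crude grid search for sup_θ {θµ − A(θ)} recovers the negative entropy. The grid range and resolution below are arbitrary illustrative choices.

import math

def A(theta):
    return math.log(1.0 + math.exp(theta))

def A_star_numeric(mu):
    # crude grid search for sup_theta { theta * mu - A(theta) } over theta in [-20, 20]
    return max(t * mu - A(t) for t in (i / 1000.0 for i in range(-20000, 20001)))

def A_star_exact(mu):
    return mu * math.log(mu) + (1 - mu) * math.log(1 - mu)

for mu in (0.1, 0.5, 0.9):
    print(mu, A_star_numeric(mu), A_star_exact(mu))   # the two columns match closely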
Inference
Exponential Family and Variational Inference
Inference with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
The supremum is attained by µ = Eθ [φ(x)].
(KDD 2012 Tutorial)
Graphical Models
91 / 131
Inference
Exponential Family and Variational Inference
Inference with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
The supremum is attained by µ = Eθ [φ(x)]. Want to compute the marginals P (xs = j)? They are the mean parameters µsj = Eθ [φij (x)] under standard overcomplete representation.
(KDD 2012 Tutorial)
Graphical Models
91 / 131
Inference
Exponential Family and Variational Inference
Inference with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
The supremum is attained by µ = Eθ [φ(x)]. Want to compute the marginals P (xs = j)? They are the mean parameters µsj = Eθ [φij (x)] under standard overcomplete representation. Want to compute the mean parameters µsj ? They are the arg sup to the optimization problem above.
(KDD 2012 Tutorial)
Graphical Models
91 / 131
Inference
Exponential Family and Variational Inference
Inference with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
The supremum is attained by µ = Eθ [φ(x)]. Want to compute the marginals P (xs = j)? They are the mean parameters µsj = Eθ [φij (x)] under standard overcomplete representation. Want to compute the mean parameters µsj ? They are the arg sup to the optimization problem above. This variational representation is exact, not approximate
(KDD 2012 Tutorial)
Graphical Models
91 / 131
Inference
Exponential Family and Variational Inference
Inference with Variational Representation
  A(θ) = sup_{µ∈M} µ^⊤ θ − A*(µ)
The supremum is attained by µ = E_θ[φ(x)].
Want to compute the marginals P(x_s = j)? They are the mean parameters µ_sj = E_θ[φ_sj(x)] under the standard overcomplete representation.
Want to compute the mean parameters µ_sj? They are the arg sup of the optimization problem above.
This variational representation is exact, not approximate.
We will relax it next to derive loopy BP and mean field.
(KDD 2012 Tutorial)
Graphical Models
91 / 131
Inference
Exponential Family and Variational Inference
The Difficulties with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
Difficult to solve even though it is a convex problem
(KDD 2012 Tutorial)
Graphical Models
92 / 131
Inference
Exponential Family and Variational Inference
The Difficulties with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
Difficult to solve even though it is a convex problem Two issues:
(KDD 2012 Tutorial)
Graphical Models
92 / 131
Inference
Exponential Family and Variational Inference
The Difficulties with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
Difficult to solve even though it is a convex problem Two issues: Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices)
(KDD 2012 Tutorial)
Graphical Models
92 / 131
Inference
Exponential Family and Variational Inference
The Difficulties with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
Difficult to solve even though it is a convex problem Two issues: Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices) The dual function A∗ (µ) usually does not admit an explicit form.
(KDD 2012 Tutorial)
Graphical Models
92 / 131
Inference
Exponential Family and Variational Inference
The Difficulties with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
Difficult to solve even though it is a convex problem Two issues: Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices) The dual function A∗ (µ) usually does not admit an explicit form.
Variational approximation modifies the optimization problem so that it is tractable, at the price of an approximate solution.
(KDD 2012 Tutorial)
Graphical Models
92 / 131
Inference
Exponential Family and Variational Inference
The Difficulties with Variational Representation
A(θ) = sup µ> θ − A∗ (µ) µ∈M
Difficult to solve even though it is a convex problem Two issues: Although the marginal polytope M is convex, it can be quite complex (exponential number of vertices) The dual function A∗ (µ) usually does not admit an explicit form.
Variational approximation modifies the optimization problem so that it is tractable, at the price of an approximate solution. Next, we cast mean field and sum-product algorithms as variational approximations.
(KDD 2012 Tutorial)
Graphical Models
92 / 131
Inference
Exponential Family and Variational Inference
The Mean Field Method as Variational Approximation
The mean field method replaces M with a simpler subset M(F ) on which A∗ (µ) has a closed form.
(KDD 2012 Tutorial)
Graphical Models
93 / 131
Inference
Exponential Family and Variational Inference
The Mean Field Method as Variational Approximation
The mean field method replaces M with a simpler subset M(F ) on which A∗ (µ) has a closed form. Consider the fully disconnected subgraph F = (V, ∅) of the original graph G = (V, E)
(KDD 2012 Tutorial)
Graphical Models
93 / 131
Inference
Exponential Family and Variational Inference
The Mean Field Method as Variational Approximation
The mean field method replaces M with a simpler subset M(F ) on which A∗ (µ) has a closed form. Consider the fully disconnected subgraph F = (V, ∅) of the original graph G = (V, E) Set all θi = 0 if φi involves edges
(KDD 2012 Tutorial)
Graphical Models
93 / 131
Exponential Family and Variational Inference
Inference
The Mean Field Method as Variational Approximation
The mean field method replaces M with a simpler subset M(F) on which A*(µ) has a closed form.
Consider the fully disconnected subgraph F = (V, ∅) of the original graph G = (V, E).
Set θ_i = 0 for all φ_i that involve edges.
The densities in this sub-family are all fully factorized:
  p_θ(x) = ∏_{s∈V} p(x_s ; θ_s)
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M Recall M is the convex hull of extreme points {φ(x)}.
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M Recall M is the convex hull of extreme points {φ(x)}. It turns out the extreme points {φ(x)} ∈ M(F ).
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M Recall M is the convex hull of extreme points {φ(x)}. It turns out the extreme points {φ(x)} ∈ M(F ). Example:
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M Recall M is the convex hull of extreme points {φ(x)}. It turns out the extreme points {φ(x)} ∈ M(F ). Example: The tiny Ising model x1 , x2 ∈ {0, 1} with φ = (x1 , x2 , x1 x2 )>
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M Recall M is the convex hull of extreme points {φ(x)}. It turns out the extreme points {φ(x)} ∈ M(F ). Example: The tiny Ising model x1 , x2 ∈ {0, 1} with φ = (x1 , x2 , x1 x2 )> The point mass distribution p(x = (0, 1)> ) = 1 is realized as a limit to the series p(x) = exp(θ1 x1 + θ2 x2 − A(θ)) where θ1 → −∞ and θ2 → ∞.
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M Recall M is the convex hull of extreme points {φ(x)}. It turns out the extreme points {φ(x)} ∈ M(F ). Example: The tiny Ising model x1 , x2 ∈ {0, 1} with φ = (x1 , x2 , x1 x2 )> The point mass distribution p(x = (0, 1)> ) = 1 is realized as a limit to the series p(x) = exp(θ1 x1 + θ2 x2 − A(θ)) where θ1 → −∞ and θ2 → ∞. This series is in F because θ12 = 0.
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F ) Let M(F ) be the mean parameters of the fully factorized sub-family. In general, M(F ) ⊂ M Recall M is the convex hull of extreme points {φ(x)}. It turns out the extreme points {φ(x)} ∈ M(F ). Example: The tiny Ising model x1 , x2 ∈ {0, 1} with φ = (x1 , x2 , x1 x2 )> The point mass distribution p(x = (0, 1)> ) = 1 is realized as a limit to the series p(x) = exp(θ1 x1 + θ2 x2 − A(θ)) where θ1 → −∞ and θ2 → ∞. This series is in F because θ12 = 0. Hence the extreme point φ(x) = (0, 1, 0) is in M(F ).
(KDD 2012 Tutorial)
Graphical Models
94 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F )
[courtesy Nick Bridle]
Because the extreme points of M are in M(F ), if M(F ) were convex, we would have M = M(F ).
(KDD 2012 Tutorial)
Graphical Models
95 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F )
[courtesy Nick Bridle]
Because the extreme points of M are in M(F ), if M(F ) were convex, we would have M = M(F ). But in general M(F ) is a true subset of M
(KDD 2012 Tutorial)
Graphical Models
95 / 131
Inference
Exponential Family and Variational Inference
The Geometry of M(F )
[courtesy Nick Bridle]
Because the extreme points of M are in M(F ), if M(F ) were convex, we would have M = M(F ). But in general M(F ) is a true subset of M Therefore, M(F ) is a nonconvex inner set of M
The Mean Field Method as Variational Approximation
Recall the exact variational problem
    A(θ) = sup_{µ∈M} µᵀθ − A*(µ),
attained by the solution to the inference problem µ = E_θ[φ(x)].
The mean field method simply replaces M with M(F):
    L(θ) = sup_{µ∈M(F)} µᵀθ − A*(µ)
Obviously L(θ) ≤ A(θ).
The original solution µ* may not be in M(F).
Even if µ* ∈ M(F), the optimization may hit a local maximum and not find it.
Why bother? Because A*(µ) = −H(p_{θ(µ)}) has a very simple form on M(F).
Example: Mean Field for Ising Model
The mean parameters for the Ising model are the node and edge marginals:
    µ_s = p(x_s = 1),  µ_st = p(x_s = 1, x_t = 1)
Fully factorized M(F) means no edges: µ_st = µ_s µ_t.
On M(F), the dual function A*(µ) has the simple form
    A*(µ) = −Σ_{s∈V} H(µ_s) = Σ_{s∈V} [ µ_s log µ_s + (1 − µ_s) log(1 − µ_s) ]
Thus the mean field problem is
    L(θ) = sup_{µ∈M(F)} µᵀθ − Σ_{s∈V} ( µ_s log µ_s + (1 − µ_s) log(1 − µ_s) )
         = max_{(µ_1,...,µ_m)∈[0,1]^m} Σ_{s∈V} θ_s µ_s + Σ_{(s,t)∈E} θ_st µ_s µ_t + Σ_{s∈V} H(µ_s)
Example: Mean Field for Ising Model
    L(θ) = max_{(µ_1,...,µ_m)∈[0,1]^m} Σ_{s∈V} θ_s µ_s + Σ_{(s,t)∈E} θ_st µ_s µ_t + Σ_{s∈V} H(µ_s)
The objective is bilinear in µ, so it is not jointly concave.
But it is concave in a single coordinate µ_s with the others fixed.
Iterative coordinate-wise maximization: fix µ_t for all t ≠ s and optimize µ_s.
Setting the partial derivative w.r.t. µ_s to 0 yields
    µ_s = 1 / ( 1 + exp( −( θ_s + Σ_{t:(s,t)∈E} θ_st µ_t ) ) ),
as we have seen before.
Caution: mean field converges to a local maximum that depends on the initialization of µ_1, ..., µ_m.
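The coordinate-wise update translates directly into code. The following is a minimal sketch of naive mean field for a pairwise Ising model; the function name, data layout, and the chain graph and parameter values in the usage lines are illustrative choices, not taken from the tutorial. Note that the answer depends on the random initialization, as cautioned above.

import numpy as np

def mean_field_ising(theta_node, theta_edge, edges, n_iters=100, seed=0):
    """Naive mean field for a pairwise Ising model on x_s in {0,1}:
    coordinate-wise updates mu_s = sigmoid(theta_s + sum_t theta_st * mu_t)."""
    rng = np.random.default_rng(seed)
    m = len(theta_node)
    mu = rng.uniform(size=m)          # result depends on this initialization
    nbrs = {s: [] for s in range(m)}
    for (s, t) in edges:
        nbrs[s].append(t); nbrs[t].append(s)
    w = {}
    for (s, t), v in theta_edge.items():
        w[(s, t)] = v; w[(t, s)] = v
    for _ in range(n_iters):
        for s in range(m):            # coordinate-wise maximization
            field = theta_node[s] + sum(w[(s, t)] * mu[t] for t in nbrs[s])
            mu[s] = 1.0 / (1.0 + np.exp(-field))
    return mu

# A 3-node chain 0 - 1 - 2 (illustrative parameters)
edges = [(0, 1), (1, 2)]
mu = mean_field_ising(np.array([0.5, -0.2, 0.1]),
                      {(0, 1): 1.0, (1, 2): -0.5}, edges)
print(mu)   # approximate marginals p(x_s = 1)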
The Sum-Product Algorithm as Variational Approximation
    A(θ) = sup_{µ∈M} µᵀθ − A*(µ)
The sum-product algorithm makes two approximations:
it relaxes M to an outer set L;
it replaces the dual A* with an approximation Ã*:
    Ã(θ) = sup_{µ∈L} µᵀθ − Ã*(µ)
The Outer Relaxation
For overcomplete exponential families on discrete nodes, the mean parameters are node and edge marginals
    µ_sj = p(x_s = j),  µ_st,jk = p(x_s = j, x_t = k).
The marginal polytope is M = {µ | ∃ p with marginals µ}.
Now consider τ ∈ R^d_+ satisfying the "node normalization" and "edge-node marginal consistency" conditions:
    Σ_{j=0}^{r−1} τ_sj = 1            ∀ s ∈ V
    Σ_{k=0}^{r−1} τ_st,jk = τ_sj      ∀ s, t ∈ V, j = 0, ..., r − 1
    Σ_{j=0}^{r−1} τ_st,jk = τ_tk      ∀ s, t ∈ V, k = 0, ..., r − 1
Define L = {τ satisfying the above conditions}.
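As a sketch of what membership in L means computationally, the hypothetical helper below checks node normalization and edge-node consistency for given pseudomarginals; the data layout (dicts of node vectors and edge matrices) is an assumption made purely for illustration.

import numpy as np

def in_local_polytope(tau_node, tau_edge, tol=1e-8):
    """Check whether pseudomarginals lie in the local polytope L.
    tau_node[s] is a length-r vector; tau_edge[(s, t)] is an r x r matrix
    whose (j, k) entry plays the role of tau_{st,jk}."""
    for s, vec in tau_node.items():
        if np.any(vec < -tol) or abs(vec.sum() - 1.0) > tol:   # node normalization
            return False
    for (s, t), T in tau_edge.items():
        if np.any(T < -tol):
            return False
        if not np.allclose(T.sum(axis=1), tau_node[s], atol=tol):  # sum over k gives tau_sj
            return False
        if not np.allclose(T.sum(axis=0), tau_node[t], atol=tol):  # sum over j gives tau_tk
            return False
    return True

# A pseudomarginal on a single edge that is locally consistent
tau_node = {0: np.array([0.5, 0.5]), 1: np.array([0.5, 0.5])}
tau_edge = {(0, 1): np.array([[0.1, 0.4], [0.4, 0.1]])}
print(in_local_polytope(tau_node, tau_edge))   # True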
The Outer Relaxation
If the graph is a tree, then M = L.
If the graph has cycles, then M ⊂ L: L does not enforce other constraints that true marginals need to satisfy.
Nice property: L is still a polytope, but a much simpler one than M.
[Figure: the outer polytope L containing M, whose vertices are the φ(x); a point τ ∈ L may lie outside M]
The first approximation in sum-product is to replace M with L.
Approximating A*
Recall µ are node and edge marginals.
If the graph is a tree, one can exactly reconstruct the joint probability:
    p_µ(x) = Π_{s∈V} µ_s,x_s  ·  Π_{(s,t)∈E} µ_st,x_s x_t / ( µ_s,x_s µ_t,x_t )
If the graph is a tree, the entropy of the joint distribution is
    H(p_µ) = − Σ_{s∈V} Σ_{j=0}^{r−1} µ_sj log µ_sj − Σ_{(s,t)∈E} Σ_{j,k} µ_st,jk log ( µ_st,jk / (µ_sj µ_tk) )
Neither holds for graphs with cycles.
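Both facts are easy to verify numerically on a small tree. The sketch below builds a chain-structured distribution (the potentials are made up for illustration), reconstructs the joint from its node and edge marginals, and checks that the marginal-based entropy matches the true entropy.

import numpy as np
from itertools import product

# A 3-node chain x0 - x1 - x2 with binary variables (illustrative potentials).
psi01 = np.array([[2.0, 1.0], [1.0, 3.0]])
psi12 = np.array([[1.0, 2.0], [2.0, 1.0]])
p = np.zeros((2, 2, 2))
for x0, x1, x2 in product(range(2), repeat=3):
    p[x0, x1, x2] = psi01[x0, x1] * psi12[x1, x2]
p /= p.sum()

mu0, mu1, mu2 = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))   # node marginals
mu01, mu12 = p.sum(2), p.sum(0)                               # edge marginals

# Tree reconstruction: p(x) = prod_s mu_s(x_s) * prod_(s,t) mu_st(x_s,x_t)/(mu_s(x_s) mu_t(x_t))
recon = np.zeros_like(p)
for x0, x1, x2 in product(range(2), repeat=3):
    recon[x0, x1, x2] = (mu0[x0] * mu1[x1] * mu2[x2]
                         * mu01[x0, x1] / (mu0[x0] * mu1[x1])
                         * mu12[x1, x2] / (mu1[x1] * mu2[x2]))
print(np.allclose(recon, p))        # True on a tree

# Entropy computed from the marginals matches the true entropy on a tree
H_true = -(p * np.log(p)).sum()
H_node = -sum((m * np.log(m)).sum() for m in (mu0, mu1, mu2))
H_edge = ((mu01 * np.log(mu01 / np.outer(mu0, mu1))).sum()
          + (mu12 * np.log(mu12 / np.outer(mu1, mu2))).sum())
print(np.isclose(H_true, H_node - H_edge))   # True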
Approximating A*
Define the Bethe entropy for τ ∈ L on loopy graphs in the same way:
    H_Bethe(p_τ) = − Σ_{s∈V} Σ_{j=0}^{r−1} τ_sj log τ_sj − Σ_{(s,t)∈E} Σ_{j,k} τ_st,jk log ( τ_st,jk / (τ_sj τ_tk) )
Note τ is not a true marginal, and H_Bethe is not a true entropy.
The second approximation in sum-product is to replace A*(τ) with −H_Bethe(p_τ).
The Sum-Product Algorithm as Variational Approximation
With these two approximations, we arrive at the variational problem
    Ã(θ) = sup_{τ∈L} τᵀθ + H_Bethe(p_τ)
Optimality conditions require the gradients to vanish w.r.t. both τ and the Lagrange multipliers on the constraints τ ∈ L.
The sum-product algorithm can be derived as an iterative fixed-point procedure that satisfies these optimality conditions.
At the solution, Ã(θ) is not guaranteed to be either an upper or a lower bound on A(θ).
τ may not correspond to a true marginal distribution.
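For concreteness, here is a sketch that merely evaluates the Bethe objective τᵀθ + H_Bethe(p_τ) for given pseudomarginals in an overcomplete pairwise model; sum-product can be viewed as searching for a stationary point of this function over L. The data layout and the numbers in the usage lines are illustrative assumptions, not from the tutorial.

import numpy as np

def bethe_objective(theta_node, theta_edge, tau_node, tau_edge):
    """Evaluate tau^T theta + H_Bethe(tau) for an overcomplete pairwise model.
    theta_node[s]: vector theta_{sj}; theta_edge[(s,t)]: matrix theta_{st,jk};
    tau_node / tau_edge: pseudomarginals of the same shapes, assumed to lie in L."""
    lin = sum(tau_node[s] @ theta_node[s] for s in theta_node)
    lin += sum((tau_edge[e] * theta_edge[e]).sum() for e in theta_edge)
    h_node = -sum((tau_node[s] * np.log(tau_node[s])).sum() for s in tau_node)
    h_edge = sum((tau_edge[(s, t)]
                  * np.log(tau_edge[(s, t)]
                           / np.outer(tau_node[s], tau_node[t]))).sum()
                 for (s, t) in tau_edge)
    return lin + (h_node - h_edge)

# Toy single-edge example
theta_node = {0: np.array([0.0, 0.5]), 1: np.array([0.0, -0.2])}
theta_edge = {(0, 1): np.array([[0.3, 0.0], [0.0, 0.3]])}
tau_node = {0: np.array([0.5, 0.5]), 1: np.array([0.6, 0.4])}
tau_edge = {(0, 1): np.array([[0.35, 0.15], [0.25, 0.25]])}
print(bethe_objective(theta_node, theta_edge, tau_node, tau_edge))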
Summary: Variational Inference
The sum-product algorithm (loopy belief propagation)
The mean field method
Not covered: Expectation Propagation
Efficient computation, but often an unknown bias in the solution.
Maximizing Problems
Recall the HMM example:
[HMM diagram: two states 1 and 2, with transition labels 1/4 and 1/2; emissions P(x | z=1) = (1/2, 1/4, 1/4) and P(x | z=2) = (1/4, 1/2, 1/4) over R, G, B; initial probabilities π_1 = π_2 = 1/2]
There are two senses of "best states" z_{1:N} given x_{1:N}:
1. So far we computed the marginal p(z_n | x_{1:N}). We can define "best" as z_n* = argmax_k p(z_n = k | x_{1:N}). However z*_{1:N} as a whole may not be the best; in fact z*_{1:N} can even have zero probability!
2. An alternative is to find
       z*_{1:N} = argmax_{z_{1:N}} p(z_{1:N} | x_{1:N}),
   which is the most likely state configuration as a whole. The max-sum algorithm solves this; it generalizes the Viterbi algorithm for HMMs.
Intermediate: The Max-Product Algorithm
Simple modification to the sum-product algorithm: in the factor-to-variable messages, replace the sum over x_1, ..., x_M with a max.
    µ_{fs→x}(x) = max_{x_1, ..., x_M}  fs(x, x_1, ..., x_M) Π_{m=1}^{M} µ_{x_m→fs}(x_m)
    µ_{x_m→fs}(x_m) = Π_{f∈ne(x_m)\fs} µ_{f→x_m}(x_m)
    µ_{x_leaf→f}(x) = 1
    µ_{f_leaf→x}(x) = f(x)
Intermediate: The Max-Product Algorithm
As in sum-product, pick an arbitrary variable node x as the root.
Pass messages up from the leaves until they reach the root.
Unlike sum-product, do not pass messages back from the root to the leaves.
At the root, multiply the incoming messages:
    p_max = max_x Π_{f∈ne(x)} µ_{f→x}(x)
This is the probability of the most likely state configuration.
Intermediate: The Max-Product Algorithm
To identify the configuration itself, keep back pointers:
When creating the message
    µ_{fs→x}(x) = max_{x_1, ..., x_M}  fs(x, x_1, ..., x_M) Π_{m=1}^{M} µ_{x_m→fs}(x_m),
for each value of x we separately create M pointers back to the values of x_1, ..., x_M that achieve the maximum.
At the root, backtrack the pointers.
Intermediate: The Max-Product Algorithm
[Factor graph for the first two steps of the HMM: f1 — z1 — f2 — z2, with f1(z1) = P(z1)P(x1 | z1) and f2(z1, z2) = P(z2 | z1)P(x2 | z2); emissions P(x | z=1) = (1/2, 1/4, 1/4), P(x | z=2) = (1/4, 1/2, 1/4) over R, G, B; π_1 = π_2 = 1/2; observations x1 = R, x2 = G]
Message from leaf f1:
    µ_{f1→z1}(z1 = 1) = P(z1 = 1) P(R | z1 = 1) = 1/2 · 1/2 = 1/4
    µ_{f1→z1}(z1 = 2) = P(z1 = 2) P(R | z1 = 2) = 1/2 · 1/4 = 1/8
The second message:
    µ_{z1→f2}(z1 = 1) = 1/4
    µ_{z1→f2}(z1 = 2) = 1/8
The first element of the message from f2 to z2:
    µ_{f2→z2}(z2 = 1) = max_{z1} f2(z1, z2) µ_{z1→f2}(z1)
                      = max_{z1} P(z2 = 1 | z1) P(x2 = G | z2 = 1) µ_{z1→f2}(z1)
                      = max(1/4 · 1/4 · 1/4, 1/2 · 1/4 · 1/8) = 1/64
Back pointer for z2 = 1: either z1 = 1 or z1 = 2 (a tie).
The other element of the same message:
    µ_{f2→z2}(z2 = 2) = max_{z1} f2(z1, z2) µ_{z1→f2}(z1)
                      = max_{z1} P(z2 = 2 | z1) P(x2 = G | z2 = 2) µ_{z1→f2}(z1)
                      = max(3/4 · 1/2 · 1/4, 1/2 · 1/2 · 1/8) = 3/32
Back pointer for z2 = 2: z1 = 1
Summarizing the message and its back pointers:
    µ_{f2→z2}(z2 = 1) = 1/64   (back pointer: z1 = 1 or 2)
    µ_{f2→z2}(z2 = 2) = 3/32   (back pointer: z1 = 1)
At the root z2:
    max_{s=1,2} µ_{f2→z2}(s) = 3/32, attained at z2 = 2, which points back to z1 = 1
    z*_{1:2} = argmax_{z_{1:2}} p(z_{1:2} | x_{1:2}) = (1, 2)
In this example, sum-product and max-product produce the same best sequence; in general they differ.
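The whole worked example fits in a few lines of code. The sketch below assumes the transition probabilities implicit in the computations above (P(z2=1 | z1=1) = 1/4, P(z2=2 | z1=1) = 3/4, and 1/2, 1/2 from state 2) and reproduces the message values 1/4, 1/8, 1/64, 3/32 and the best sequence (1, 2).

import numpy as np

pi = np.array([0.5, 0.5])                       # initial distribution over states 1, 2
A = np.array([[0.25, 0.75],                     # A[i, j] = P(z' = j+1 | z = i+1)
              [0.50, 0.50]])
B = np.array([[0.50, 0.25, 0.25],               # emission probs over R, G, B
              [0.25, 0.50, 0.25]])
obs = [0, 1]                                    # x1 = R, x2 = G

# Message from leaf factor f1 = P(z1) P(x1 | z1)
msg1 = pi * B[:, obs[0]]
print(msg1)                                     # [0.25, 0.125] = (1/4, 1/8)

# Factor-to-variable message from f2 = P(z2 | z1) P(x2 | z2), with back pointers
vals = A * msg1[:, None] * B[:, obs[1]][None, :]   # vals[i, j] = f2 * incoming message
msg2 = vals.max(axis=0)
back = vals.argmax(axis=0)
print(msg2)                                     # [0.015625, 0.09375] = (1/64, 3/32)

# At the root z2: best value, then backtrack the pointer
z2 = int(msg2.argmax())
z1 = int(back[z2])
print(msg2[z2], (z1 + 1, z2 + 1))               # 3/32 and the sequence (1, 2)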
From Max-Product to Max-Sum
The max-sum algorithm is equivalent to the max-product algorithm, but works in log space to avoid underflow:
    µ_{fs→x}(x) = max_{x_1, ..., x_M} [ log fs(x, x_1, ..., x_M) + Σ_{m=1}^{M} µ_{x_m→fs}(x_m) ]
    µ_{x_m→fs}(x_m) = Σ_{f∈ne(x_m)\fs} µ_{f→x_m}(x_m)
    µ_{x_leaf→f}(x) = 0
    µ_{f_leaf→x}(x) = log f(x)
At the root,
    log p_max = max_x Σ_{f∈ne(x)} µ_{f→x}(x)
The back pointers are the same.
Parameter Learning
Assume the graph structure is given.
Learning in the exponential family: estimate θ from iid data x_1, ..., x_n.
Principle: maximum likelihood.
Distinguish two cases:
fully observed data: all dimensions of x are observed
partially observed data: some dimensions of x are unobserved
Fully Observed Data
    p_θ(x) = exp( θᵀφ(x) − A(θ) )
Given iid data x_1, ..., x_n, the log likelihood is
    ℓ(θ) = (1/n) Σ_{i=1}^{n} log p_θ(x_i) = θᵀ( (1/n) Σ_{i=1}^{n} φ(x_i) ) − A(θ) = θᵀµ̂ − A(θ)
µ̂ ≡ (1/n) Σ_{i=1}^{n} φ(x_i) is the mean parameter of the empirical distribution on x_1, ..., x_n. Clearly µ̂ ∈ M.
Maximum likelihood: θ_ML = arg sup_{θ∈Ω} θᵀµ̂ − A(θ)
The solution is θ_ML = θ(µ̂), the exponential family density whose mean parameter matches µ̂.
When µ̂ ∈ M° (the interior of M) and φ is minimal, there is a unique maximum likelihood solution θ_ML.
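As a concrete special case, for a fully factorized binary model with φ(x) = (x_1, ..., x_m) the moment-matching condition E_θ[φ(x)] = µ̂ has the closed form θ_s = logit(µ̂_s). The sketch below, with synthetic data purely for illustration, shows this pattern; for general graphs θ(µ̂) has no closed form and must be found numerically.

import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 3))           # pretend data x_1..x_n, each in {0,1}^3

mu_hat = X.mean(axis=0)                          # empirical mean parameters (1/n) sum phi(x_i)
theta_ml = np.log(mu_hat / (1.0 - mu_hat))       # moment matching: sigmoid(theta) = mu_hat

# Sanity check: the fitted model's mean parameters match mu_hat
print(mu_hat, 1.0 / (1.0 + np.exp(-theta_ml)))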
Partially Observed Data
Each item is a pair (x, z), where x is observed and z is unobserved.
The full data is (x_1, z_1), ..., (x_n, z_n), but we only observe x_1, ..., x_n.
The incomplete likelihood is ℓ(θ) = (1/n) Σ_{i=1}^{n} log p_θ(x_i), where p_θ(x_i) = ∫ p_θ(x_i, z) dz.
It can be written as ℓ(θ) = (1/n) Σ_{i=1}^{n} A_{x_i}(θ) − A(θ), with a new log partition function of p_θ(z | x_i), one per item:
    A_{x_i}(θ) = log ∫ exp( θᵀφ(x_i, z′) ) dz′
The Expectation-Maximization (EM) algorithm: lower-bound A_{x_i}.
EM as Variational Lower Bound
Mean parameters realizable by any distribution on z while holding x_i fixed:
    M_{x_i} = { µ ∈ R^d | µ = E_p[φ(x_i, z)] for some p }
The variational definition: A_{x_i}(θ) = sup_{µ∈M_{x_i}} θᵀµ − A*_{x_i}(µ)
Trivial variational lower bound: A_{x_i}(θ) ≥ θᵀµ^i − A*_{x_i}(µ^i), ∀ µ^i ∈ M_{x_i}
Lower bound L on the incomplete log likelihood:
    ℓ(θ) = (1/n) Σ_{i=1}^{n} A_{x_i}(θ) − A(θ)
         ≥ (1/n) Σ_{i=1}^{n} [ θᵀµ^i − A*_{x_i}(µ^i) ] − A(θ)
         ≡ L(µ^1, ..., µ^n, θ)
Exact EM: The E-Step
The EM algorithm is coordinate ascent on L(µ^1, ..., µ^n, θ).
In the E-step, maximize over each µ^i:
    µ^i ← arg max_{µ^i∈M_{x_i}} L(µ^1, ..., µ^n, θ)
Equivalently, arg max_{µ^i∈M_{x_i}} θᵀµ^i − A*_{x_i}(µ^i).
This is the variational representation of the mean parameter µ^i(θ) = E_θ[φ(x_i, z)].
The E-step is named after this expectation E_θ[·] under the current parameters θ.
Exact EM: The M-Step
In the M-step, maximize over θ holding the µ's fixed:
    θ ← arg max_{θ∈Ω} L(µ^1, ..., µ^n, θ) = arg max_{θ∈Ω} θᵀµ̂ − A(θ),  where µ̂ = (1/n) Σ_{i=1}^{n} µ^i
The solution θ(µ̂) satisfies E_{θ(µ̂)}[φ(x)] = µ̂.
This is the standard fully observed maximum likelihood problem, hence the name M-step.
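To make the E-step/M-step pattern concrete, here is a generic EM sketch for a two-component Gaussian mixture with unit variances; this particular model and its numbers are an illustration chosen for brevity, not a model from the tutorial. The E-step computes expected sufficient statistics (responsibilities) under the current θ, and the M-step solves the resulting fully observed maximum likelihood (moment-matching) problem.

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

w, m = np.array([0.5, 0.5]), np.array([-1.0, 1.0])      # initial parameters theta
for _ in range(50):
    # E-step: expected sufficient statistics under p_theta(z | x_i),
    # i.e. responsibilities r[i, k] = p(z_i = k | x_i)
    logp = np.log(w) - 0.5 * (x[:, None] - m[None, :]) ** 2
    logp -= logp.max(axis=1, keepdims=True)
    r = np.exp(logp)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: moment matching, a fully observed ML problem with the
    # expected statistics standing in for the unobserved z
    w = r.mean(axis=0)
    m = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)

print(w, m)     # roughly (0.4, 0.6) and (-2, 3)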
Variational EM
For loopy graphs the E-step is often intractable: we cannot maximize
    max_{µ^i∈M_{x_i}} θᵀµ^i − A*_{x_i}(µ^i)
Improve but not necessarily maximize: "generalized EM".
The mean field method maximizes
    max_{µ^i∈M_{x_i}(F)} θᵀµ^i − A*_{x_i}(µ^i)
up to a local maximum; recall M_{x_i}(F) is an inner approximation to M_{x_i}.
The mean field E-step leads to generalized EM.
The sum-product algorithm does not lead to generalized EM.
Score-Based Structure Learning
Let ℳ be the set of all allowed candidate features.
Let M ⊆ ℳ be a log-linear model structure:
    P(X | M, θ) = (1/Z) exp( Σ_{i∈M} θ_i f_i(X) )
A score for the model M can be max_θ ln P(Data | M, θ).
The score always improves for larger M, so it needs regularization or a Bayesian treatment.
M and θ are treated separately.
Structure Learning for Gaussian Random Fields
Consider a p-dimensional multivariate Gaussian N(µ, Σ).
The graphical model has p nodes x_1, ..., x_p.
The edge between x_i and x_j is absent if and only if Ω_ij = 0, where Ω = Σ^{-1}.
Equivalently, x_i and x_j are conditionally independent given the other variables.
Example
If we know
    Σ = [  14  −16    4   −2
          −16   32   −8    4
            4   −8    8   −4
           −2    4   −4    5 ]
then
    Ω = Σ^{-1} = [ 0.1667  0.0833  0.0000  0
                   0.0833  0.0833  0.0417  0
                   0.0000  0.0417  0.2500  0.1667
                   0       0       0.1667  0.3333 ]
The corresponding graphical model structure is the chain
    x1 — x2 — x3 — x4
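The example can be reproduced in a few lines: invert Σ and read the edges off the nonzero pattern of Ω (a small sketch; the 1e-8 threshold is just an arbitrary numerical tolerance).

import numpy as np

Sigma = np.array([[14., -16.,  4., -2.],
                  [-16., 32., -8.,  4.],
                  [  4., -8.,  8., -4.],
                  [ -2.,  4., -4.,  5.]])
Omega = np.linalg.inv(Sigma)
print(np.round(Omega, 4))

# An edge (x_i, x_j) is present iff Omega_ij != 0 (up to numerical noise)
p = Sigma.shape[0]
edges = [(i + 1, j + 1) for i in range(p) for j in range(i + 1, p)
         if abs(Omega[i, j]) > 1e-8]
print(edges)    # [(1, 2), (2, 3), (3, 4)]: the chain x1 - x2 - x3 - x4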
Structure Learning for Gaussian Random Fields
Let the data be X^(1), ..., X^(n) ∼ N(µ, Σ).
The log likelihood is (n/2) log |Ω| − (1/2) Σ_{i=1}^{n} (X^(i) − µ)ᵀ Ω (X^(i) − µ).
The maximum likelihood estimate of Σ is the sample covariance
    S = (1/n) Σ_i (X^(i) − X̄)(X^(i) − X̄)ᵀ,
where X̄ is the sample mean.
S^{-1} is not a good estimate of Ω when n is small.
Structure Learning for Gaussian Random Fields
For centered data, minimize a regularized problem instead:
    − log |Ω| + (1/n) Σ_{i=1}^{n} X^(i)ᵀ Ω X^(i) + λ Σ_{i≠j} |Ω_ij|
Known as glasso (the graphical lasso).
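As a usage sketch, scikit-learn ships a graphical lasso solver (assuming scikit-learn is available; any glasso implementation would do). Here we sample from the chain-structured Gaussian of the earlier example and fit a sparse Ω; the regularization weight alpha plays the role of λ and would need tuning in practice.

import numpy as np
from sklearn.covariance import GraphicalLasso   # assumes scikit-learn is installed

rng = np.random.default_rng(0)
Sigma = np.array([[14., -16.,  4., -2.],
                  [-16., 32., -8.,  4.],
                  [  4., -8.,  8., -4.],
                  [ -2.,  4., -4.,  5.]])
X = rng.multivariate_normal(np.zeros(4), Sigma, size=500)

model = GraphicalLasso(alpha=0.05, max_iter=200).fit(X)
# Estimated Omega; entries corresponding to absent edges shrink toward zero
# (how close to zero depends on alpha and the sample size)
print(np.round(model.precision_, 3))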
Summary
Graphical model = joint distribution p(x1, ..., xn)
    Bayesian network or Markov random field
    conditional independence
Inference = p(XQ | XE), in general XQ ∪ XE ⊂ {x1, ..., xn}
    exact, MCMC, variational
If p(x1, ..., xn) is not given, estimate it from data
    parameter and structure learning
References
Koller & Friedman, Probabilistic Graphical Models. MIT Press, 2009.
Wainwright & Jordan, Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 2008.
Bishop, Pattern Recognition and Machine Learning. Springer, 2006.