7. Gaussian graphical models
Gaussian graphical models
Gaussian belief propagation
Kalman filtering
Example: consensus propagation
Convergence and correctness
Gaussian graphical models
belief propagation naturally extends to continuous distributions by replacing summations with integrals:
$$\nu_{i\to j}(x_i) = \prod_{k\in\partial i\setminus j} \int \psi_{ik}(x_i, x_k)\,\nu_{k\to i}(x_k)\,dx_k$$
the integration can be intractable for general potential functions; however, for Gaussian graphical models of jointly Gaussian random variables, we can avoid explicit integration by exploiting algebraic structure, which yields efficient inference algorithms
Multivariate jointly Gaussian random variables

four definitions of a Gaussian random vector $x \in \mathbb{R}^n$:
1. $x$ is Gaussian iff $x = Au + b$ for a standard i.i.d. Gaussian random vector $u \sim \mathcal{N}(0, I)$
2. $y = a^T x$ is Gaussian for all $a \in \mathbb{R}^n$
3. covariance form: the probability density function is
$$\mu(x) = \frac{1}{(2\pi)^{n/2}|\Lambda|^{1/2}}\exp\Big\{-\frac{1}{2}(x-m)^T\Lambda^{-1}(x-m)\Big\}$$
denoted as $x \sim \mathcal{N}(m, \Lambda)$ with mean $m = \mathbb{E}[x]$ and covariance matrix $\Lambda = \mathbb{E}[(x-m)(x-m)^T]$ (for some positive definite $\Lambda$)
4. information form: the probability density function is
$$\mu(x) \propto \exp\Big\{-\frac{1}{2}x^T J x + h^T x\Big\}$$
denoted as $x \sim \mathcal{N}^{-1}(h, J)$ with potential vector $h$ and information (or precision) matrix $J$ (for some positive definite $J$); note that $J = \Lambda^{-1}$ and $h = \Lambda^{-1}m = Jm$

caution: $x$ can be non-Gaussian even when every marginal of $x$ is Gaussian (definition 2 requires $a^T x$ to be Gaussian for every $a$)
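As a quick illustration of the two parameterizations, here is a minimal sketch (an assumed example, not part of the lecture) converting between the covariance form $\mathcal{N}(m, \Lambda)$ and the information form $\mathcal{N}^{-1}(h, J)$ using $J = \Lambda^{-1}$ and $h = Jm$:

```python
import numpy as np

# A minimal sketch (assumed example): converting between the covariance form
# N(m, Lambda) and the information form N^{-1}(h, J) of a Gaussian, using the
# identities J = Lambda^{-1} and h = J m stated above.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
Lam = A @ A.T + np.eye(3)      # a positive definite covariance matrix
m = rng.standard_normal(3)

J = np.linalg.inv(Lam)         # information (precision) matrix
h = J @ m                      # potential vector

# converting back recovers the original parameters
assert np.allclose(np.linalg.inv(J), Lam)
assert np.allclose(np.linalg.solve(J, h), m)
```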
consider two operations on the following Gaussian random vector:
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} m_1 \\ m_2 \end{bmatrix}, \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix}\right) = \mathcal{N}^{-1}\!\left(\begin{bmatrix} h_1 \\ h_2 \end{bmatrix}, \begin{bmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{bmatrix}\right)$$

marginalization is easy to compute when $x$ is in covariance form:
$$x_1 \sim \mathcal{N}(m_1, \Lambda_{11})$$
for $x_1 \in \mathbb{R}^{d_1}$, one only needs to read off the corresponding $d_1$ entries of $m$ and the $d_1 \times d_1$ block of $\Lambda$

but it is complicated when $x$ is in information form:
$$x_1 \sim \mathcal{N}^{-1}(h', J')$$
where $J' = \Lambda_{11}^{-1}$ and $h' = J'm_1$; we will prove that
$$h' = h_1 - J_{12}J_{22}^{-1}h_2 \qquad \text{and} \qquad J' = J_{11} - J_{12}J_{22}^{-1}J_{21}$$

what is wrong with computing the marginal using the above formula? for $x_1 \in \mathbb{R}^{d_1}$ and $x_2 \in \mathbb{R}^{d_2}$ with $d_1 \ll d_2$, inverting $J_{22}$ requires runtime $O(d_2^{2.8074})$ (Strassen's algorithm)
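A small numerical check (an assumed example, not from the lecture) of the marginalization formulas above:

```python
import numpy as np

# A sketch (assumed example) checking the marginalization formulas above:
# in covariance form the marginal of x1 is read off directly, while in
# information form J' = J11 - J12 J22^{-1} J21 and h' = h1 - J12 J22^{-1} h2.
rng = np.random.default_rng(1)
d1, d2 = 2, 3
B = rng.standard_normal((d1 + d2, d1 + d2))
Lam = B @ B.T + np.eye(d1 + d2)
m = rng.standard_normal(d1 + d2)
J = np.linalg.inv(Lam)
h = J @ m

J11, J12 = J[:d1, :d1], J[:d1, d1:]
J21, J22 = J[d1:, :d1], J[d1:, d1:]
h1, h2 = h[:d1], h[d1:]

J_marg = J11 - J12 @ np.linalg.solve(J22, J21)
h_marg = h1 - J12 @ np.linalg.solve(J22, h2)

# the result agrees with the covariance-form marginal N(m1, Lambda11)
assert np.allclose(np.linalg.inv(J_marg), Lam[:d1, :d1])
assert np.allclose(np.linalg.solve(J_marg, h_marg), m[:d1])
```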
Proof of $J' = \Lambda_{11}^{-1} = J_{11} - J_{12}J_{22}^{-1}J_{21}$

- $J'$ is called the Schur complement of the block $J_{22}$ of the matrix $J$
- useful matrix identity:
$$\begin{bmatrix} I & -BD^{-1} \\ 0 & I \end{bmatrix}\begin{bmatrix} A & B \\ C & D \end{bmatrix}\begin{bmatrix} I & 0 \\ -D^{-1}C & I \end{bmatrix} = \begin{bmatrix} A - BD^{-1}C & 0 \\ 0 & D \end{bmatrix}$$
so that
$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} I & 0 \\ -D^{-1}C & I \end{bmatrix}\begin{bmatrix} (A - BD^{-1}C)^{-1} & 0 \\ 0 & D^{-1} \end{bmatrix}\begin{bmatrix} I & -BD^{-1} \\ 0 & I \end{bmatrix} = \begin{bmatrix} S^{-1} & -S^{-1}BD^{-1} \\ -D^{-1}CS^{-1} & D^{-1} + D^{-1}CS^{-1}BD^{-1} \end{bmatrix}$$
where $S = A - BD^{-1}C$

since $\Lambda = J^{-1}$,
$$\Lambda = \begin{bmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{bmatrix}^{-1} = \begin{bmatrix} S^{-1} & -S^{-1}J_{12}J_{22}^{-1} \\ -J_{22}^{-1}J_{21}S^{-1} & J_{22}^{-1} + J_{22}^{-1}J_{21}S^{-1}J_{12}J_{22}^{-1} \end{bmatrix}$$
where $S = J_{11} - J_{12}J_{22}^{-1}J_{21}$, which gives
$$\Lambda_{11} = (J_{11} - J_{12}J_{22}^{-1}J_{21})^{-1}$$
hence,
$$J' = \Lambda_{11}^{-1} = J_{11} - J_{12}J_{22}^{-1}J_{21}$$
Proof of $h' = J'm_1 = h_1 - J_{12}J_{22}^{-1}h_2$

notice that since
$$\Lambda = \begin{bmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{bmatrix}^{-1} = \begin{bmatrix} S^{-1} & -S^{-1}J_{12}J_{22}^{-1} \\ -J_{22}^{-1}J_{21}S^{-1} & J_{22}^{-1} + J_{22}^{-1}J_{21}S^{-1}J_{12}J_{22}^{-1} \end{bmatrix}$$
where $S = J_{11} - J_{12}J_{22}^{-1}J_{21}$, we know from $m = \Lambda h$ that
$$m_1 = \begin{bmatrix} S^{-1} & -S^{-1}J_{12}J_{22}^{-1} \end{bmatrix}\begin{bmatrix} h_1 \\ h_2 \end{bmatrix}$$
since $J' = S$, we have
$$h' = J'm_1 = \begin{bmatrix} I & -J_{12}J_{22}^{-1} \end{bmatrix}\begin{bmatrix} h_1 \\ h_2 \end{bmatrix} = h_1 - J_{12}J_{22}^{-1}h_2$$
conditioning is easy to compute when $x$ is in information form:
$$x_1 | x_2 \sim \mathcal{N}^{-1}(h_1 - J_{12}x_2,\; J_{11})$$
proof: treat $x_2$ as a constant to get
$$\mu(x_1|x_2) \propto \mu(x_1, x_2) \propto \exp\Big\{-\frac{1}{2}\begin{bmatrix} x_1^T & x_2^T \end{bmatrix}\begin{bmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} h_1^T & h_2^T \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\Big\}$$
$$\propto \exp\Big\{-\frac{1}{2}\big(x_1^T J_{11} x_1 + 2x_2^T J_{21} x_1\big) + h_1^T x_1\Big\} = \exp\Big\{-\frac{1}{2}x_1^T J_{11} x_1 + (h_1 - J_{12}x_2)^T x_1\Big\}$$

but complicated when $x$ is in covariance form:
$$x_1 | x_2 \sim \mathcal{N}(m', \Lambda')$$
where $m' = m_1 + \Lambda_{12}\Lambda_{22}^{-1}(x_2 - m_2)$ and $\Lambda' = \Lambda_{11} - \Lambda_{12}\Lambda_{22}^{-1}\Lambda_{21}$
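A small numerical check (assumed example) that the two conditioning parameterizations describe the same Gaussian:

```python
import numpy as np

# A sketch (assumed example) checking the conditioning formulas:
#   information form:  x1 | x2 ~ N^{-1}(h1 - J12 x2, J11)
#   covariance form:   m' = m1 + Lam12 Lam22^{-1} (x2 - m2),
#                      Lam' = Lam11 - Lam12 Lam22^{-1} Lam21
rng = np.random.default_rng(2)
d1, d2 = 2, 2
B = rng.standard_normal((d1 + d2, d1 + d2))
Lam = B @ B.T + np.eye(d1 + d2)
m = rng.standard_normal(d1 + d2)
J = np.linalg.inv(Lam)
h = J @ m
x2 = rng.standard_normal(d2)           # the observed value of x2

# information-form conditional
J_cond = J[:d1, :d1]
h_cond = h[:d1] - J[:d1, d1:] @ x2

# covariance-form conditional
L12_L22inv = Lam[:d1, d1:] @ np.linalg.inv(Lam[d1:, d1:])
m_cond = m[:d1] + L12_L22inv @ (x2 - m[d1:])
Lam_cond = Lam[:d1, :d1] - L12_L22inv @ Lam[d1:, :d1]

# the two parameterizations agree
assert np.allclose(np.linalg.inv(J_cond), Lam_cond)
assert np.allclose(np.linalg.solve(J_cond, h_cond), m_cond)
```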
Gaussian graphical model

theorem 1. for $x \sim \mathcal{N}(m, \Lambda)$, $x_i$ and $x_j$ are independent if and only if $\Lambda_{ij} = 0$
Q. for what other distributions does uncorrelatedness imply independence?

theorem 2. for $x \sim \mathcal{N}^{-1}(h, J)$, $x_i$ – $x_{V\setminus\{i,j\}}$ – $x_j$ (i.e., $x_i$ and $x_j$ are conditionally independent given the rest) if and only if $J_{ij} = 0$
Q. is it obvious?

graphical model representation of Gaussian random vectors:
- $J$ encodes the pairwise Markov independencies
- obtain the Gaussian graphical model by adding an edge whenever $J_{ij} \neq 0$:
$$\mu(x) \propto \exp\Big\{-\frac{1}{2}x^T J x + h^T x\Big\} = \prod_{i\in V}\underbrace{e^{-\frac{1}{2}x_i^T J_{ii} x_i + h_i^T x_i}}_{\psi_i(x_i)}\prod_{(i,j)\in E}\underbrace{e^{-x_i^T J_{ij} x_j}}_{\psi_{ij}(x_i, x_j)}$$
- is the pairwise Markov property enough? is a pairwise Markov random field enough?

problem: compute the marginals $\mu(x_i)$ when $G$ is a tree
- messages and marginals are Gaussian, completely specified by their mean and variance
- simple algebra suffices to compute the integration
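The edge-reading rule above lends itself to a one-liner; a small sketch (assumed example, not from the lecture):

```python
import numpy as np

# A small sketch (assumed example): reading the Gaussian graphical model off the
# sparsity pattern of the information matrix J, as described above -- an edge
# (i, j) is present exactly when J_ij != 0.
J = np.array([[ 3.0, -1.0,  0.0],
              [-1.0,  3.0, -1.0],
              [ 0.0, -1.0,  3.0]])   # a chain x1 - x2 - x3

edges = [(i, j) for i in range(len(J)) for j in range(i + 1, len(J))
         if not np.isclose(J[i, j], 0.0)]
print(edges)   # [(0, 1), (1, 2)]: x1 and x3 are conditionally independent given x2
```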
example: heredity of head dimensions [Frets 1921]
- estimated mean and covariance of the four-dimensional vector $(L_1, B_1, L_2, B_2)$
- the lengths and breadths of the first- and second-born sons are measured; 25 samples
- analyses by [Whittaker 1990] support the following Gaussian graphical model

(figure: the Gaussian graphical model on $L_1$, $B_1$, $L_2$, $B_2$)
example: mathematics scores [Whittaker 1990]
- examination scores of 88 students in 5 subjects
- empirical information matrix (on and above the diagonal) and covariance (below the diagonal):

             Mechanics  Vectors  Algebra  Analysis  Statistics
Mechanics        5.24     0.33     0.23      0.00        0.02
Vectors         -2.44    10.43     0.28      0.08        0.02
Algebra         -2.74    -4.71    26.95      0.43        0.36
Analysis         0.01    -0.79    -7.05      9.88        0.25
Statistics      -0.14    -0.17    -4.70     -2.02        6.45

(figure: the corresponding Gaussian graphical model on the five subjects; the near-zero information-matrix entries between {Mechanics, Vectors} and {Analysis, Statistics} suggest that Algebra separates the two pairs)
Gaussian belief propagation on trees

initialize the messages at the leaves as Gaussians (each $x_i$ can be either a scalar or a vector):
$$\nu_{i\to j}(x_i) = \psi_i(x_i) = e^{-\frac{1}{2}x_i^T J_{ii} x_i + h_i^T x_i} \sim \mathcal{N}^{-1}(h_{i\to j}, J_{i\to j})$$
where $h_{i\to j} = h_i$ and $J_{i\to j} = J_{ii}$

update the messages assuming $\nu_{k\to i}(x_k) \sim \mathcal{N}^{-1}(h_{k\to i}, J_{k\to i})$:
$$\nu_{i\to j}(x_i) = \psi_i(x_i)\prod_{k\in\partial i\setminus j}\int \psi_{ik}(x_i, x_k)\,\nu_{k\to i}(x_k)\,dx_k$$
evaluating the integration (= marginalizing a Gaussian):
$$\int \psi_{ik}(x_i, x_k)\,\nu_{k\to i}(x_k)\,dx_k = \int e^{-x_i^T J_{ik} x_k - \frac{1}{2}x_k^T J_{k\to i} x_k + h_{k\to i}^T x_k}\,dx_k$$
$$= \int \exp\Big\{-\frac{1}{2}\begin{bmatrix} x_i^T & x_k^T \end{bmatrix}\begin{bmatrix} 0 & J_{ik} \\ J_{ki} & J_{k\to i} \end{bmatrix}\begin{bmatrix} x_i \\ x_k \end{bmatrix} + \begin{bmatrix} 0 & h_{k\to i}^T \end{bmatrix}\begin{bmatrix} x_i \\ x_k \end{bmatrix}\Big\}\,dx_k \;\sim\; \mathcal{N}^{-1}\big(-J_{ik}J_{k\to i}^{-1}h_{k\to i},\; -J_{ik}J_{k\to i}^{-1}J_{ki}\big)$$
since this is evaluating the marginal of $x_i$ for $(x_i, x_k) \sim \mathcal{N}^{-1}\!\left(\begin{bmatrix} 0 \\ h_{k\to i} \end{bmatrix}, \begin{bmatrix} 0 & J_{ik} \\ J_{ki} & J_{k\to i} \end{bmatrix}\right)$
therefore, messages are also Gaussian, $\nu_{i\to j}(x_i) \sim \mathcal{N}^{-1}(h_{i\to j}, J_{i\to j})$, completely specified by two parameters: mean and variance

Gaussian belief propagation:
$$h_{i\to j} = h_i - \sum_{k\in\partial i\setminus j} J_{ik}J_{k\to i}^{-1}h_{k\to i}$$
$$J_{i\to j} = J_{ii} - \sum_{k\in\partial i\setminus j} J_{ik}J_{k\to i}^{-1}J_{ki}$$

the marginal can be computed as $x_i \sim \mathcal{N}^{-1}(\hat h_i, \hat J_i)$:
$$\hat h_i = h_i - \sum_{k\in\partial i} J_{ik}J_{k\to i}^{-1}h_{k\to i}$$
$$\hat J_i = J_{ii} - \sum_{k\in\partial i} J_{ik}J_{k\to i}^{-1}J_{ki}$$

for $x_i \in \mathbb{R}^d$, Gaussian BP requires $O(n\cdot d^3)$ operations on a tree
- each matrix inversion can be computed in $O(d^3)$ (e.g., Gaussian elimination)

if we instead compute the marginal of a single node by naively inverting the block $J_{22}$ of the entire graph,
$$x_1 \sim \mathcal{N}^{-1}\big(h_1 - J_{12}J_{22}^{-1}h_2,\; J_{11} - J_{12}J_{22}^{-1}J_{21}\big)$$
this requires $O\big((nd)^3\big)$ operations

(a minimal implementation sketch of these updates follows below)
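A minimal sketch of the recursions above for scalar variables (an assumption-laden illustration, not the lecture's implementation):

```python
import numpy as np

# A minimal sketch (assumed scalar variables) of the Gaussian BP recursions
# above on a tree-structured information matrix J with potential vector h.
# Messages nu_{i->j}(x_i) ~ N^{-1}(h_{i->j}, J_{i->j}).
def gaussian_bp_tree(J, h, n_rounds=None):
    n = len(h)
    nbrs = [[k for k in range(n) if k != i and J[i, k] != 0.0] for i in range(n)]
    # initialize every message with the node potential (exact for leaves)
    h_msg = {(i, j): h[i] for i in range(n) for j in nbrs[i]}
    J_msg = {(i, j): J[i, i] for i in range(n) for j in nbrs[i]}
    for _ in range(n_rounds or n):          # n synchronous rounds suffice on a tree
        h_new, J_new = {}, {}
        for i in range(n):
            for j in nbrs[i]:
                J_new[(i, j)] = J[i, i] - sum(
                    J[i, k] * J[k, i] / J_msg[(k, i)] for k in nbrs[i] if k != j)
                h_new[(i, j)] = h[i] - sum(
                    J[i, k] * h_msg[(k, i)] / J_msg[(k, i)] for k in nbrs[i] if k != j)
        h_msg, J_msg = h_new, J_new
    # node marginals: x_i ~ N^{-1}(h_hat_i, J_hat_i)
    J_hat = np.array([J[i, i] - sum(J[i, k] * J[k, i] / J_msg[(k, i)] for k in nbrs[i])
                      for i in range(n)])
    h_hat = np.array([h[i] - sum(J[i, k] * h_msg[(k, i)] / J_msg[(k, i)] for k in nbrs[i])
                      for i in range(n)])
    return h_hat / J_hat, 1.0 / J_hat       # marginal means and variances

# usage on a 3-node chain; results match the direct computation from J and h
J = np.array([[3.0, -1.0, 0.0], [-1.0, 3.0, -1.0], [0.0, -1.0, 3.0]])
h = np.array([1.0, 0.0, 2.0])
means, variances = gaussian_bp_tree(J, h)
assert np.allclose(means, np.linalg.solve(J, h))
assert np.allclose(variances, np.diag(np.linalg.inv(J)))
```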
connections to Gaussian elimination
- one way to view Gaussian BP: given $J$ and $h$, it computes $m = J^{-1}h$
- this implies that for any positive definite matrix $A$ with tree structure, we can use Gaussian BP to solve $Ax = b$ (i.e., $x = A^{-1}b$)
- example: Gaussian elimination (a numerical check follows below)
$$\begin{bmatrix} 4 & 2 \\ 2 & 3 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \end{bmatrix}$$
$$\begin{bmatrix} 4 - \frac{2}{3}\cdot 2 & \;2 - \frac{2}{3}\cdot 3 \\ 2 & 3 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 3 - \frac{2}{3}\cdot 3 \\ 3 \end{bmatrix} \quad\Longrightarrow\quad \begin{bmatrix} \frac{8}{3} & 0 \\ 2 & 3 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$$
- Gaussian elimination that exploits the tree structure by eliminating from the leaves is equivalent to Gaussian BP
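A quick numerical check of this example, reusing the `gaussian_bp_tree` sketch introduced above (a hypothetical helper, not part of the lecture):

```python
import numpy as np

# Reusing the gaussian_bp_tree sketch from above (a hypothetical helper) as a
# linear solver for a tree-structured positive definite A: the BP means are
# exactly x = A^{-1} b.
A = np.array([[4.0, 2.0], [2.0, 3.0]])
b = np.array([3.0, 3.0])
x_bp, _ = gaussian_bp_tree(A, b)
assert np.allclose(x_bp, np.linalg.solve(A, b))   # x = [3/8, 3/4]
```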
MAP configuration
- for Gaussian random vectors, the mean is the mode:
$$\max_x \exp\Big\{-\frac{1}{2}(x-m)^T\Lambda^{-1}(x-m)\Big\}$$
taking the gradient of the exponent,
$$\frac{\partial}{\partial x}\Big(-\frac{1}{2}(x-m)^T\Lambda^{-1}(x-m)\Big) = -\Lambda^{-1}(x-m)$$
hence the mode is $x^* = m$
Gaussian hidden Markov models

(figure: hidden Markov chain $x_0, x_1, \dots, x_6$ with observations $y_0, y_1, \dots, y_6$)

Gaussian HMM:
- states $x_t \in \mathbb{R}^d$, state transition matrix $A \in \mathbb{R}^{d\times d}$
- process noise $v_t \in \mathbb{R}^p$ with $v_t \sim \mathcal{N}(0, V)$ for some $V \in \mathbb{R}^{p\times p}$, and $B \in \mathbb{R}^{d\times p}$:
$$x_{t+1} = Ax_t + Bv_t, \qquad x_0 \sim \mathcal{N}(0, \Lambda_0)$$
- observations $y_t \in \mathbb{R}^{d'}$, $C \in \mathbb{R}^{d'\times d}$
- observation noise $w_t \sim \mathcal{N}(0, W)$ for some $W \in \mathbb{R}^{d'\times d'}$:
$$y_t = Cx_t + w_t$$
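A minimal simulation sketch of this model (the dimensions and parameter values below are assumptions for illustration, not from the lecture):

```python
import numpy as np

# A small sketch (assumed parameters) simulating the Gaussian HMM above:
#   x_{t+1} = A x_t + B v_t,  v_t ~ N(0, V),   y_t = C x_t + w_t,  w_t ~ N(0, W)
rng = np.random.default_rng(3)
d, p, d_obs, T = 2, 2, 1, 50
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B, V = np.eye(d, p), 0.1 * np.eye(p)
C, W = np.array([[1.0, 0.0]]), 0.5 * np.eye(d_obs)

x = rng.multivariate_normal(np.zeros(d), np.eye(d))   # x_0 ~ N(0, Lambda_0), Lambda_0 = I assumed
xs, ys = [], []
for _ in range(T):
    y = C @ x + rng.multivariate_normal(np.zeros(d_obs), W)   # observation
    xs.append(x)
    ys.append(y)
    x = A @ x + B @ rng.multivariate_normal(np.zeros(p), V)   # state transition
```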
in summary, for $H = BVB^T$:
$$x_0 \sim \mathcal{N}(0, \Lambda_0), \qquad x_{t+1}|x_t \sim \mathcal{N}(Ax_t, H), \qquad y_t|x_t \sim \mathcal{N}(Cx_t, W)$$
factorization:
$$\mu(x, y) = \mu(x_0)\mu(y_0|x_0)\mu(x_1|x_0)\mu(y_1|x_1)\cdots$$
$$\propto \exp\Big\{-\frac{1}{2}x_0^T\Lambda_0^{-1}x_0\Big\}\exp\Big\{-\frac{1}{2}(y_0 - Cx_0)^TW^{-1}(y_0 - Cx_0)\Big\}\exp\Big\{-\frac{1}{2}(x_1 - Ax_0)^TH^{-1}(x_1 - Ax_0)\Big\}\cdots$$
$$= \prod_{k=0}^{t}\psi_k(x_k)\prod_{k=1}^{t}\psi_{k-1,k}(x_{k-1}, x_k)\prod_{k=0}^{t}\phi_k(y_k)\prod_{k=0}^{t}\phi_{k,k}(x_k, y_k)$$
factorization:
$$\mu(x, y) \propto \prod_{k=0}^{t}\psi_k(x_k)\prod_{k=1}^{t}\psi_{k-1,k}(x_{k-1}, x_k)\prod_{k=0}^{t}\phi_k(y_k)\prod_{k=0}^{t}\phi_{k,k}(x_k, y_k)$$
$$\log\psi_k(x_k) = \begin{cases} -\frac{1}{2}x_0^T\underbrace{(\Lambda_0^{-1} + C^TW^{-1}C + A^TH^{-1}A)}_{\equiv J_0}\,x_0 & k = 0 \\ -\frac{1}{2}x_k^T\underbrace{(H^{-1} + C^TW^{-1}C + A^TH^{-1}A)}_{\equiv J_k}\,x_k & 0 < k < t \\ -\frac{1}{2}x_t^T\underbrace{(H^{-1} + C^TW^{-1}C)}_{\equiv J_t}\,x_t & k = t \end{cases}$$
a symmetric matrix $J$ is positive definite ($x^TJx > 0$ for all $x \neq 0$) if and only if it

1. has an eigendecomposition $J = UDU^T$ with all eigenvalues positive
   proof ($\Leftarrow$): $x^TJx = x^TUDU^Tx = \tilde x^TD\tilde x = \sum_i \tilde x_i^2 D_{ii} > 0$, where $\tilde x = U^Tx$
2. has a Cholesky decomposition: there exists a (unique) lower triangular matrix $L$ with strictly positive diagonal entries such that $J = L^TL$
   proof ($\Leftarrow$): $x^TJx = x^TL^TLx = \|Lx\|^2 > 0$
3. satisfies Sylvester's criterion: the leading principal minors are all positive (the $k$-th leading principal minor of a matrix $J$ is the determinant of its upper-left $k\times k$ submatrix)

still, there is no simple way to check whether $J$ is positive definite
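A short sketch (assumed example) exercising the three characterizations above with NumPy:

```python
import numpy as np

# A sketch (assumed example) of the three positive-definiteness checks above.
J = np.array([[4.0, 2.0], [2.0, 3.0]])

# 1. all eigenvalues are positive
assert np.all(np.linalg.eigvalsh(J) > 0)

# 2. a Cholesky factorization exists (np.linalg.cholesky raises otherwise);
#    NumPy's convention is J = L @ L.T with L lower triangular
L = np.linalg.cholesky(J)
assert np.allclose(L @ L.T, J)

# 3. Sylvester's criterion: all leading principal minors are positive
assert all(np.linalg.det(J[:k, :k]) > 0 for k in range(1, len(J) + 1))
```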
Sufficient conditions

toy example on $2\times 2$ symmetric matrices
$$\begin{bmatrix} a & b \\ b & c \end{bmatrix}$$
what is a necessary and sufficient condition for positive definiteness?
Sufficient conditions

sufficient condition 1. $J$ is positive definite if it is diagonally dominant, i.e.,
$$\sum_{j\neq i}|J_{ij}| < J_{ii} \quad \text{for all } i$$
- proof by Gershgorin's circle theorem

[Gershgorin's circle theorem] every eigenvalue of $A \in \mathbb{R}^{n\times n}$ lies within at least one of the Gershgorin discs, defined for each $i \in [n]$ as
$$D_i \equiv \Big\{x \in \mathbb{R} : |x - A_{ii}| \leq \sum_{j\neq i}|A_{ij}|\Big\}$$

(figure: Gershgorin discs of an example $4\times 4$ matrix)

Corollary. (symmetric) diagonally dominant matrices are positive definite
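A small sketch (assumed example matrix, not the one in the lecture's figure) checking diagonal dominance and computing the Gershgorin radii:

```python
import numpy as np

# A sketch (assumed example) checking diagonal dominance and computing the
# Gershgorin disc radii R_i = sum_{j != i} |J_ij| as in the theorem above.
J = np.array([[ 3.0, -1.0,  0.5],
              [-1.0,  4.0, -1.5],
              [ 0.5, -1.5,  3.0]])

radii = np.sum(np.abs(J), axis=1) - np.abs(np.diag(J))
is_diag_dominant = np.all(np.diag(J) > radii)
print(is_diag_dominant)                        # True for this J
# every eigenvalue lies in some interval [J_ii - R_i, J_ii + R_i], all right of 0
print(np.linalg.eigvalsh(J), np.diag(J) - radii)
```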
proof (for a more general complex-valued matrix $A$). consider an eigenvalue $\lambda \in \mathbb{C}$ and an eigenvector $x \in \mathbb{C}^n$ such that $Ax = \lambda x$. let $i$ denote the index of the maximum-magnitude entry of $x$, so that $|x_i| \geq |x_j|$ for all $j \neq i$. then it follows that
$$\sum_{j\in[n]}A_{ij}x_j = \lambda x_i$$
and
$$\sum_{j\neq i}A_{ij}x_j = (\lambda - A_{ii})x_i$$
dividing both sides by $x_i$ gives
$$|\lambda - A_{ii}| = \Big|\sum_{j\neq i}\frac{A_{ij}x_j}{x_i}\Big| \leq \sum_{j\neq i}\Big|\frac{A_{ij}x_j}{x_i}\Big| \leq \sum_{j\neq i}|A_{ij}| = R_i$$
when discs overlap, it is possible for a disc to contain no eigenvalue: for example,
$$\begin{bmatrix} 0 & 1 \\ 4 & 0 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} 1 & -2 \\ 1 & -1 \end{bmatrix}$$
have eigenvalues $\{-2, 2\}$ and $\{i, -i\}$, respectively

theorem. if a union of $k$ discs is disjoint from the union of the remaining $n-k$ discs, then the former union contains exactly $k$ eigenvalues and the latter $n-k$.

proof. let $B(t) \triangleq (1-t)\,\mathrm{diag}(A) + tA$ for $t \in [0,1]$, and note that the eigenvalues of $B(t)$ are continuous in $t$. $B(0)$ has its eigenvalues at the centers of the discs, and the eigenvalues $\{\lambda_i(t)\}_{i\in[n]}$ of $B(t)$ move away from these centers as $t$ increases; but by continuity, the $k$ eigenvalues starting in the first union of discs cannot escape that (expanding) union of discs.

counterexample? computational complexity?
Sufficient conditions

sufficient condition 2. $J$ is positive definite if it is pairwise normalizable, i.e., if the factorization $\mu(x) \propto \prod_i\psi_i(x_i)\prod_{(i,j)\in E}\psi_{ij}(x_i, x_j)$ can be chosen with
$$-\log\psi_i(x_i) = x_i^T a_i x_i + b_i^T x_i, \qquad -\log\psi_{ij}(x_i, x_j) = x_i^T a_{ij}^{(ii)} x_i + x_j^T a_{ij}^{(jj)} x_j + x_i^T a_{ij}^{(ij)} x_j$$
(with the node and edge terms adding back up to the entries of $J$, e.g., $J_{ii}$ is recovered from $a_i$ and $\sum_{j\in\partial i}a_{ij}^{(ii)}$) such that $a_i > 0$ for all $i$ and
$$\begin{bmatrix} a_{ij}^{(ii)} & \frac{1}{2}a_{ij}^{(ij)} \\ \frac{1}{2}a_{ij}^{(ij)} & a_{ij}^{(jj)} \end{bmatrix}$$
is positive semidefinite for every edge $(i,j)$
- positivity then follows: the quadratic form splits into strictly positive node terms $\sum_i a_i x_i^2$ plus positive semidefinite edge terms, so the density $\prod_i\psi_i\prod_{(i,j)}\psi_{ij}$ is a normalizable Gaussian, i.e., $\int f(x)g(x)\,dx < \infty$ for the node part $f$ and the (bounded) edge part $g$

(figure: two chain examples on $x_1$–$x_2$–$x_3$ with edge blocks $\begin{bmatrix} 2 & -1 \\ -1 & 3 \end{bmatrix}$, $\begin{bmatrix} 1 & 2 \\ 2 & 3 \end{bmatrix}$ and $\begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}$, $\begin{bmatrix} 2 & 2 \\ 2 & 3 \end{bmatrix}$)

counterexample? computational complexity?
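A sketch (assumed decomposition and matrices, not the lecture's example) of checking pairwise normalizability for a small chain:

```python
import numpy as np

# A sketch (assumed example) of checking pairwise normalizability: write
# x^T J x as per-node terms a_i x_i^2 (a_i > 0) plus per-edge 2x2 quadratic
# forms, and verify every edge block is positive semidefinite.
def is_pairwise_normalizable(node_terms, edge_blocks):
    ok_nodes = all(a > 0 for a in node_terms.values())
    ok_edges = all(np.all(np.linalg.eigvalsh(M) >= -1e-12) for M in edge_blocks.values())
    return ok_nodes and ok_edges

# chain x1 - x2 - x3 with J = [[2,-1,0],[-1,3,-1],[0,-1,2]]: one possible split
node_terms = {0: 1.0, 1: 1.0, 2: 1.0}
edge_blocks = {(0, 1): np.array([[1.0, -1.0], [-1.0, 1.0]]),
               (1, 2): np.array([[1.0, -1.0], [-1.0, 1.0]])}
print(is_pairwise_normalizable(node_terms, edge_blocks))   # True, so J is PD
```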
Correctness

there is little theoretical understanding of loopy belief propagation (except for graphs with a single loop). perhaps surprisingly, loopy belief propagation (if it converges) gives the correct means for Gaussian graphical models even if the graph has loops (convergence of the variances to the correct values is not guaranteed)

Theorem [Weiss, Freeman 2001; Rusmevichientong, Van Roy 2001]. If Gaussian belief propagation converges, then the expectations are computed correctly: let
$$\hat m_i^{(\ell)} \equiv \big(\hat J_i^{(\ell)}\big)^{-1}\hat h_i^{(\ell)}$$
where $\hat m_i^{(\ell)}$ is the belief propagation expectation after $\ell$ iterations, $\hat J_i^{(\ell)}$ the belief propagation information matrix after $\ell$ iterations, and $\hat h_i^{(\ell)}$ the belief propagation potential after $\ell$ iterations; if $\hat m_i^{(\infty)} \triangleq \lim_{\ell\to\infty}\hat m_i^{(\ell)}$ exists, then
$$\hat m_i^{(\infty)} = m_i$$
A detour: computation tree

what is $\hat m_i^{(\ell)}$? the computation tree $\mathrm{CT}_G(i;\ell)$ is the tree of $\ell$-step non-reversing walks on $G$ starting at $i$, rooted at $r = i_0$

(figure: a small loopy graph $G$ with nodes $i, a, b, c$ and its computation tree rooted at $r = i_0$, with projections such as $\pi(u) = b$, $\pi(v) = c$, $\pi(t) = \pi(s) = a$)

- we write $i, j, k, \dots, a, b, \dots$ for nodes in $G$ and $r, s, t, \dots$ for nodes in $\mathrm{CT}_G(i;\ell)$
- the potentials $\psi_i$ and $\psi_{ij}$ are copied to $\mathrm{CT}_G(i;\ell)$
- each node (edge) in $G$ corresponds to multiple nodes (edges) in $\mathrm{CT}_G(i;\ell)$; the natural projection is $\pi : \mathrm{CT}_G(i;\ell) \to G$, e.g., $\pi(t) = \pi(s) = a$ in the figure
What is $\hat m_i^{(\ell)}$?

Claim 1. $\hat m_i^{(\ell)}$ is $\hat m_r^{(\ell)}$, the expectation of $x_r$ with respect to the Gaussian model on $\mathrm{CT}_G(i;\ell)$
- proof of Claim 1: by induction over $\ell$
- idea: BP "does not know" whether it is operating on $G$ or on $\mathrm{CT}_G(i;\ell)$

recall that for Gaussians, the mode of $-\frac{1}{2}x^TJx + h^Tx$ is the mean $m$, hence $Jm = h$, and since $J$ is invertible (due to positive definiteness), $m = J^{-1}h$. locally, $m$ is the unique solution that satisfies all of the following equations for all $i \in V$:
$$J_{ii}m_i + \sum_{j\in\partial i}J_{ij}m_j = h_i$$
similarly, for the Gaussian graphical model on $\mathrm{CT}_G(i;\ell)$ (with variables $x_r$ and couplings $J_{rs}$), the estimated mean $\hat m^{(\ell)}$ is exact on the tree. precisely, since the diameter of the tree is at most $2\ell$, the BP updates on $\mathrm{CT}_G(i;\ell)$ converge to the correct marginals for $t \geq 2\ell$ and satisfy
$$J_{rr}\hat m_r^{(t)} + \sum_{s\in\partial r}J_{rs}\hat m_s^{(t)} = h_r$$
where $r$ is the root of the computation tree. in terms of the original information matrix $J$ and potential $h$,
$$J_{\pi(r),\pi(r)}\hat m_r^{(t)} + \sum_{s\in\partial r}J_{\pi(r),\pi(s)}\hat m_s^{(t)} = h_{\pi(r)}$$
since we copy $J$ and $h$ for each edge and node in $\mathrm{CT}_G(i;\ell)$.
- note that on the computation tree $\mathrm{CT}_G(i;\ell)$, $\hat m_r^{(t)} = \hat m_r^{(\ell)}$ for $t \geq \ell$, since the root $r$ is at most distance $\ell$ away from any node
- similarly, for a neighbor $s$ of the root $r$, $\hat m_s^{(t)} = \hat m_s^{(\ell+1)}$ for $t \geq \ell+1$, since $s$ is at most distance $\ell+1$ away from any node
- hence we can write the above equation as
$$J_{\pi(r),\pi(r)}\hat m_r^{(\ell)} + \sum_{s\in\partial r}J_{\pi(r),\pi(s)}\hat m_s^{(\ell+1)} = h_{\pi(r)} \qquad (1)$$
if BP converges, then
$$\lim_{\ell\to\infty}\hat m_i^{(\ell)} = \hat m_i^{(\infty)}$$
we claim that $\lim_{\ell\to\infty}\hat m_r^{(\ell)} = \hat m_{\pi(r)}^{(\infty)}$, since
$$\lim_{\ell\to\infty}\hat m_r^{(\ell)} = \lim_{\ell\to\infty}\hat m_{\pi(r)}^{(\ell)} \;\;\text{(by Claim 1)}\; = \hat m_{\pi(r)}^{(\infty)} \;\;\text{(by the convergence assumption)}$$
we can generalize this argument (without explicitly proving it in this lecture) to claim that, for a neighbor $s$ of the root $r$ in the computation tree $\mathrm{CT}_G(i;\ell)$,
$$\lim_{\ell\to\infty}\hat m_s^{(\ell+1)} = \hat m_{\pi(s)}^{(\infty)}$$
Convergence

from Eq. (1), we have
$$J_{\pi(r),\pi(r)}\hat m_r^{(\ell)} + \sum_{s\in\partial r}J_{\pi(r),\pi(s)}\hat m_s^{(\ell+1)} = h_{\pi(r)}$$
taking the limit $\ell\to\infty$,
$$J_{\pi(r),\pi(r)}\hat m_{\pi(r)}^{(\infty)} + \sum_{s\in\partial r}J_{\pi(r),\pi(s)}\hat m_{\pi(s)}^{(\infty)} = h_{\pi(r)}$$
hence, assuming convergence, BP is exact on the original graph with loops, i.e., BP is correct:
$$J_{ii}\hat m_i^{(\infty)} + \sum_{j\in\partial i}J_{ij}\hat m_j^{(\infty)} = h_i \quad\text{for all } i, \qquad\text{i.e.,}\qquad J\hat m^{(\infty)} = h$$
What have we achieved?

- complexity? convergence?
- correlation decay: the influence of the leaf nodes of the computation tree decreases as the iterations increase
- understanding BP on a broader class of graphical models (loopy belief propagation) helps clarify empirical performance results (e.g., Turbo codes)
Gaussian Belief Propagation (GBP)

sufficient conditions for convergence and correctness of GBP:
- Rusmevichientong and Van Roy (2001), Wainwright, Jaakkola, Willsky (2003): if the means converge, then they are correct
- Weiss and Freeman (2001): if the information matrix is diagonally dominant, then GBP converges
- convergence is known for trees and for attractive, non-frustrated, and diagonally dominant Gaussian graphical models
- Malioutov, Johnson, Willsky (2006): walk-summable graphical models converge (this includes all of the known cases above)
- Moallemi and Van Roy (2006): if pairwise normalizable, then consensus propagation converges