Machine learning for vision
Binary Restricted Boltzmann Machines
Lecture 8, October 16, 2013
Roland Memisevic, Fall 2013

Binary Restricted Boltzmann Machines

- RBMs define a joint distribution over data x and hidden variables s using the energy function

    E(x, s) = -\sum_{i=1}^{D} \sum_{j=1}^{K} W_{ij} x_i s_j - \sum_j b_j s_j - \sum_i c_i x_i

- Exponentiate and normalize to get the distribution:

    p(x, s) = \frac{1}{Z} \exp\big(-E(x, s)\big)
            = \frac{1}{Z} \exp\Big( \sum_{ij} W_{ij} x_i s_j + \sum_j b_j s_j + \sum_i c_i x_i \Big)

  with Z = \sum_{s,x} \exp\big(-E(x, s)\big)
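As a concrete illustration (my addition, not from the slides), here is a minimal NumPy sketch of the energy and the unnormalized joint, assuming hypothetical parameter arrays W, b, c and binary vectors x, s:

```python
import numpy as np

def rbm_energy(x, s, W, b, c):
    """Energy E(x, s) = -x^T W s - b^T s - c^T x of a binary RBM.

    x: (D,) binary visible vector, s: (K,) binary hidden vector,
    W: (D, K) weights, b: (K,) hidden biases, c: (D,) visible biases.
    """
    return -x @ W @ s - b @ s - c @ x

def unnormalized_joint(x, s, W, b, c):
    """exp(-E(x, s)); the partition function Z is intractable in general."""
    return np.exp(-rbm_energy(x, s, W, b, c))
```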

Binary RBM marginals

- The log-probability of an observation x is:

    \log p(x) = \log \sum_s p(x, s)
              = -\log Z + \log \sum_s \exp\Big( \sum_i c_i x_i + \sum_j s_j \big( b_j + \sum_i W_{ij} x_i \big) \Big)
              = -\log Z + \sum_i c_i x_i + \log \sum_{s_1} \cdots \sum_{s_K} \prod_j \exp\Big( s_j \big( b_j + \sum_i W_{ij} x_i \big) \Big)
              = -\log Z + \sum_i c_i x_i + \log \Big[ \sum_{s_1} \exp\big( s_1 ( b_1 + \sum_i W_{i1} x_i ) \big) \cdots \sum_{s_K} \exp\big( s_K ( b_K + \sum_i W_{iK} x_i ) \big) \Big]
              = -\log Z + \sum_i c_i x_i + \sum_j \log\Big( 1 + \exp\big( b_j + \sum_i W_{ij} x_i \big) \Big)

- The density over x is ICA-like:

    p(x) \propto \exp\Big( \sum_i c_i x_i \Big) \prod_j \Big( 1 + \exp\big( b_j + \sum_i W_{ij} x_i \big) \Big)

  So, up to the bias factor \exp(\sum_i c_i x_i), the probability is proportional to a product over K terms, each of which is based on a filter response w_j^T x.
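This closed form can be checked numerically (my addition, not from the slides): for a tiny RBM with arbitrary parameters we can sum over all 2^K hidden configurations by brute force and compare against the softplus expression derived above.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
D, K = 4, 3
W = rng.normal(size=(D, K))
b = rng.normal(size=K)
c = rng.normal(size=D)
x = rng.integers(0, 2, size=D).astype(float)

# Brute force: log sum_s exp(-E(x, s)), i.e. log p(x) + log Z
brute = np.logaddexp.reduce([
    x @ W @ np.array(s) + b @ np.array(s) + c @ x
    for s in itertools.product([0.0, 1.0], repeat=K)
])

# Closed form: sum_i c_i x_i + sum_j log(1 + exp(b_j + sum_i W_ij x_i))
closed = c @ x + np.sum(np.logaddexp(0.0, b + W.T @ x))

assert np.allclose(brute, closed)
```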

RBM graphical model

- (Figure from the slides, not recoverable from this extraction: the bipartite graph of an RBM, with visible units x_i in one layer, hidden units s_j in the other, and weights W_ij connecting the two layers only; there are no connections within a layer.)

Tractable computations

- Computing the probability is not tractable in general, because we cannot compute (log) Z.
- But inferring the feature responses is easy.
- We can also compare the (log-)probabilities of two points x_1, x_2, because (log) Z cancels in the ratio (difference). So we can always tell which point is more likely under the model.

Binary RBM conditional over hiddens

- The conditional over the hiddens is

    p(s|x) = \frac{1}{\Omega_1} p(x, s) = \frac{1}{\Omega_1 Z} \exp\Big( \sum_{ij} W_{ij} x_i s_j + \sum_j b_j s_j + \sum_i c_i x_i \Big)
           = \frac{1}{\Omega_2} \exp\Big( \sum_j s_j \big( b_j + \sum_i W_{ij} x_i \big) \Big)
           = \frac{1}{\Omega_2} \prod_j \exp\Big( s_j \big( b_j + \sum_i W_{ij} x_i \big) \Big)

  where \Omega_1 and \Omega_2 collect factors that do not depend on s.
- This is a product over all s_j, so the hiddens are independent, given the data.
- For each s_j we have p(s_j|x) \propto \exp\big( s_j ( b_j + \sum_i W_{ij} x_i ) \big). But s_j is binary, so

    p(s_j = 1|x) = \frac{\exp\big( b_j + \sum_i W_{ij} x_i \big)}{1 + \exp\big( b_j + \sum_i W_{ij} x_i \big)} = \frac{1}{1 + \exp\big( -( b_j + \sum_i W_{ij} x_i ) \big)}

- This is just a logistic sigmoid applied to a linear projection of the data.
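A vectorized sketch of this inference step (my own illustration, with hypothetical parameter names matching the earlier snippets):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hidden_conditional(x, W, b):
    """p(s_j = 1 | x) for all j at once: a sigmoid of a linear projection of x."""
    return sigmoid(b + W.T @ x)          # shape (K,)

def sample_hiddens(x, W, b, rng):
    """Because the s_j are conditionally independent, all of them can be sampled in parallel."""
    p = hidden_conditional(x, W, b)
    return (rng.random(p.shape) < p).astype(float)
```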

Binary RBM conditional over data

- Using an analogous calculation for x, one gets:

    p(x_i = 1|s) = \frac{1}{1 + \exp\big( -( c_i + \sum_j W_{ij} s_j ) \big)}

- It turns out that if we change the RBM energy function to

    E(x, s) = -\sum_{i=1}^{D} \sum_{j=1}^{K} W_{ij} x_i s_j - \sum_j b_j s_j + \sum_i \frac{(x_i - c_i)^2}{2\sigma^2}

  the conditional over x turns into a Gaussian with mean c_i + \sum_j W_{ij} s_j and variance \sigma^2 in each dimension.
- Similarly, one can turn the conditionals over data or hiddens into any exponential-family distribution.

RBM training

- Neither the log-probability nor its derivative is tractable.
- So one has to use approximations for learning.
- One such approximation is to use Gibbs sampling. This leads to a learning method known as contrastive divergence.
- Like K-means, it comes down to the combination of a Hebbian term and "active forgetting".
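A matching sketch for the other direction (again my own illustration; for the Gaussian-visible variant it uses the mean and variance as stated on the slide above):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def visible_conditional(s, W, c):
    """Binary visibles: p(x_i = 1 | s), a sigmoid of a linear projection of s."""
    return sigmoid(c + W @ s)            # shape (D,)

def sample_visibles_gaussian(s, W, c, sigma, rng):
    """Gaussian-visible variant: mean c_i + sum_j W_ij s_j, variance sigma^2 per dimension."""
    mean = c + W @ s
    return mean + sigma * rng.normal(size=mean.shape)
```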

Binary RBM marginals and free energy

- Recall that we can compute p(x) up to a normalizing constant:

    p(x) = \sum_s p(x, s) = \frac{1}{Z} \sum_s \exp\big(-E(x, s)\big) \propto \exp\Big( \sum_i c_i x_i \Big) \prod_j \Big( 1 + \exp\big( b_j + \sum_i W_{ij} x_i \big) \Big)

- To derive learning rules it can be convenient to write this as

    p(x) = \frac{1}{Z} \exp\big( -F(x) \big)   with   F(x) = -\log \sum_s \exp\big( -E(x, s) \big)

- F(x) is known as the Free Energy (we can compute it).
- For the binary RBM the Free Energy can be written:

    F(x) = -\sum_j \log\Big( 1 + \exp\big( b_j + \sum_i W_{ij} x_i \big) \Big) - \sum_i c_i x_i

- The Free Energy is convenient to work with because it depends only on x and not on s.
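A minimal sketch (my addition) of the free energy and of the comparison trick from the "Tractable computations" slide: since log p(x) = -F(x) - log Z, the intractable log Z cancels when two points are compared.

```python
import numpy as np

def free_energy(x, W, b, c):
    """F(x) = -sum_j log(1 + exp(b_j + sum_i W_ij x_i)) - sum_i c_i x_i."""
    return -np.sum(np.logaddexp(0.0, b + W.T @ x)) - c @ x

def log_prob_ratio(x1, x2, W, b, c):
    """log p(x1) - log p(x2) = F(x2) - F(x1); tells us which point is more likely."""
    return free_energy(x2, W, b, c) - free_energy(x1, W, b, c)
```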

Binary RBM derivatives

- To train the RBM we need the derivative of the log-probability of the data.
- Let \theta denote some parameter in the model. The derivative of \log p(x^*) for some training point x^* is

    \frac{\partial \log p(x^*)}{\partial \theta} = -\frac{\partial F(x^*)}{\partial \theta} + \frac{1}{Z} \sum_x \exp\big(-F(x)\big) \frac{\partial F(x)}{\partial \theta}
                                                 = -\frac{\partial F(x^*)}{\partial \theta} + \sum_x p(x) \frac{\partial F(x)}{\partial \theta}

- Furthermore, we have

    \frac{\partial F(x^*)}{\partial \theta} = \sum_s \frac{\exp\big(-E(x^*, s)\big)}{\sum_{s'} \exp\big(-E(x^*, s')\big)} \frac{\partial E(x^*, s)}{\partial \theta} = \sum_s p(s|x^*) \frac{\partial E(x^*, s)}{\partial \theta}

- Finally, we have

    \frac{\partial E(x, s)}{\partial W_{ij}} = -x_i s_j, \qquad \frac{\partial E(x, s)}{\partial b_j} = -s_j, \qquad \frac{\partial E(x, s)}{\partial c_i} = -x_i
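Because E[s_j | x*] = p(s_j = 1 | x*), the expectation over p(s|x*) in the first term has a closed form. A sketch (my addition, hypothetical names as before):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def positive_phase_grads(x, W, b, c):
    """-E_{p(s|x)}[dE/dtheta] for a single training point x (the data-driven term)."""
    q = sigmoid(b + W.T @ x)             # E[s_j | x]
    dW = np.outer(x, q)                  # -dE/dW_ij = x_i s_j, averaged over p(s|x)
    db = q                               # -dE/db_j  = s_j
    dc = x                               # -dE/dc_i  = x_i
    return dW, db, dc
```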

Binary RBM derivatives

- Putting everything together yields:

    \frac{\partial \log p(x^*)}{\partial \theta} = -\sum_s p(s|x^*) \frac{\partial E(x^*, s)}{\partial \theta} + \sum_{x,s} p(x, s) \frac{\partial E(x, s)}{\partial \theta}

  with

    \frac{\partial E(x, s)}{\partial W_{ij}} = -x_i s_j, \qquad \frac{\partial E(x, s)}{\partial b_j} = -s_j, \qquad \frac{\partial E(x, s)}{\partial c_i} = -x_i

- This is the sum of two terms: (i) an expectation under p(s|x^*) and (ii) an expectation under p(x, s).
- The second term is, in general, not tractable, so we have to approximate it.

Approximating the expectation

- But the second term is just an expectation under the model.
- Consider the expectation of some function f(z) under some distribution p(z):

    E[f] = \int f(z) p(z) \, dz

- If we have L IID samples z^{(l)} from p(z), we can approximate the expectation using the empirical expectation

    \hat{f} = \frac{1}{L} \sum_l f\big(z^{(l)}\big)

- Note that E[\hat{f}] = E[f].
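A tiny illustration of this Monte Carlo estimate (my addition): with f(z) = z^2 and p(z) a standard normal, the true expectation is 1, and the empirical average approaches it.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 100_000
z = rng.normal(size=L)           # L IID samples z^(l) from p(z) = N(0, 1)
f_hat = np.mean(z ** 2)          # empirical expectation (1/L) sum_l f(z^(l))
print(f_hat)                     # close to the true expectation E[z^2] = 1
```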

Digression: Gibbs Sampling

- Goal: get a sample (a set of points z) from some distribution p(z) = p(z_1, \ldots, z_M).
- Solution: define a Markov chain in z-space. If p(z) is an invariant distribution of the chain, and under some additional technical assumptions ("detailed balance", "ergodicity"), running the Markov chain long enough will yield samples from p(z).
- Gibbs sampling is one way to define such a Markov chain: it works in the case where we can sample from each conditional

    p(z_i | z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_M)

  given all the other variables. (This is the case for the RBM.)
- It amounts to cycling through all variables (randomly or in some specific order) and updating z by sampling the conditionals:

Gibbs Sampling:
- Initialize all z_i.
- For \tau = 1, \ldots, T:
  - Sample z_1^{(\tau+1)} from p(z_1 | z_2^{(\tau)}, z_3^{(\tau)}, \ldots, z_M^{(\tau)})
  - Sample z_2^{(\tau+1)} from p(z_2 | z_1^{(\tau+1)}, z_3^{(\tau)}, \ldots, z_M^{(\tau)})
  - ...
  - Sample z_j^{(\tau+1)} from p(z_j | z_1^{(\tau+1)}, \ldots, z_{j-1}^{(\tau+1)}, z_{j+1}^{(\tau)}, \ldots, z_M^{(\tau)})
  - ...
  - Sample z_M^{(\tau+1)} from p(z_M | z_1^{(\tau+1)}, z_2^{(\tau+1)}, \ldots, z_{M-1}^{(\tau+1)})

  (see e.g. Bishop 2006)

- Recipe: run this for a while, then freeze the process and you will have a sample z from p(z).
- Samples from nearby iterations will be highly correlated, so to get IID samples one needs to discard many intermediate samples. (Or run many chains in parallel.)

- End of Digression -

RBM Gibbs sampling

- For the RBM, because of the conditional independencies, it is easy to sample from p(s|x) and from p(x|s).
- We can sample the whole vector s from p(s|x) and the whole vector x from p(x|s) at once. Some call this "block Gibbs sampling".
- After drawing L samples (x^{(l)}, s^{(l)}) from the joint, we can approximate the second term in the RBM derivative using

    \frac{1}{L} \sum_{l=1}^{L} \frac{\partial E\big(x^{(l)}, s^{(l)}\big)}{\partial \theta}

- For learning, iterate parameter updates and sampling.
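A sketch of this block Gibbs chain for the RBM (my addition, reusing the hypothetical conventions from the earlier snippets):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def block_gibbs(x0, W, b, c, steps, rng):
    """Alternate s ~ p(s|x) and x ~ p(x|s); returns the final (x, s) pair (steps >= 1)."""
    x = x0.copy()
    for _ in range(steps):
        q = sigmoid(b + W.T @ x)                       # p(s_j = 1 | x), all j at once
        s = (rng.random(q.shape) < q).astype(float)
        r = sigmoid(c + W @ s)                         # p(x_i = 1 | s), all i at once
        x = (rng.random(r.shape) < r).astype(float)
    return x, s
```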

RBM learning intuition

- Sampling from the model will produce "fantasy data" according to the model distribution.
- The weight update for the first term in the derivative is called the "positive phase". It is easy and efficient, because it involves only the conditional p(s|x).
- The update for the second term is called the "negative phase", and it involves prolonged sampling to reach the equilibrium joint distribution p(x, s).
- Since the fantasy data enters the update equation with a negative sign, learning has the effect of increasing the probability where the training data is, and reducing it in proportion to where the model thinks it should be.
- This intuition suggests a short-cut to speed up the sampling, consisting of two modifications to standard Gibbs sampling:
  1. Start sampling at the data.
  2. Rather than waiting to reach equilibrium, sample for just a few steps (for example, for one step).
- As we saw before, this has the effect of increasing p(x) at the data and decreasing p(x) everywhere else.

Contrastive divergence training

- This is known as contrastive divergence (CD) learning.
- For one-step CD, the positive phase consists of computing, or sampling from, p(s|x), and the negative phase of sampling from p(x|s).
- The result is a model that is good near the data and potentially bad elsewhere in the data space (but that may not matter!). A sketch of the resulting parameter update follows below.
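A minimal CD-1 update sketch (my addition; hypothetical names, a single binary training vector x, learning rate lr):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(x, W, b, c, lr, rng):
    """One CD-1 step for a single binary training vector x; returns updated parameters."""
    # Positive phase: hidden probabilities given the data.
    q0 = sigmoid(b + W.T @ x)
    s0 = (rng.random(q0.shape) < q0).astype(float)
    # Negative phase: one block-Gibbs step starting at the data ("fantasy" reconstruction).
    r1 = sigmoid(c + W @ s0)
    x1 = (rng.random(r1.shape) < r1).astype(float)
    q1 = sigmoid(b + W.T @ x1)
    # Gradient estimate: data-driven term minus model-driven term.
    W_new = W + lr * (np.outer(x, q0) - np.outer(x1, q1))
    b_new = b + lr * (q0 - q1)
    c_new = c + lr * (x - x1)
    return W_new, b_new, c_new
```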

Encouraging sparsity

- The RBM hiddens are typically not all that sparse, and the filters do not always resemble simple-cell features.
- It is common to encourage hidden variables to be sparse(r) by adding a penalty function to the learning objective, like

    \lambda \sum_{j=1}^{K} \Big( B - \frac{1}{N} \sum_{n=1}^{N} p(s_j = 1 | x^n) \Big)^2

  where B is a small positive number (for example, B = 0.1) and \lambda is a sparsity parameter.
- This will pull hidden-variable activities towards B during training. (A small sketch of the penalty follows below.)
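A small sketch of this penalty and of its gradient with respect to the hidden biases (my addition; W, b are the hypothetical parameters from the earlier snippets, X is an N x D batch of training vectors):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sparsity_penalty(X, W, b, target=0.1, lam=1.0):
    """lambda * sum_j (B - mean_n p(s_j = 1 | x^n))^2 and its gradient w.r.t. b."""
    Q = sigmoid(X @ W + b)                 # (N, K) hidden probabilities
    q_bar = Q.mean(axis=0)                 # mean activity of each hidden unit
    penalty = lam * np.sum((target - q_bar) ** 2)
    # d penalty / d b_j = -2 * lambda * (B - q_bar_j) * mean_n Q_nj (1 - Q_nj)
    grad_b = -2.0 * lam * (target - q_bar) * (Q * (1.0 - Q)).mean(axis=0)
    return penalty, grad_b
```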

Stacking

- After training, we can infer the s for the data and train another model on the inferred s.
- Since the s depend non-linearly on the data, this won't be a vacuous operation. (For a linear model it would be.)
- This is what gave rise to the term "deep learning" in 2006.
- It allows us to train a feature hierarchy greedily, layer by layer.
- It is now common to stack other non-linear models, such as autoencoders.
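To make the layer-by-layer idea concrete, here is a compact, self-contained sketch (my own construction, not the course code): a tiny CD-1 trainer is applied to the data, and then again to the inferred hidden activities. The data and layer sizes are dummy placeholders.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_rbm_cd1(X, K, lr=0.01, epochs=5, seed=0):
    """Fit one binary RBM with CD-1 on the rows of X (greedy layer-wise building block)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = 0.01 * rng.normal(size=(D, K))
    b, c = np.zeros(K), np.zeros(D)
    for _ in range(epochs):
        for x in X:
            q0 = sigmoid(b + W.T @ x)                              # positive phase
            s0 = (rng.random(K) < q0).astype(float)
            x1 = (rng.random(D) < sigmoid(c + W @ s0)).astype(float)
            q1 = sigmoid(b + W.T @ x1)                             # negative phase (one step)
            W += lr * (np.outer(x, q0) - np.outer(x1, q1))
            b += lr * (q0 - q1)
            c += lr * (x - x1)
    return W, b, c

# Dummy binary "data" just to make the sketch runnable end to end.
X = (np.random.default_rng(1).random((500, 20)) < 0.3).astype(float)

W1, b1, c1 = train_rbm_cd1(X, K=16)        # layer 1 on the data
H1 = sigmoid(X @ W1 + b1)                  # non-linear feature responses (could also be sampled)
W2, b2, c2 = train_rbm_cd1(H1, K=8)        # layer 2 on the inferred hiddens
```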
