Machine learning for vision
Binary Restricted Boltzmann Machines
Lecture 8, October 16, 2013
Roland Memisevic, Fall 2013

Binary Restricted Boltzmann Machines

- RBMs define a joint distribution over data x and hidden variables s using the energy function

    E(x, s) = -\sum_{i=1}^{D} \sum_{j=1}^{K} W_{ij} x_i s_j - \sum_j b_j s_j - \sum_i c_i x_i

- Exponentiate and normalize to get the distribution:

    p(x, s) = \frac{1}{Z} \exp\big(-E(x, s)\big)
            = \frac{1}{Z} \exp\Big( \sum_{ij} W_{ij} x_i s_j + \sum_j b_j s_j + \sum_i c_i x_i \Big)

  with Z = \sum_{s,x} \exp\big(-E(x, s)\big)
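As a concrete illustration (my addition, not from the slides), here is a minimal NumPy sketch of the energy and the unnormalized joint, assuming hypothetical parameter arrays W, b, c and binary vectors x, s:

```python
import numpy as np

def rbm_energy(x, s, W, b, c):
    """Energy E(x, s) = -x^T W s - b^T s - c^T x of a binary RBM.

    x: (D,) binary visible vector, s: (K,) binary hidden vector,
    W: (D, K) weights, b: (K,) hidden biases, c: (D,) visible biases.
    """
    return -x @ W @ s - b @ s - c @ x

def unnormalized_joint(x, s, W, b, c):
    """exp(-E(x, s)); the partition function Z is intractable in general."""
    return np.exp(-rbm_energy(x, s, W, b, c))
```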

Binary RBM marginals

- The log-probability of an observation x is:

    \log p(x) = \log \sum_s p(x, s)
              = -\log Z + \log \sum_s \exp\Big( \sum_i c_i x_i + \sum_j s_j \big( b_j + \sum_i W_{ij} x_i \big) \Big)
              = -\log Z + \sum_i c_i x_i + \log \sum_{s_1} \cdots \sum_{s_K} \prod_j \exp\Big( s_j \big( b_j + \sum_i W_{ij} x_i \big) \Big)
              = -\log Z + \sum_i c_i x_i + \log \Big[ \sum_{s_1} \exp\big( s_1 ( b_1 + \sum_i W_{i1} x_i ) \big) \cdots \sum_{s_K} \exp\big( s_K ( b_K + \sum_i W_{iK} x_i ) \big) \Big]
              = -\log Z + \sum_i c_i x_i + \sum_j \log\Big( 1 + \exp\big( b_j + \sum_i W_{ij} x_i \big) \Big)

- The density over x is ICA-like:

    p(x) \propto \exp\Big( \sum_i c_i x_i \Big) \prod_j \Big( 1 + \exp\big( b_j + \sum_i W_{ij} x_i \big) \Big)

  So, up to the bias factor \exp(\sum_i c_i x_i), the probability is proportional to a product over K terms, each of which is based on a filter response w_j^T x.
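This closed form can be checked numerically (my addition, not from the slides): for a tiny RBM with arbitrary parameters we can sum over all 2^K hidden configurations by brute force and compare against the softplus expression derived above.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
D, K = 4, 3
W = rng.normal(size=(D, K))
b = rng.normal(size=K)
c = rng.normal(size=D)
x = rng.integers(0, 2, size=D).astype(float)

# Brute force: log sum_s exp(-E(x, s)), i.e. log p(x) + log Z
brute = np.logaddexp.reduce([
    x @ W @ np.array(s) + b @ np.array(s) + c @ x
    for s in itertools.product([0.0, 1.0], repeat=K)
])

# Closed form: sum_i c_i x_i + sum_j log(1 + exp(b_j + sum_i W_ij x_i))
closed = c @ x + np.sum(np.logaddexp(0.0, b + W.T @ x))

assert np.allclose(brute, closed)
```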

RBM graphical model

- (Figure from the slides, not recoverable from this extraction: the bipartite graph of an RBM, with visible units x_i in one layer, hidden units s_j in the other, and weights W_ij connecting the two layers only; there are no connections within a layer.)

Tractable computations

- Computing the probability is not tractable in general, because we cannot compute (log) Z.
- But inferring the feature responses is easy.
- We can also compare the (log-)probabilities of two points x_1, x_2, because (log) Z cancels in the ratio (difference). So we can always tell which point is more likely under the model.

Binary RBM conditional over hiddens

- The conditional over the hiddens is

    p(s|x) = \frac{1}{\Omega_1} p(x, s) = \frac{1}{\Omega_1 Z} \exp\Big( \sum_{ij} W_{ij} x_i s_j + \sum_j b_j s_j + \sum_i c_i x_i \Big)
           = \frac{1}{\Omega_2} \exp\Big( \sum_j s_j \big( b_j + \sum_i W_{ij} x_i \big) \Big)
           = \frac{1}{\Omega_2} \prod_j \exp\Big( s_j \big( b_j + \sum_i W_{ij} x_i \big) \Big)

  where \Omega_1 and \Omega_2 collect factors that do not depend on s.
- This is a product over all s_j, so the hiddens are independent, given the data.
- For each s_j we have p(s_j|x) \propto \exp\big( s_j ( b_j + \sum_i W_{ij} x_i ) \big). But s_j is binary, so

    p(s_j = 1|x) = \frac{\exp\big( b_j + \sum_i W_{ij} x_i \big)}{1 + \exp\big( b_j + \sum_i W_{ij} x_i \big)} = \frac{1}{1 + \exp\big( -( b_j + \sum_i W_{ij} x_i ) \big)}

- This is just a logistic sigmoid applied to a linear projection of the data.
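A vectorized sketch of this inference step (my own illustration, with hypothetical parameter names matching the earlier snippets):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hidden_conditional(x, W, b):
    """p(s_j = 1 | x) for all j at once: a sigmoid of a linear projection of x."""
    return sigmoid(b + W.T @ x)          # shape (K,)

def sample_hiddens(x, W, b, rng):
    """Because the s_j are conditionally independent, all of them can be sampled in parallel."""
    p = hidden_conditional(x, W, b)
    return (rng.random(p.shape) < p).astype(float)
```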

Binary RBM conditional over data

- Using an analogous calculation for x, one gets:

    p(x_i = 1|s) = \frac{1}{1 + \exp\big( -( c_i + \sum_j W_{ij} s_j ) \big)}

- It turns out that if we change the RBM energy function to

    E(x, s) = -\sum_{i=1}^{D} \sum_{j=1}^{K} W_{ij} x_i s_j - \sum_j b_j s_j + \sum_i \frac{(x_i - c_i)^2}{2\sigma^2}

  the conditional over x turns into a Gaussian with mean c_i + \sum_j W_{ij} s_j and variance \sigma^2 in each dimension.
- Similarly, one can turn the conditionals over data or hiddens into any exponential-family distribution.

RBM training

- Neither the log-probability nor its derivative is tractable.
- So one has to use approximations for learning.
- One such approximation is to use Gibbs sampling. This leads to a learning method known as contrastive divergence.
- Like K-means, it comes down to the combination of a Hebbian term and "active forgetting".
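A matching sketch for the other direction (again my own illustration; for the Gaussian-visible variant it uses the mean and variance as stated on the slide above):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def visible_conditional(s, W, c):
    """Binary visibles: p(x_i = 1 | s), a sigmoid of a linear projection of s."""
    return sigmoid(c + W @ s)            # shape (D,)

def sample_visibles_gaussian(s, W, c, sigma, rng):
    """Gaussian-visible variant: mean c_i + sum_j W_ij s_j, variance sigma^2 per dimension."""
    mean = c + W @ s
    return mean + sigma * rng.normal(size=mean.shape)
```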

Binary RBM marginals and free energy

- Recall that we can compute p(x) up to a normalizing constant:

    p(x) = \sum_s p(x, s) = \frac{1}{Z} \sum_s \exp\big(-E(x, s)\big) \propto \exp\Big( \sum_i c_i x_i \Big) \prod_j \Big( 1 + \exp\big( b_j + \sum_i W_{ij} x_i \big) \Big)

- To derive learning rules it can be convenient to write this as

    p(x) = \frac{1}{Z} \exp\big( -F(x) \big)   with   F(x) = -\log \sum_s \exp\big( -E(x, s) \big)

- F(x) is known as the Free Energy (we can compute it).
- For the binary RBM the Free Energy can be written:

    F(x) = -\sum_j \log\Big( 1 + \exp\big( b_j + \sum_i W_{ij} x_i \big) \Big) - \sum_i c_i x_i

- The Free Energy is convenient to work with because it depends only on x and not on s.
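A minimal sketch (my addition) of the free energy and of the comparison trick from the "Tractable computations" slide: since log p(x) = -F(x) - log Z, the intractable log Z cancels when two points are compared.

```python
import numpy as np

def free_energy(x, W, b, c):
    """F(x) = -sum_j log(1 + exp(b_j + sum_i W_ij x_i)) - sum_i c_i x_i."""
    return -np.sum(np.logaddexp(0.0, b + W.T @ x)) - c @ x

def log_prob_ratio(x1, x2, W, b, c):
    """log p(x1) - log p(x2) = F(x2) - F(x1); tells us which point is more likely."""
    return free_energy(x2, W, b, c) - free_energy(x1, W, b, c)
```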

Binary RBM derivatives

- To train the RBM we need the derivative of the log-probability of the data.
- Let \theta denote some parameter in the model. The derivative of \log p(x^*) for some training point x^* is

    \frac{\partial \log p(x^*)}{\partial \theta} = -\frac{\partial F(x^*)}{\partial \theta} + \frac{1}{Z} \sum_x \exp\big(-F(x)\big) \frac{\partial F(x)}{\partial \theta}
                                                 = -\frac{\partial F(x^*)}{\partial \theta} + \sum_x p(x) \frac{\partial F(x)}{\partial \theta}

- Furthermore, we have

    \frac{\partial F(x^*)}{\partial \theta} = \sum_s \frac{\exp\big(-E(x^*, s)\big)}{\sum_{s'} \exp\big(-E(x^*, s')\big)} \frac{\partial E(x^*, s)}{\partial \theta} = \sum_s p(s|x^*) \frac{\partial E(x^*, s)}{\partial \theta}

- Finally, we have

    \frac{\partial E(x, s)}{\partial W_{ij}} = -x_i s_j, \qquad \frac{\partial E(x, s)}{\partial b_j} = -s_j, \qquad \frac{\partial E(x, s)}{\partial c_i} = -x_i
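Because E[s_j | x*] = p(s_j = 1 | x*), the expectation over p(s|x*) in the first term has a closed form. A sketch (my addition, hypothetical names as before):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def positive_phase_grads(x, W, b, c):
    """-E_{p(s|x)}[dE/dtheta] for a single training point x (the data-driven term)."""
    q = sigmoid(b + W.T @ x)             # E[s_j | x]
    dW = np.outer(x, q)                  # -dE/dW_ij = x_i s_j, averaged over p(s|x)
    db = q                               # -dE/db_j  = s_j
    dc = x                               # -dE/dc_i  = x_i
    return dW, db, dc
```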

Binary RBM derivatives

- Putting everything together yields:

    \frac{\partial \log p(x^*)}{\partial \theta} = -\sum_s p(s|x^*) \frac{\partial E(x^*, s)}{\partial \theta} + \sum_{x,s} p(x, s) \frac{\partial E(x, s)}{\partial \theta}

  with

    \frac{\partial E(x, s)}{\partial W_{ij}} = -x_i s_j, \qquad \frac{\partial E(x, s)}{\partial b_j} = -s_j, \qquad \frac{\partial E(x, s)}{\partial c_i} = -x_i

- This is the sum of two terms: (i) an expectation under p(s|x^*) and (ii) an expectation under p(x, s).
- The second term is, in general, not tractable, so we have to approximate it.

Approximating the expectation

- But the second term is just an expectation under the model.
- Consider the expectation of some function f(z) under some distribution p(z):

    E[f] = \int f(z) p(z) \, dz

- If we have L IID samples z^{(l)} from p(z), we can approximate the expectation using the empirical expectation

    \hat{f} = \frac{1}{L} \sum_l f\big(z^{(l)}\big)

- Note that E[\hat{f}] = E[f].
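A tiny illustration of this Monte Carlo estimate (my addition): with f(z) = z^2 and p(z) a standard normal, the true expectation is 1, and the empirical average approaches it.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 100_000
z = rng.normal(size=L)           # L IID samples z^(l) from p(z) = N(0, 1)
f_hat = np.mean(z ** 2)          # empirical expectation (1/L) sum_l f(z^(l))
print(f_hat)                     # close to the true expectation E[z^2] = 1
```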

Digression: Gibbs Sampling

- Goal: get a sample (a set of points z) from some distribution p(z) = p(z_1, \ldots, z_M).
- Solution: define a Markov chain in z-space. If p(z) is an invariant distribution of the chain, and under some additional technical assumptions ("detailed balance", "ergodicity"), running the Markov chain long enough will yield samples from p(z).
- Gibbs sampling is one way to define such a Markov chain: it works in the case where we can sample from each conditional

    p(z_i | z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_M)

  given all the other variables. (This is the case for the RBM.)
- It amounts to cycling through all variables (randomly or in some specific order) and updating z by sampling the conditionals:

Gibbs Sampling:
- Initialize all z_i.
- For \tau = 1, \ldots, T:
  - Sample z_1^{(\tau+1)} from p(z_1 | z_2^{(\tau)}, z_3^{(\tau)}, \ldots, z_M^{(\tau)})
  - Sample z_2^{(\tau+1)} from p(z_2 | z_1^{(\tau+1)}, z_3^{(\tau)}, \ldots, z_M^{(\tau)})
  - ...
  - Sample z_j^{(\tau+1)} from p(z_j | z_1^{(\tau+1)}, \ldots, z_{j-1}^{(\tau+1)}, z_{j+1}^{(\tau)}, \ldots, z_M^{(\tau)})
  - ...
  - Sample z_M^{(\tau+1)} from p(z_M | z_1^{(\tau+1)}, z_2^{(\tau+1)}, \ldots, z_{M-1}^{(\tau+1)})

  (see e.g. Bishop 2006)

- Recipe: run this for a while, then freeze the process and you will have a sample z from p(z).
- Samples from nearby iterations will be highly correlated, so to get IID samples one needs to discard many intermediate samples. (Or run many chains in parallel.)

- End of Digression -

RBM Gibbs sampling

- For the RBM, because of the conditional independencies, it is easy to sample from p(s|x) and from p(x|s).
- We can sample the whole vector s from p(s|x) and the whole vector x from p(x|s) at once. Some call this "block Gibbs sampling".
- After drawing L samples (x^{(l)}, s^{(l)}) from the joint, we can approximate the second term in the RBM derivative using

    \frac{1}{L} \sum_{l=1}^{L} \frac{\partial E\big(x^{(l)}, s^{(l)}\big)}{\partial \theta}

- For learning, iterate parameter updates and sampling.
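A sketch of this block Gibbs chain for the RBM (my addition, reusing the hypothetical conventions from the earlier snippets):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def block_gibbs(x0, W, b, c, steps, rng):
    """Alternate s ~ p(s|x) and x ~ p(x|s); returns the final (x, s) pair (steps >= 1)."""
    x = x0.copy()
    for _ in range(steps):
        q = sigmoid(b + W.T @ x)                       # p(s_j = 1 | x), all j at once
        s = (rng.random(q.shape) < q).astype(float)
        r = sigmoid(c + W @ s)                         # p(x_i = 1 | s), all i at once
        x = (rng.random(r.shape) < r).astype(float)
    return x, s
```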

RBM learning intuition

- Sampling from the model will produce "fantasy data" according to the model distribution.
- The weight update for the first term in the derivative is called the "positive phase". It is easy and efficient, because it involves only the conditional p(s|x).
- The update for the second term is called the "negative phase", and it involves prolonged sampling to reach the equilibrium joint distribution p(x, s).
- Since the fantasy data enters the update equation with a negative sign, learning has the effect of increasing the probability where the training data is, and reducing it in proportion to where the model thinks it should be.
- This intuition suggests a short-cut to speed up the sampling, consisting of two modifications to standard Gibbs sampling:
  1. Start sampling at the data.
  2. Rather than waiting to reach equilibrium, sample for just a few steps (for example, for one step).
- As we saw before, this has the effect of increasing p(x) at the data and decreasing p(x) everywhere else.

Contrastive divergence training

- This is known as contrastive divergence (CD) learning.
- For one-step CD, the positive phase consists of computing, or sampling from, p(s|x), and the negative phase of sampling from p(x|s).
- The result is a model that is good near the data and potentially bad elsewhere in the data space (but that may not matter!). A sketch of the resulting parameter update follows below.
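A minimal CD-1 update sketch (my addition; hypothetical names, a single binary training vector x, learning rate lr):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(x, W, b, c, lr, rng):
    """One CD-1 step for a single binary training vector x; returns updated parameters."""
    # Positive phase: hidden probabilities given the data.
    q0 = sigmoid(b + W.T @ x)
    s0 = (rng.random(q0.shape) < q0).astype(float)
    # Negative phase: one block-Gibbs step starting at the data ("fantasy" reconstruction).
    r1 = sigmoid(c + W @ s0)
    x1 = (rng.random(r1.shape) < r1).astype(float)
    q1 = sigmoid(b + W.T @ x1)
    # Gradient estimate: data-driven term minus model-driven term.
    W_new = W + lr * (np.outer(x, q0) - np.outer(x1, q1))
    b_new = b + lr * (q0 - q1)
    c_new = c + lr * (x - x1)
    return W_new, b_new, c_new
```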

Encouraging sparsity

- The RBM hiddens are typically not all that sparse, and the filters do not always resemble simple-cell features.
- It is common to encourage hidden variables to be sparse(r) by adding a penalty function to the learning objective, like

    \lambda \sum_{j=1}^{K} \Big( B - \frac{1}{N} \sum_{n=1}^{N} p(s_j = 1 | x^n) \Big)^2

  where B is a small positive number (for example, B = 0.1) and \lambda is a sparsity parameter.
- This will pull hidden-variable activities towards B during training. (A small sketch of the penalty follows below.)
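A small sketch of this penalty and of its gradient with respect to the hidden biases (my addition; W, b are the hypothetical parameters from the earlier snippets, X is an N x D batch of training vectors):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sparsity_penalty(X, W, b, target=0.1, lam=1.0):
    """lambda * sum_j (B - mean_n p(s_j = 1 | x^n))^2 and its gradient w.r.t. b."""
    Q = sigmoid(X @ W + b)                 # (N, K) hidden probabilities
    q_bar = Q.mean(axis=0)                 # mean activity of each hidden unit
    penalty = lam * np.sum((target - q_bar) ** 2)
    # d penalty / d b_j = -2 * lambda * (B - q_bar_j) * mean_n Q_nj (1 - Q_nj)
    grad_b = -2.0 * lam * (target - q_bar) * (Q * (1.0 - Q)).mean(axis=0)
    return penalty, grad_b
```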

Stacking

- After training, we can infer the s for the data and train another model on the inferred s.
- Since the s depend non-linearly on the data, this won't be a vacuous operation. (For a linear model it would be.)
- This is what gave rise to the term "deep learning" in 2006.
- It allows us to train a feature hierarchy greedily, layer by layer.
- It is now common to stack other non-linear models, such as autoencoders.
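To make the layer-by-layer idea concrete, here is a compact, self-contained sketch (my own construction, not the course code): a tiny CD-1 trainer is applied to the data, and then again to the inferred hidden activities. The data and layer sizes are dummy placeholders.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_rbm_cd1(X, K, lr=0.01, epochs=5, seed=0):
    """Fit one binary RBM with CD-1 on the rows of X (greedy layer-wise building block)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = 0.01 * rng.normal(size=(D, K))
    b, c = np.zeros(K), np.zeros(D)
    for _ in range(epochs):
        for x in X:
            q0 = sigmoid(b + W.T @ x)                              # positive phase
            s0 = (rng.random(K) < q0).astype(float)
            x1 = (rng.random(D) < sigmoid(c + W @ s0)).astype(float)
            q1 = sigmoid(b + W.T @ x1)                             # negative phase (one step)
            W += lr * (np.outer(x, q0) - np.outer(x1, q1))
            b += lr * (q0 - q1)
            c += lr * (x - x1)
    return W, b, c

# Dummy binary "data" just to make the sketch runnable end to end.
X = (np.random.default_rng(1).random((500, 20)) < 0.3).astype(float)

W1, b1, c1 = train_rbm_cd1(X, K=16)        # layer 1 on the data
H1 = sigmoid(X @ W1 + b1)                  # non-linear feature responses (could also be sampled)
W2, b2, c2 = train_rbm_cd1(H1, K=8)        # layer 2 on the inferred hiddens
```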
