RBMs and autoencoders
Hopfield network
• Content-addressable memory
• The goal is to memorize the training dataset
• Core idea:
  – Training: store some patterns in the network (set the weights accordingly)
  – Inference: show the net a corrupted pattern; the net reconstructs the original one.
Examples: stored patterns
Operation – pattern reconstruction
Anatomy of a Hopfield net
• No self-loops: $w_{ii} = 0$
• Symmetric, bidirectional weights: $w_{ij} = w_{ji}$

Inference: let $x$ be the state of all neurons.
1. While not converged:
   1. Pick a neuron $k$ at random
   2. Set $x_k = \mathrm{sign}(W_{k:}\, x)$
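A minimal numpy sketch of this inference loop (the function name hopfield_inference and the stop-when-nothing-flips rule are my own choices; the slide's "pick a neuron at random" is realised here as random-order sweeps):

```python
import numpy as np

def hopfield_inference(W, x, max_sweeps=100, rng=None):
    """Asynchronous updates: set x_k = sign(W[k, :] @ x) for one neuron at a time,
    in random order, until a full sweep leaves the state unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    x = x.copy()
    for _ in range(max_sweeps):
        changed = False
        for k in rng.permutation(len(x)):        # visit neurons in random order
            new = 1 if W[k] @ x >= 0 else -1     # sign of the local field
            changed |= (new != x[k])
            x[k] = new
        if not changed:                          # no neuron wants to flip: converged
            break
    return x
```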
Training: idea: "fire together, wire together"
$W = X X^T$, with $W_{ii} = 0$, where $X = [x^{(1)}, \dots, x^{(N)}]$.
Note: can also train via SGD on energies!
Core concept: energy!
Each state $x$ of the net is associated with a scalar value called the energy:
$E(x) = -\tfrac{1}{2}\, x^T W x$
Stored patterns correspond to the states with the lowest energy; inference searches for them!
Also note that inference has to stop, because the energy is bounded from below and monotonically decreasing during inference.
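A sketch of the Hebbian training rule and the energy, under the same conventions (patterns are rows of ±1; hopfield_train and energy are illustrative names):

```python
import numpy as np

def hopfield_train(patterns):
    """Hebbian rule W = X X^T with a zero diagonal; `patterns` holds one +/-1 pattern per row."""
    X = np.asarray(patterns, dtype=float)
    W = X.T @ X                    # "fire together, wire together"
    np.fill_diagonal(W, 0)         # no self-loops
    return W

def energy(W, x):
    """E(x) = -1/2 x^T W x; stored patterns sit in the lowest-energy states."""
    return -0.5 * x @ W @ x
```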
A few things to remember
• What is the energy function?
• Why do we want symmetric weights?
• Why must we update neurons one by one?
• If in doubt: see MacKay (http://www.inference.phy.cam.ac.uk/itprnn/book.html), chapters 42 & 43!
How many patterns can we store?
• The capacity of a Hopfield net is limited.
  – Two nearby energy minima can merge to create a new one.
• There are attractors that are not training patterns – a linear combination of patterns is likely to be an attractor too
• How to remove spurious attractors?
  – Unlearn them! Start with a random input. Run inference. Then change the weights to make the resulting attractor more energetic.
  – In a way, the network dreams of patterns and forgets them…
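A rough sketch of such an unlearning pass, reusing hopfield_inference from the earlier sketch (the step size and the number of "dreams" are arbitrary illustrative choices):

```python
import numpy as np

def unlearn(W, n_dreams=10, eps=0.01, rng=None):
    """Let the net 'dream' from random states and raise the energy of whatever it settles into."""
    rng = np.random.default_rng() if rng is None else rng
    n = W.shape[0]
    for _ in range(n_dreams):
        x0 = rng.choice([-1, 1], size=n)          # random starting state
        x = hopfield_inference(W, x0, rng=rng)    # settle into some (possibly spurious) attractor
        W = W - eps * np.outer(x, x)              # anti-Hebbian update: make that attractor more energetic
        np.fill_diagonal(W, 0)
    return W
```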
Input-output for a Hopfield net
• Divide the neurons into two sets, visible and hidden
  (diagram: visible units and connections, hidden units and connections, visible-to-hidden connections)
• Set the visible units to a pattern. Set the hidden units randomly.
• Run inference only on the hidden units.
• In the end, the state of the hidden units will be an explanation for the visible ones!
• Q: how to train the hidden weights? Q: how to escape poor minima during inference?
Probabilistic Hopfield net (aka Boltzmann Machine)
• Change the inference rule:
  – From: set $x_k = \mathrm{sign}(W_{k:}\, x)$
  – To: sample $x_k = +1$ with probability $\sigma(2\beta W_{k:}\, x)$, and $x_k = -1$ with probability $1 - \sigma(2\beta W_{k:}\, x)$
• Thus inference is a random walk through neuron activations.
• Still, some configurations are more probable than others. After a very long time (equilibrium), the probability of a configuration is:
$P(x \mid W) = \frac{e^{-\beta E(x)}}{Z(W)} = \frac{e^{\frac{\beta}{2} x^T W x}}{Z(W)}$
where $E(x) = -\tfrac{1}{2}\, x^T W x$ is the energy, $\beta$ is the inverse temperature (it smooths out the probabilities), and $Z(W)$ is the normalization constant, also called the partition function.
Boltzmann machine example
• What is the conditional probability of a neuron being 1, given all other neurons?
$P(x_k = 1 \mid x_{\neg k}, W) = \frac{e^{-\beta E(x \mid x_k = 1)}}{e^{-\beta E(x \mid x_k = 1)} + e^{-\beta E(x \mid x_k = -1)}} = \frac{1}{1 + e^{-2\beta W_{k:}\, x}} = \sigma(2\beta W_{k:}\, x)$
• The energy function properly defines the inference procedure (it is technically called Gibbs sampling).
• We now see that inference is just sampling one neuron conditioned on the other ones, until we're bored!
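A sketch of one such Gibbs-sampling step (function names are mine; since $w_{kk} = 0$, the dot product $W_{k:}\, x$ only depends on the other neurons):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gibbs_step(W, x, beta=1.0, rng=None):
    """Resample one random neuron from P(x_k = 1 | rest) = sigma(2 * beta * W[k, :] @ x)."""
    rng = np.random.default_rng() if rng is None else rng
    k = rng.integers(len(x))
    p_on = sigmoid(2.0 * beta * (W[k] @ x))
    x[k] = 1 if rng.random() < p_on else -1
    return x
```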
How to train a Boltzmann machine
• The Boltzmann machine gives probabilities.
• Train by maximizing the log-likelihood!
• After a few transformations we get (that's a nice exercise):
$\frac{\partial}{\partial w_{ij}} \ln \prod_{n=1}^{N} P(x^{(n)} \mid W) = \sum_{n=1}^{N} \left( x_i^{(n)} x_j^{(n)} - \mathbb{E}_{P(x \mid W)}[x_i x_j] \right) = N \left( \mathbb{E}_{x \sim \mathrm{Data}}[x_i x_j] - \mathbb{E}_{P(x \mid W)}[x_i x_j] \right)$
• The gradient has two terms:
  – The Hebbian term $\mathbb{E}_{x \sim \mathrm{Data}}[x_i x_j]$, which moves the weights towards correlations in the data. This step "pulls down" the energy of data samples!
  – The un-learning term $\mathbb{E}_{P(x \mid W)}[x_i x_j]$, which moves the weights away from correlations when the net is "dreaming". This step "pulls up" the energy of spurious attractors!
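A minimal sketch of this gradient, assuming you already have data states and equilibrium ("dream") samples as rows of ±1 matrices (the function name and the Monte Carlo estimate of the model expectation are my own framing):

```python
import numpy as np

def boltzmann_gradient(data, dream_samples):
    """d/dW of the mean log-likelihood: E_data[x x^T] - E_model[x x^T]."""
    data_corr = data.T @ data / len(data)                              # Hebbian term (awake)
    dream_corr = dream_samples.T @ dream_samples / len(dream_samples)  # un-learning term (dreaming)
    grad = data_corr - dream_corr
    np.fill_diagonal(grad, 0)                                          # keep w_ii = 0
    return grad
```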
What about hidden units?
$\frac{\partial}{\partial w_{ij}} \ln \prod_{n=1}^{N} P(x^{(n)} \mid W) = N \left( \mathbb{E}_{x \sim \mathrm{Data}}[x_i x_j] - \mathbb{E}_{P(x \mid W)}[x_i x_j] \right)$
Now we know how to learn the hidden weights:
• To compute $\mathbb{E}_{x \sim \mathrm{Data}}[x_i x_j]$, set the visible units to a data pattern, run a random walk over the hidden units, and observe the correlations.
• To compute $\mathbb{E}_{P(x \mid W)}[x_i x_j]$, run a random walk over all units and observe the correlations.
This is a very nice learning rule: a unit only looks at what its neighbors are doing (no weird backprop) and either strengthens the connection to mimic its neighbors or, during un-learning, weakens it. That's sort of a two-phase awake/asleep operation.
NB: the name wake-sleep algorithm is used for a similar model, the Helmholtz machine
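A sketch of the clamped ("awake") phase with hidden units (the function name and parameters are illustrative; x_clamped is the full state vector with its visible entries already set to the data pattern):

```python
import numpy as np

def clamped_correlations(W, x_clamped, hidden_idx, beta=1.0, n_steps=500, rng=None):
    """Positive phase: keep the visibles fixed, random-walk only over the hidden units,
    then record x_i x_j. The negative phase is the same loop over *all* units."""
    rng = np.random.default_rng() if rng is None else rng
    x = x_clamped.copy()
    x[hidden_idx] = rng.choice([-1, 1], size=len(hidden_idx))   # random initial explanation
    for _ in range(n_steps):
        k = rng.choice(hidden_idx)                              # only hidden units get resampled
        p_on = 1.0 / (1.0 + np.exp(-2.0 * beta * (W[k] @ x)))
        x[k] = 1 if rng.random() < p_on else -1
    return np.outer(x, x)                                       # one sample of the correlations
```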
Restricted Boltzmann Machine
(diagram: a visible layer and a hidden layer, with connections only between the two layers)
• Boltzmann learning with hidden units is slow, because each gradient evaluation requires two long Gibbs samplings (random walks).
• Idea: forbid visible-to-visible and hidden-to-hidden connections!
• Notice how:
  – the hiddens are independent conditioned on the visibles,
  – the visibles are independent conditioned on the hiddens.
• Only need to run one Gibbs sampling (and it can be truncated to just a few steps) -> fast training!
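A minimal sketch of this truncated scheme as one contrastive-divergence (CD-1) update for a binary 0/1 RBM (biases are omitted for brevity; the function name and learning rate are illustrative):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def cd1_update(W, v_data, lr=0.01, rng=None):
    """One CD-1 step; W has shape (num_visible, num_hidden), v_data one 0/1 pattern per row."""
    rng = np.random.default_rng() if rng is None else rng
    # positive phase: hiddens driven by the data (independent given the visibles)
    h_prob = sigmoid(v_data @ W)
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
    # one step of blocked Gibbs sampling: reconstruct the visibles, re-infer the hiddens
    v_recon = sigmoid(h_samp @ W.T)
    h_recon = sigmoid(v_recon @ W)
    # gradient: data correlations minus (truncated) model correlations
    W += lr * (v_data.T @ h_prob - v_recon.T @ h_recon) / len(v_data)
    return W
```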
RBM stacking: The coolest idea of 2006
Article: https://www.cs.toronto.edu/~hinton/science.pdf Demo: http://www.cs.toronto.edu/~hinton/adi/index.htm
Intermezzo: autoencoding net
• Idea: train a net that pushes the data through a bottleneck (e.g. a small layer, a sparse layer…)
• Prevent it from learning the identity: sparsity, noise, other constraints.
• Use the code for further tasks…
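A minimal numpy sketch of such a bottleneck autoencoder with tied weights, trained on the squared reconstruction error (the architecture, names, and hyperparameters are illustrative choices, not a prescribed recipe):

```python
import numpy as np

def train_autoencoder(X, code_dim=32, lr=0.1, epochs=100, rng=None):
    """code = sigmoid(X W) is the bottleneck; recon = code W^T is a tied-weight linear decoder."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    W = 0.01 * rng.standard_normal((d, code_dim))
    for _ in range(epochs):
        code = 1.0 / (1.0 + np.exp(-X @ W))          # encoder
        recon = code @ W.T                           # decoder (tied weights)
        err = recon - X                              # reconstruction error
        d_code = (err @ W) * code * (1 - code)       # backprop through the encoder nonlinearity
        grad_W = (err.T @ code + X.T @ d_code) / n   # two terms because the weights are tied
        W -= lr * grad_W
    return W
```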
The Deep Learning revolution (2006-today)
• Learn a hierarchy of features by greedily pretraining RBMs or autoencoders!
• Combine them to initialize a deep net.
• Fine-tune via backprop.
• Core idea: discover a hierarchy of transformations, yielding more and more abstract features!
• Note: with the advances in training deep nets, this layerwise pre-training is no longer necessary to train deep nets. But it is good to know about!
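A sketch of the greedy layer-wise scheme, reusing train_autoencoder from the previous sketch (the stacking wrapper is my own illustrative framing):

```python
import numpy as np

def greedy_pretrain(X, layer_dims, **train_kwargs):
    """Train one autoencoder per layer, feed its codes to the next layer,
    and keep the encoder weights to initialize a deep net (then fine-tune by backprop)."""
    weights, codes = [], X
    for dim in layer_dims:
        W = train_autoencoder(codes, code_dim=dim, **train_kwargs)
        codes = 1.0 / (1.0 + np.exp(-codes @ W))   # more abstract features for the next layer
        weights.append(W)
    return weights
```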
VAE: Variational auto-encoder
http://arxiv.org/abs/1312.6114
• Simplified idea:
  – Learn a generator network (from some noise input, generate a data sample)
  – Learn an approximate inference network (for a sample, learn the noise input that would recreate it)
• Encoder: $q(z \mid x) = \mathcal{N}(\mu, \sigma)$
• Decoder: $p(z)$ and $p(x \mid z)$
(diagram: noise is injected between the encoder and the decoder)
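A sketch of the core VAE computation (the reparameterisation of the injected noise plus the two loss terms), assuming placeholder encoder/decoder callables and a Gaussian output model; this is an illustration of the idea, not the paper's exact setup:

```python
import numpy as np

def vae_loss(x, encoder, decoder, rng=None):
    """encoder(x) -> (mu, log_var) of q(z|x); decoder(z) -> reconstruction of x."""
    rng = np.random.default_rng() if rng is None else rng
    mu, log_var = encoder(x)
    eps = rng.standard_normal(mu.shape)                 # the injected noise
    z = mu + np.exp(0.5 * log_var) * eps                # reparameterisation: z ~ q(z|x)
    x_recon = decoder(z)
    recon = np.sum((x - x_recon) ** 2)                  # reconstruction term (Gaussian p(x|z), up to constants)
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))  # KL(q(z|x) || N(0, I))
    return recon + kl
```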
GAN: Generative Adversarial Net – an autoencoder in reverse
http://arxiv.org/abs/1406.2661 http://www.cs.toronto.edu/~dtarlow/pos14/talks/goodfellow.pdf
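A sketch of the adversarial objectives, with placeholder generator/discriminator callables (the non-saturating generator loss is a common variant discussed in the paper; all names here are illustrative):

```python
import numpy as np

def gan_losses(x_real, z, generator, discriminator):
    """discriminator(.) returns P(input is real); generator(z) maps noise to samples."""
    x_fake = generator(z)
    d_real = discriminator(x_real)
    d_fake = discriminator(x_fake)
    d_loss = -np.mean(np.log(d_real) + np.log(1.0 - d_fake))  # discriminator: spot the fakes
    g_loss = -np.mean(np.log(d_fake))                         # generator: fool the discriminator
    return d_loss, g_loss
```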
VAE + GAN
Ladder network
Problem with autoencoders:
• We want a hierarchy of more and more abstract features.
• But we also want perfect reconstruction.
• Thus all the information has to be carried all the way! Hard to abstract when you must remember!
• Introduce "horizontal" connections that capture the details
• Inject noise
• And done!
http://arxiv.org/abs/1507.02672