Pattern Recognition Prof. Christian Bauckhage
outline lecture 22 recap multilayer perceptrons back propagation best practices recurrent neural networks deep learning summary
mathematical neuron

⇔ synaptic summation and (non-linear) activation

y(x) = f(wᵀx) = f(s)

where

x ← (1, x),    w ← (w0, w)

(figure: a neuron with inputs x0 = 1, x1, ..., xm, weights w0, w1, ..., wm, summation s, and activation f(s) producing the output y)
traditional activation functions

linear
    f(s) = s

logistic function
    f(s) = 1 / (1 + e^(−βs))

hyperbolic tangent
    f(s) = (e^(βs) − e^(−βs)) / (e^(βs) + e^(−βs))

(figure: plots of the three functions over s ∈ [−4, 4])
more recent activation functions

rectified linear
    f(s) = max(0, s)

softplus
    f(s) = ln(1 + e^s)

(figure: plots of both functions over s ∈ [−4, 4])
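as a minimal sketch (not part of the original slides; function names are ours), the above activation functions can be written in NumPy as follows, with β the slope parameter of the logistic and tanh definitions

import numpy as np

def linear(s):
    return s

def logistic(s, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * s))

def tanh_act(s, beta=1.0):
    return np.tanh(beta * s)        # (e^(βs) − e^(−βs)) / (e^(βs) + e^(−βs))

def relu(s):
    return np.maximum(0.0, s)

def softplus(s):
    return np.log1p(np.exp(s))      # ln(1 + e^s); log1p improves accuracy near 0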
perceptron learning rule (PLR)

assume labeled training data D = {(x_i, y_i)}_{i=1}^n where

y_i = +1, if x_i ∈ Ω1
y_i = −1, if x_i ∈ Ω2

let f(·) = id(·) and run the following algorithm

initialize w
while not converged
    randomly select x_i
    if y_i · x_iᵀw < 0          // test for mistake
        w = w + y_i · x_i        // update only if mistake

that is, use

Δw = 1/2 (y_i − sign(x_iᵀw)) · x_i
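a minimal NumPy sketch of this algorithm (assumed toy setting, not from the lecture): X is an n × (m+1) matrix of augmented inputs with a leading column of ones, y contains labels in {−1, +1}

import numpy as np

def perceptron_learning_rule(X, y, epochs=100, rng=np.random.default_rng(0)):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for i in rng.permutation(len(X)):      # randomly select x_i
            if y[i] * X[i] @ w <= 0:           # test for mistake (<= 0 so the all-zero start triggers updates)
                w = w + y[i] * X[i]            # update only if mistake
                mistakes += 1
        if mistakes == 0:                      # converged: every point classified correctly
            break
    return w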
Hebbian learning

synaptic weights should be adapted in proportion to correlations between pre- and post-synaptic activity

for example, a perceptron that learns by means of Oja's rule

Δw = η f(wᵀx) x − η f²(wᵀx) w

can identify principal components if f(·) = id(·)
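a minimal sketch of Oja's rule with f(·) = id(·) (assumed centered data matrix X of shape n × m, not from the lecture); w converges towards the first principal component

import numpy as np

def oja(X, eta=0.01, epochs=50, rng=np.random.default_rng(0)):
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            s = w @ X[i]                       # f(wᵀx) = wᵀx
            w += eta * (s * X[i] - s**2 * w)   # Δw = η f(wᵀx) x − η f²(wᵀx) w
    return w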
learning via iterative error minimization

consider a non-linear neuron y(x) = f(wᵀx) = f(s) where, for example,

f(s) = tanh(βs)    and    df(s)/ds = β (1 − f²(s))
gradient descent

we have

E(D, w) = 1/2 Σ_i (y_i − tanh(β wᵀx_i))²

and

∂E/∂w = −β Σ_i (y_i − tanh(β wᵀx_i)) (1 − tanh²(β wᵀx_i)) x_i
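a minimal sketch of batch gradient descent for this single tanh neuron (assumed augmented inputs X of shape n × (m+1) and targets y in [−1, 1]; not from the lecture)

import numpy as np

def train_tanh_neuron(X, y, beta=1.0, eta=0.1, steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        t = np.tanh(beta * (X @ w))                  # predictions tanh(β wᵀx_i)
        grad = -beta * ((y - t) * (1 - t**2)) @ X    # ∂E/∂w from the derivation above
        w -= eta * grad                              # gradient descent step
    return w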
what a perceptron can do

(figures: a perceptron y = f(w0 + w1 x1 + w2 x2) and a linearly separable class configuration in the (x1, x2) plane, split by the hyperplane determined by w and w0)
what a perceptron can't do

(figures: the same perceptron architecture and a class configuration in the (x1, x2) plane that is not linearly separable)
what a multilayer perceptron can do

(figures: a two-layer perceptron with two non-linear hidden units f and the resulting non-linear decision regions in the (x1, x2) plane)
note

in order for this to work / make sense, the activation function f must be non-linear; otherwise, we would just have yet another linear function

y = W2 W1 x = W x

(figure: a two-layer network with inputs x1, x2, x3, hidden activations f, weight matrices W1 and W2, and outputs y1, y2)
Theorem (universal approximation theorem)
a feed forward MLP y(x) with a single layer of finitely many hidden neurons and non-linear, monotonous activation functions can approximate any function g : Rᵐ ⊃ X → R, X compact, up to arbitrary precision, i.e.

max_{x ∈ X} |y(x) − g(x)| < ε

for any ε > 0

Hornik, Stinchcombe, and White, Neural Networks, 2(5), 1989
Cybenko, Mathematics of Control, Signals, and Systems, 2(4), 1989
question

so where is / was the problem?
answer

appropriate parameters W = (W1, W2, ..., WL) need to be determined

usual approach

learn them from labeled data {(x[i], y[i])}_{i=1}^n by minimizing

E(W) = 1/2 Σ_{i=1}^n (y[i] − y(x[i], W))²
back propagation
⇔ recursive estimation of weights in a multilayer perceptron

originally due to Bryson and Ho (1969) (for optimal control of dynamic systems)

repeatedly rediscovered for MLP training, in particular by Werbos (1974) and by Rumelhart, Hinton, and Williams (1986)
back propagation

randomly initialize weights

repeat
    propagate training data through the network
    evaluate overall error E
    update weights using gradient descent

        w_kl^j ← w_kl^j − η ∂E/∂w_kl^j

    with step size η
until E becomes small

(figure: a multilayer network with layers j = 0, 1, 2, ..., L)
back propagation

for stochastic gradient descent, we rewrite

E = 1/2 Σ_i (y[i] − y(x[i], W))² = Σ_i E[i]

and look at the partial derivatives of the E[i] w.r.t. the weights w_kl^j
back propagation

for the last layer L, we have

∂E[i]/∂w_kl^L = ∂E[i]/∂f_l^L · ∂f_l^L/∂s_l^L · ∂s_l^L/∂w_kl^L          (1)

where

∂E[i]/∂f_l^L   = ∂/∂f_l^L [ 1/2 Σ_k (y_k[i] − f_k^L)² ]                (2)
∂f_l^L/∂s_l^L  = ∂/∂s_l^L [ tanh(s_l^L) ]                              (3)
∂s_l^L/∂w_kl^L = ∂/∂w_kl^L [ Σ_k w_kl^L f_k^{L−1} ]                    (4)
back propagation

for (2), (3), and (4), we find, respectively,

∂E[i]/∂f_l^L   = −(y_l[i] − f_l^L)
∂f_l^L/∂s_l^L  = 1 − (f_l^L)²
∂s_l^L/∂w_kl^L = f_k^{L−1}
back propagation

so far, we therefore have

∂E[i]/∂w_kl^L = −(y_l[i] − f_l^L) · (1 − (f_l^L)²) · f_k^{L−1}
              = −δ_l^L · f_k^{L−1}
back propagation

for the last but one layer L − 1, we must consider

∂E[i]/∂w_kl^{L−1} = ∂E[i]/∂f_l^{L−1} · (1 − (f_l^{L−1})²) · f_k^{L−2}

and we note that

∂E[i]/∂f_l^{L−1} = Σ_k ∂E[i]/∂f_k^L · ∂f_k^L/∂f_l^{L−1}

(figure: neurons k − 1, k, k + 1 across layers j − 1, j, j + 1)
back propagation

in general, we therefore have

w_kl^j ← w_kl^j + η · δ_l^j · f_k^{j−1}          j = L, L−1, ..., 1

where

δ_l^L = (y_l[i] − f_l^L) · (1 − (f_l^L)²)

and

δ_l^{j−1} = ( Σ_k δ_k^j w_lk^j ) · (1 − (f_l^{j−1})²)          j = L, L−1, ..., 2
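a minimal sketch of these update rules for a single hidden layer with tanh activations, squared error, and single-sample (stochastic) updates; biases are omitted for brevity and the function name is ours, not from the lecture

import numpy as np

def backprop_step(W1, W2, x, y, eta=0.1):
    # forward pass
    f1 = np.tanh(W1 @ x)               # hidden activations f^1
    f2 = np.tanh(W2 @ f1)              # output activations f^2 = f^L
    # backward pass: deltas as derived above
    d2 = (y - f2) * (1 - f2**2)        # δ^L = (y − f^L)(1 − (f^L)²)
    d1 = (W2.T @ d2) * (1 - f1**2)     # δ^{L−1} = (Σ_k δ_k w) (1 − (f^{L−1})²)
    # weight updates w ← w + η · δ · f_prev
    W2 += eta * np.outer(d2, f1)
    W1 += eta * np.outer(d1, x)
    return W1, W2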
note

this is it ... but naïve BP is a recipe for run-time disasters, slow to converge, and prone to oscillation

⇒ numerous possible improvements

momentum terms
    ΔW(t + 1) = −η dE/dW(t) + μ ΔW(t)

weight decay (L2 regularization)
    ΔW = −η (dE/dW + λ W)
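a minimal sketch of a single update combining both ideas (names eta, mu, lam and the function itself are ours; grad is dE/dW at the current W, dW_prev is the previous update ΔW(t))

def update(W, grad, dW_prev, eta=0.1, mu=0.9, lam=1e-4):
    dW = -eta * (grad + lam * W) + mu * dW_prev   # momentum term plus L2 weight decay
    return W + dW, dW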
note

automatic estimation of the learning rate η
    SuperSAB heuristic, resilient backpropagation (RProp), ...

variants for the computation of gradients
    conjugate gradients, (quasi-)Newton methods, Levenberg-Marquardt methods, ...

different objective functions
    cross entropy, ...
note
⇒ “simple” supervised training of a “simple” feed-forward multilayer perceptron is an art rather than a science ;-)
note

all of the following are from Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient BackProp", in G. B. Orr and K.-R. Müller (eds.), "Neural Networks: Tricks of the Trade", Springer, 1998
stochastic instead of batch learning

note that computing the average gradient

ΔW = −η ∂E/∂W

requires a pass over the whole batch of training data

one can often get faster and better solutions through updates based on single data points or mini-batches

ΔW = −η ∂E[i]/∂W
ΔW = −η ∂E[i:j]/∂W
shuffle the training data

networks learn fastest from "unexpected" examples

therefore, shuffle the training data s.t. successive examples are from different classes, and prefer training examples that produce large errors over those producing small errors
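a minimal sketch combining the two previous slides, i.e. shuffled mini-batch updates ΔW = −η ∂E[i:j]/∂W (assumed helper grad_E(W, Xb, yb) returning dE/dW on a mini-batch; not from the lecture)

import numpy as np

def sgd(W, X, y, grad_E, eta=0.01, batch_size=32, epochs=10,
        rng=np.random.default_rng(0)):
    for _ in range(epochs):
        order = rng.permutation(len(X))              # shuffle the training data
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]    # mini-batch i:j
            W = W - eta * grad_E(W, X[idx], y[idx])  # ΔW = −η ∂E[i:j]/∂W
    return W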
normalize the input

convergence is typically faster if each of the variables (dimensions) in the training data is of zero mean and unit variance
PCA / ZCA whitening

for data {x_i}_{i=1}^n ⊂ Rᵐ, compute

1. x_i ← x_i − μ
2. x_i ← Uᵀ x_i    where C = U Λ Uᵀ is the eigendecomposition of the covariance matrix
3. x_i ← L x_i     where L = Λ^{−1/2}
4. x_i ← U x_i

(steps 1–3 yield PCA whitening; the additional rotation back in step 4 yields ZCA whitening)

(figure: a 2D point cloud after each of the four steps)
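a minimal NumPy sketch of the four steps above (assumed data matrix X of shape n × m; eps guards against division by tiny eigenvalues; not from the lecture)

import numpy as np

def whiten(X, zca=True, eps=1e-8):
    X = X - X.mean(axis=0)               # 1. center:       x_i ← x_i − μ
    C = np.cov(X, rowvar=False)          # covariance C = U Λ Uᵀ
    lam, U = np.linalg.eigh(C)
    X = X @ U                            # 2. rotate:       x_i ← Uᵀ x_i
    X = X / np.sqrt(lam + eps)           # 3. scale:        x_i ← Λ^{−1/2} x_i
    if zca:
        X = X @ U.T                      # 4. rotate back:  x_i ← U x_i  (ZCA)
    return X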
choose the activation function

recall that outputs of one layer are inputs to the next; inputs of zero mean and unit variance are good

the once popular logistic function

f(s) = (1 + e^(−s))^(−1)

does not accomplish this, because ∀ s : f(s) > 0
choose the activation function

the hyperbolic tangent is symmetric about the origin

in particular, the function

f(s) = 1.7159 · tanh(2/3 · s)

has the following properties

    f(±1) = ±1
    the extrema of d²f/ds² occur at s = ±1
    if s is of zero mean and unit variance, then f(s) will be of zero mean and unit variance

(figure: plot of f(s) over s ∈ [−4, 4])
choose the activation function

it may be beneficial to consider

f(s) = tanh(s) + s

to escape from plateaus

(figure: plot of this function over s ∈ [−4, 4])
choose the target values

training will drive outputs as close as possible to targets
⇒ if target values are large, weights will have to become large
⇒ for sigmoidal activation f(s), gradients will become small

therefore, set target values to points where d²f/ds² is maximal

(figure: plots of f(s) = tanh(s), f′(s), and f″(s) over s ∈ [−4, 4])
initialize the weights

recall, large weights will cause sigmoidal activations to saturate (⇔ small gradients, slow learning); likewise for very small weights

therefore, choose weights neither too small nor too large

assuming that the m inputs to a neuron are of zero mean and unit variance, initialize its weights by sampling from a Gaussian with zero mean and standard deviation

σ = 1/√m
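a minimal sketch of this initialization heuristic (function name is ours): zero-mean Gaussian weights with standard deviation 1/√m for a layer with m inputs

import numpy as np

def init_weights(m_inputs, n_neurons, rng=np.random.default_rng(0)):
    return rng.normal(loc=0.0, scale=1.0 / np.sqrt(m_inputs),
                      size=(n_neurons, m_inputs))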
recurrent neural networks

note

feed forward networks are stateless; recurrent networks are stateful

recurrent network

⇔ truly universal function
⇔ dynamical system

    σ_{t+1} = σ_t + F(W σ_t)

⇔ challenging to train
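a minimal sketch of iterating this state update (the choice F = tanh and the absence of external inputs are our assumptions, not from the lecture)

import numpy as np

def run_recurrent(W, sigma0, steps, F=np.tanh):
    sigma = sigma0
    states = [sigma]
    for _ in range(steps):
        sigma = sigma + F(W @ sigma)     # σ_{t+1} = σ_t + F(W σ_t)
        states.append(sigma)
    return np.array(states)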
deep learning

convolutional neural network

neural architecture tailored towards image analysis, popularized by LeCun et al. (1990)

(figure: convolutional network architecture; source: LeCun et al. (1995))
convolutional neural network

OK, the idea is "not new", so why the excitement?

Krizhevsky, Sutskever, and Hinton (2012) revolutionized what can be expected from neural networks

they used a neural network with
    5 convolutional layers (some followed by pooling layers)
    followed by a fully connected MLP of 3 layers
    a total of 650,000 neurons and 60,000,000 parameters

and achieved error rates on the ImageNet benchmark (15,000,000 images from 22,000 categories) that were previously unheard of
convolutional neural network

(figure; source: L. Brown, NVIDIA blog, 2014)
current best practices for deep learning

require weights in each convolutional layer to be shared
use non-saturating activation functions, for instance rectified linear units f(s) = max(0, s)
use dropout during training
use massive amounts of data for training (recall our study of the VC dimension)
train on GPUs
numerous other breakthroughs

speech / text understanding
speech / text translation
genome analysis
game AI
...

things are getting crazier by the day ...

Google's TensorFlow, Dec 2015
Microsoft's CNTK, Jan 2016
other recent breakthroughs

variational autoencoders (VAEs)
generative adversarial networks (GANs)
further neural architectures

models not based on perceptrons

echo state networks

radial basis networks

    y(x) = Σ_i α_i exp(−‖w_i − x‖² / (2σ²))

self-organizing maps, neural gases, ...
associative memories, Hopfield networks, ...
(restricted) Boltzmann machines, deep belief networks, ...
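a minimal sketch of evaluating the radial basis network formula above (assumed: centers w_i stored as rows of W, weights alpha, shared width sigma; not from the lecture)

import numpy as np

def rbf_net(x, W, alpha, sigma=1.0):
    d2 = np.sum((W - x)**2, axis=1)                 # ‖w_i − x‖² for all centers
    return alpha @ np.exp(-d2 / (2.0 * sigma**2))   # y(x) = Σ_i α_i exp(−‖w_i − x‖² / (2σ²))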
neural networks are not that special

observe

a deep net computes a function

y = f( W^L f( ... f( W² f( W¹ x ) ) ) )
  = f( W^L φ(x) )

⇔ it is just another kernel machine

(figure: a deep network with weight matrices W¹, W², ..., W^L; the layers up to the last one compute the feature map φ(x))
summary
we now know about

multilayer perceptrons and the back propagation algorithm
best practices for training (deep) feed forward networks
the need to learn more about all this ;-)