Lecture 10: Recurrent Neural Networks
Fei-Fei Li & Justin Johnson & Serena Yeung
May 4, 2017
Administrative
- A1 grades will go out soon
- A2 is due today (11:59pm)
- Midterm is in-class on Tuesday! We will send out details on where to go soon
Extra Credit: Train Game
More details on Piazza by early next week
Last Time: CNN Architectures
[Figure: AlexNet architecture. Figure copyright Kaiming He, 2016. Reproduced with permission.]
Last Time: CNN Architectures
[Figure: VGG16, VGG19, and GoogLeNet architectures (stacks of 3x3 conv and pool layers topped by FC 4096, FC 1000, and softmax). Figure copyright Kaiming He, 2016. Reproduced with permission.]
Last Time: CNN Architectures
[Figure: ResNet architecture (stacked 3x3 conv layers) and the residual block: F(x) + x, where conv/relu layers compute F(x) and an identity shortcut carries x around them. Figure copyright Kaiming He, 2016. Reproduced with permission.]
Last Time: CNN Architectures
[Figure: DenseNet (dense blocks in which each conv layer is concatenated with the outputs of earlier layers) and FractalNet architectures. Figures copyright Larsson et al., 2017. Reproduced with permission.]
Last Time: CNN Architectures
AlexNet and VGG have tons of parameters in the fully connected layers.
AlexNet: ~62M parameters total
- FC6: 256x6x6 -> 4096: 38M params
- FC7: 4096 -> 4096: 17M params
- FC8: 4096 -> 1000: 4M params
That is ~59M params in the FC layers alone!
Today: Recurrent Neural Networks
“Vanilla” Neural Networks
[Figure: a one-to-one feedforward network: fixed-size input, fixed-size output]
Recurrent Neural Networks: Process Sequences
- one to many: e.g. image captioning (image -> sequence of words)
- many to one: e.g. sentiment classification (sequence of words -> sentiment)
- many to many: e.g. machine translation (sequence of words -> sequence of words)
- many to many: e.g. video classification on frame level
[Figure: diagrams of the one-to-many, many-to-one, and many-to-many input/output patterns]
Sequential Processing of Non-Sequence Data
Classify images by taking a series of “glimpses”
Ba, Mnih, and Kavukcuoglu, “Multiple Object Recognition with Visual Attention”, ICLR 2015
Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015
Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission.
Sequential Processing of Non-Sequence Data
Generate images one piece at a time!
Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015
Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission.
Recurrent Neural Network
[Figure: an RNN block with input x; the block has a recurrent connection back to itself]
Recurrent Neural Network
We usually want to predict a vector y at some time steps.
[Figure: RNN block with input x, internal recurrent state, and output y]
Recurrent Neural Network
We can process a sequence of vectors x by applying a recurrence formula at every time step:

    ht = fW(ht-1, xt)

where ht is the new state, ht-1 is the old state, xt is the input vector at some time step, and fW is some function with parameters W.
Recurrent Neural Network
We can process a sequence of vectors x by applying the recurrence ht = fW(ht-1, xt) at every time step.
Notice: the same function and the same set of parameters are used at every time step.
(Vanilla) Recurrent Neural Network
The state consists of a single “hidden” vector h:

    ht = tanh(Whh * ht-1 + Wxh * xt)
    yt = Why * ht
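A minimal numpy sketch of this vanilla RNN step (the sizes, random weights, and names below are illustrative assumptions, not the lecture's code):

```python
import numpy as np

def rnn_step(h_prev, x, Whh, Wxh, Why, bh, by):
    # ht = tanh(Whh * ht-1 + Wxh * xt);  yt = Why * ht
    h = np.tanh(Whh @ h_prev + Wxh @ x + bh)   # new hidden state
    y = Why @ h + by                           # output scores at this time step
    return h, y

H, D, V = 4, 3, 5                        # hidden, input, output sizes (illustrative)
rng = np.random.default_rng(0)
Whh = rng.standard_normal((H, H)) * 0.1
Wxh = rng.standard_normal((H, D)) * 0.1
Why = rng.standard_normal((V, H)) * 0.1
bh, by = np.zeros(H), np.zeros(V)

h = np.zeros(H)                          # initial hidden state h0
for x in rng.standard_normal((6, D)):    # a sequence of 6 input vectors
    h, y = rnn_step(h, x, Whh, Wxh, Why, bh, by)   # same weights reused at every step
```

The key point mirrors the slides: the same Whh, Wxh, Why are reused at every time step.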
RNN: Computational Graph
Unroll the recurrence over time: h1 = fW(h0, x1), h2 = fW(h1, x2), h3 = fW(h2, x3), ..., up to hT.
Re-use the same weight matrix W at every time-step.
[Figure: unrolled computational graph with initial state h0, inputs x1, x2, x3, ..., hidden states h1 ... hT, and the shared weights W feeding every fW node]
RNN: Computational Graph: Many to Many
Each hidden state ht also produces an output yt. Each output yt gets its own loss Lt, and the total loss L is the sum of the per-step losses L1 ... LT.
[Figure: unrolled many-to-many graph with outputs y1 ... yT, per-step losses L1 ... LT, and the total loss L]
RNN: Computational Graph: Many to One
A single output y is produced from the final hidden state hT.
[Figure: unrolled many-to-one graph: inputs x1, x2, x3, ..., hidden states h0 ... hT, one output y at the end]
RNN: Computational Graph: One to Many
A single input x initializes the recurrence; an output yt is produced at every time step.
[Figure: unrolled one-to-many graph: one input x, hidden states h1 ... hT, outputs y1 ... yT]
Sequence to Sequence: Many-to-one + one-to-many
Many to one: encode the input sequence in a single vector.
[Figure: encoder RNN with weights W1 consuming x1, x2, x3, ... and producing hidden states h1 ... hT]
One to many: produce the output sequence from that single input vector.
[Figure: the encoder (weights W1) hands its final state hT to a decoder RNN (weights W2) that emits y1, y2, ...]
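A minimal numpy sketch of this encoder/decoder split, with separate weight sets W1 and W2 as in the figure; the decoder here takes no input feedback, which is a simplification of real sequence-to-sequence models:

```python
import numpy as np

def encode(xs, h0, W1_hh, W1_xh):
    # Many to one: fold the whole input sequence into the final hidden state.
    h = h0
    for x in xs:
        h = np.tanh(W1_hh @ h + W1_xh @ x)
    return h

def decode(h_enc, steps, W2_hh, W2_hy):
    # One to many: unroll a second RNN from the encoded vector, one output per step.
    h, ys = h_enc, []
    for _ in range(steps):
        h = np.tanh(W2_hh @ h)   # no external input in this simplified decoder
        ys.append(W2_hy @ h)     # per-step output scores
    return ys

H, D, V = 8, 5, 10
rng = np.random.default_rng(1)
W1_hh, W1_xh = rng.standard_normal((H, H)), rng.standard_normal((H, D))
W2_hh, W2_hy = rng.standard_normal((H, H)), rng.standard_normal((V, H))
xs = rng.standard_normal((4, D))                          # input sequence of length 4
ys = decode(encode(xs, np.zeros(H), W1_hh, W1_xh), 3, W2_hh, W2_hy)
```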
Example: Character-level Language Model
Vocabulary: [h, e, l, o]
Example training sequence: “hello”
[Figure: training on “hello”: each input character is one-hot encoded over the 4-character vocabulary, fed through the hidden layer, and the output layer is trained to predict the next character at every step]
Example: Character-level Language Model Sampling
Vocabulary: [h, e, l, o]
At test time, sample characters one at a time and feed them back into the model.
[Figure: starting from the input “h”, the softmax at each step (e.g. .03/.13/.00/.84 over h/e/l/o) is sampled to produce “e”, then “l”, “l”, “o”; each sampled character becomes the next input]
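A minimal numpy sketch of this test-time sampling loop over the [h, e, l, o] vocabulary (the weights are untrained and random here, so the samples are meaningless; only the structure matters):

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(vocab)}

H, V = 16, len(vocab)
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((H, V)) * 0.01
Whh = rng.standard_normal((H, H)) * 0.01
Why = rng.standard_normal((V, H)) * 0.01

def sample(seed_char, n_chars):
    # Sample characters one at a time, feeding each sample back in as the next input.
    x = np.zeros(V); x[char_to_ix[seed_char]] = 1   # one-hot encoding of the seed character
    h, out = np.zeros(H), [seed_char]
    for _ in range(n_chars):
        h = np.tanh(Whh @ h + Wxh @ x)              # vanilla RNN step
        scores = Why @ h
        p = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
        ix = rng.choice(V, p=p)                     # sample the next character
        x = np.zeros(V); x[ix] = 1                  # feed the sample back in
        out.append(vocab[ix])
    return ''.join(out)

print(sample('h', 10))   # gibberish for these random weights
```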
Backpropagation through time
Forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.
[Figure: unrolled RNN over a long sequence with the loss computed over the whole sequence]
Truncated Backpropagation through time
Run forward and backward through chunks of the sequence instead of the whole sequence: carry hidden states forward in time forever, but only backpropagate for some smaller number of steps.
[Figure: the sequence split into chunks; the loss and backward pass are confined to each chunk while the hidden state flows across chunk boundaries]
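A sketch of the chunking loop behind truncated backpropagation through time; the RNN step and the per-chunk loss/backward pass are left as assumed callbacks rather than lecture code:

```python
import numpy as np

def run_truncated_bptt(xs, h0, step_fn, loss_backward_fn, chunk=25):
    """Process a long sequence in chunks: carry the hidden state forward across
    chunk boundaries, but run the backward pass only within each chunk."""
    h = h0
    total_loss = 0.0
    for start in range(0, len(xs), chunk):
        hs = []
        for x in xs[start:start + chunk]:       # forward through one chunk
            h = step_fn(h, x)
            hs.append(h)
        total_loss += loss_backward_fn(hs)      # backward through this chunk only
        # h is carried (as a value) into the next chunk; gradients never cross the boundary
    return total_loss

# Tiny demo with placeholder callbacks (assumed, not lecture code):
H, D = 8, 3
rng = np.random.default_rng(0)
Whh, Wxh = rng.standard_normal((H, H)) * 0.1, rng.standard_normal((H, D)) * 0.1
step = lambda h, x: np.tanh(Whh @ h + Wxh @ x)
loss = lambda hs: float(sum((h ** 2).sum() for h in hs))   # dummy per-chunk "loss"
print(run_truncated_bptt(rng.standard_normal((100, D)), np.zeros(H), step, loss, chunk=25))
```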
min-char-rnn.py gist: 112 lines of Python
(https://gist.github.com/karpathy/d4dee566867f8291f086)
At first the samples are a random jumble; train more, train more, train more, and they look increasingly like real text.
[Figure: text samples generated by the character-level RNN at successive stages of training]
The Stacks Project: open source algebraic geometry textbook
LaTeX source: http://stacks.math.columbia.edu/
The Stacks Project is licensed under the GNU Free Documentation License.
[Figure: samples of algebraic-geometry-style LaTeX generated by the character-level RNN trained on the Stacks Project source]
Generated C code
[Figure: samples of C source code generated by the character-level RNN]
Searching for interpretable cells
Looking at individual hidden cells reveals some that are interpretable:
- quote detection cell
- line length tracking cell
- if statement cell
- quote/comment cell
- code depth cell
[Figures: text colored by the activation of each cell]
Karpathy, Johnson, and Fei-Fei, “Visualizing and Understanding Recurrent Networks”, ICLR Workshop 2016. Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission.
Image Captioning
[Figure from Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015; figure copyright IEEE, 2015. Reproduced for educational purposes.]
- Mao et al, “Explain Images with Multimodal Recurrent Neural Networks”
- Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”
- Vinyals et al, “Show and Tell: A Neural Image Caption Generator”
- Donahue et al, “Long-term Recurrent Convolutional Networks for Visual Recognition and Description”
- Chen and Zitnick, “Learning a Recurrent Visual Representation for Image Caption Generation”
[Figure: a Convolutional Neural Network processes the image, and a Recurrent Neural Network generates the caption from its features]
Test-time walkthrough (test image is CC0 public domain):
1. Run the test image through the CNN and take its feature vector v (the classification layers at the top are not needed).
2. Feed a start token x0 into the RNN. The recurrence is modified to condition on the image:
   before: h = tanh(Wxh * x + Whh * h)
   now:    h = tanh(Wxh * x + Whh * h + Wih * v)
3. Sample a word from the output distribution y0 (e.g. “straw”), feed it back in as the next input, and sample again (e.g. “hat”), and so on.
4. Sampling the end token finishes the caption.
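A minimal numpy sketch of the modified recurrence above, in which the image feature v enters the hidden-state update at every step (the shapes, random weights, and greedy word choice are illustrative assumptions):

```python
import numpy as np

def caption_step(h, x, v, Wxh, Whh, Wih, Why):
    # now: h = tanh(Wxh * x + Whh * h + Wih * v)  -- the image feature v is added in
    h = np.tanh(Wxh @ x + Whh @ h + Wih @ v)
    return h, Why @ h            # new state and scores over the word vocabulary

H, D, F, V = 8, 6, 10, 5         # hidden, word-embedding, image-feature, vocab sizes (assumed)
rng = np.random.default_rng(0)
Wxh, Whh = rng.standard_normal((H, D)), rng.standard_normal((H, H))
Wih, Why = rng.standard_normal((H, F)), rng.standard_normal((V, H))

v = rng.standard_normal(F)       # image feature vector from the CNN
h = np.zeros(H)                  # initial hidden state h0
x = rng.standard_normal(D)       # embedding of the start token x0 (assumed)
caption = []
for _ in range(10):              # a real model stops when the end token is sampled
    h, scores = caption_step(h, x, v, Wxh, Whh, Wih, Why)
    ix = int(np.argmax(scores))  # greedy choice here; the lecture samples instead
    caption.append(ix)
    x = rng.standard_normal(D)   # stand-in for the embedding of the chosen word
```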
Image Captioning: Example Results
Captions generated using neuraltalk2. All images are CC0 public domain: cat suitcase, cat tree, dog, bear, surfers, tennis, giraffe, motorcycle.
- A cat sitting on a suitcase on the floor
- A cat is sitting on a tree branch
- A dog is running in the grass with a frisbee
- A white teddy bear sitting in the grass
- Two people walking on the beach with surfboards
- A tennis player in action on the court
- Two giraffes standing in a grassy field
- A man riding a dirt bike on a dirt track
Image Captioning: Failure Cases
Captions generated using neuraltalk2. All images are CC0 public domain: fur coat, handstand, spider web, baseball.
- A woman is holding a cat in her hand
- A woman standing on a beach holding a surfboard
- A person holding a computer mouse on a desk
- A bird is perched on a tree branch
- A man in a baseball uniform throwing a ball
Image Captioning with Attention
The RNN focuses its attention at a different spatial location when generating each word.
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.
Image Captioning with Attention
1. The CNN turns the image (H x W x 3) into a grid of features (L x D): one D-dimensional feature vector for each of L spatial locations.
2. From the hidden state h0, compute a distribution a1 over the L locations; the corresponding weighted combination of the features gives a single D-dimensional attended vector z1.
3. Feed z1, together with the first word y1, into the RNN to get h1, which produces both a distribution d1 over the vocabulary and a new attention distribution a2 over the L locations.
4. Repeat: each step consumes the current attended feature vector and word, and emits the next word distribution and the next attention distribution.
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
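A minimal numpy sketch of the soft attention step described above; the paper uses a small attention network to score locations, whereas this sketch assumes a single scoring matrix Watt for brevity:

```python
import numpy as np

def soft_attention(h, features, Watt):
    """Score each of the L locations from the hidden state, softmax into a
    distribution a over locations, and return the weighted combination z
    of the L x D features."""
    scores = features @ (Watt @ h)          # (L,) one score per location
    a = np.exp(scores - scores.max())
    a = a / a.sum()                         # attention distribution over L locations
    z = a @ features                        # (D,) weighted combination of features
    return a, z

L, D, H = 49, 512, 256                      # e.g. a 7x7 grid of 512-d features (assumed)
rng = np.random.default_rng(0)
features = rng.standard_normal((L, D))      # CNN feature grid, flattened to L x D
Watt = rng.standard_normal((D, H)) * 0.01   # assumed attention scoring matrix
h = rng.standard_normal(H)                  # current RNN hidden state
a, z = soft_attention(h, features, Watt)    # z is fed into the next RNN step along with the word
```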
Image Captioning with Attention
Soft attention: a smooth weighted combination over all locations. Hard attention: attend to a single location at each time step.
[Figure: attention maps for soft vs. hard attention]
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.
Image Captioning with Attention
[Figure: example generated captions with the attention maps for each word]
Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.
Visual Question Answering
Agrawal et al, “VQA: Visual Question Answering”, ICCV 2015
Zhu et al, “Visual 7W: Grounded Question Answering in Images”, CVPR 2016
Figure from Zhu et al, copyright IEEE 2016. Reproduced for educational purposes.
Visual Question Answering: RNNs with Attention
Zhu et al, “Visual 7W: Grounded Question Answering in Images”, CVPR 2016
Figures from Zhu et al, copyright IEEE 2016. Reproduced for educational purposes.
Multilayer RNNs
Stack RNN (e.g. LSTM) layers: the sequence of hidden states from one layer becomes the input sequence to the layer above, so the network extends in depth as well as in time.
[Figure: a multilayer RNN unrolled over both depth and time]
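A minimal sketch of stacking vanilla RNN layers, where each layer's sequence of hidden states becomes the input sequence of the layer above (sizes and weights are illustrative):

```python
import numpy as np

def multilayer_rnn(xs, weights):
    """Run stacked vanilla RNN layers: layer l consumes the hidden states of
    layer l-1 as its input sequence. `weights` is a list of (Wxh, Whh) pairs."""
    inputs = xs
    for Wxh, Whh in weights:                 # loop over depth
        h = np.zeros(Whh.shape[0])
        outputs = []
        for x in inputs:                     # loop over time, same weights each step
            h = np.tanh(Wxh @ x + Whh @ h)
            outputs.append(h)
        inputs = outputs                     # this layer's states feed the next layer
    return inputs                            # hidden states of the top layer

H, D, T = 8, 5, 6
rng = np.random.default_rng(0)
weights = [(rng.standard_normal((H, D)), rng.standard_normal((H, H))),   # layer 1
           (rng.standard_normal((H, H)), rng.standard_normal((H, H)))]   # layer 2
top_states = multilayer_rnn(rng.standard_normal((T, D)), weights)
```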
Vanilla RNN Gradient Flow
ht = tanh(Whh * ht-1 + Wxh * xt)
[Figure: one vanilla RNN cell: ht-1 and xt are stacked, multiplied by W, and passed through tanh to produce ht]
Bengio et al, “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”, ICML 2013
Vanilla RNN Gradient Flow
Backpropagation from ht to ht-1 multiplies by W (actually Whh^T).
[Figure: the backward path passes through the tanh nonlinearity and the matrix multiply by W]
Vanilla RNN Gradient Flow
Computing the gradient of h0 involves many factors of W (and repeated tanh).
- Largest singular value > 1: exploding gradients
- Largest singular value < 1: vanishing gradients
[Figure: the unrolled graph h0 -> h1 -> h2 -> h3 -> h4 with inputs x1 ... x4; the backward path passes through W at every step]
Vanilla RNN Gradient Flow
- Largest singular value > 1: exploding gradients. Fix: gradient clipping, i.e. scale the gradient if its norm is too big.
- Largest singular value < 1: vanishing gradients.
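A minimal sketch of the gradient clipping heuristic (the threshold value is an arbitrary choice):

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    # If the gradient norm exceeds the threshold, rescale it so its norm equals the threshold.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.random.default_rng(0).standard_normal(1000) * 10.0
print(np.linalg.norm(clip_gradient(g)))   # <= 5.0
```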
Vanilla RNN Gradient Flow
- Largest singular value > 1: exploding gradients.
- Largest singular value < 1: vanishing gradients. Fix: change the RNN architecture.
Long Short Term Memory (LSTM)
Vanilla RNN:  ht = tanh(W * (ht-1; xt))
LSTM:         (i, f, o, g) = (sigmoid, sigmoid, sigmoid, tanh) applied to the four slices of W * (ht-1; xt)
              ct = f ⊙ ct-1 + i ⊙ g
              ht = o ⊙ tanh(ct)
Hochreiter and Schmidhuber, “Long Short-Term Memory”, Neural Computation 1997
Long Short Term Memory (LSTM) [Hochreiter et al., 1997]
The vector from below (x) and the vector from before (h) are stacked and multiplied by W (shape 4h x 2h), producing four h-dimensional gate vectors:
- i: input gate (sigmoid), whether to write to the cell
- f: forget gate (sigmoid), whether to erase the cell
- g: “gate gate” (tanh), how much to write to the cell
- o: output gate (sigmoid), how much to reveal the cell
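A minimal numpy sketch of one LSTM step following this gate layout, with a single weight matrix W of shape 4h x 2h acting on the stacked (h; x); the bias handling and the assumption that x has the same size as h are simplifications:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: W (4h x 2h) acts on the stacked (h_prev; x); the result is
    split into the i, f, o gates (sigmoid) and the g candidate (tanh).
    Then ct = f * ct-1 + i * g and ht = o * tanh(ct)."""
    H = h_prev.shape[0]
    a = W @ np.concatenate([h_prev, x]) + b   # (4h,) pre-activations
    i = sigmoid(a[0*H:1*H])                   # input gate: whether to write to the cell
    f = sigmoid(a[1*H:2*H])                   # forget gate: whether to erase the cell
    o = sigmoid(a[2*H:3*H])                   # output gate: how much to reveal the cell
    g = np.tanh(a[3*H:4*H])                   # gate gate: how much to write to the cell
    c = f * c_prev + i * g                    # elementwise; no matrix multiply on the cell path
    h = o * np.tanh(c)
    return h, c

H = 8
rng = np.random.default_rng(0)
W, b = rng.standard_normal((4*H, 2*H)) * 0.1, np.zeros(4*H)   # assumes x and h both have size h
h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((5, H)):
    h, c = lstm_step(x, h, c, W, b)
```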
Long Short Term Memory (LSTM) [Hochreiter et al., 1997]
[Figure: the LSTM cell: the stacked (ht-1, xt) produces the gates f, i, g, o; then ct = f ⊙ ct-1 + i ⊙ g and ht = o ⊙ tanh(ct)]
Long Short Term Memory (LSTM): Gradient Flow [Hochreiter et al., 1997]
Backpropagation from ct to ct-1 is only an elementwise multiplication by f; there is no matrix multiply by W.
[Figure: the LSTM cell with the backward path along the cell state highlighted]
Long Short Term Memory (LSTM): Gradient Flow [Hochreiter et al., 1997]
Uninterrupted gradient flow along the cell states c0 -> c1 -> c2 -> c3 -> ... Similar to ResNet's identity shortcuts! In between: Highway Networks (Srivastava et al, “Highway Networks”, ICML DL Workshop 2015).
[Figure: the ResNet layer stack shown alongside the chain of LSTM cell states]
Other RNN Variants
- GRU: Cho et al, “Learning phrase representations using RNN encoder-decoder for statistical machine translation”, 2014
- Jozefowicz et al, “An Empirical Exploration of Recurrent Network Architectures”, 2015
- Greff et al, “LSTM: A Search Space Odyssey”, 2015
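For reference, a minimal numpy sketch of a GRU step in the spirit of Cho et al. (biases omitted; note that sign conventions for the update gate differ between write-ups):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wr, Ur, Wz, Uz, Wh, Uh):
    """One GRU step: a reset gate r, an update gate z, and an additive
    interpolation between the old state and a candidate state."""
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde        # interpolate old and candidate state

H, D = 8, 5
rng = np.random.default_rng(0)
Wr, Wz, Wh = (rng.standard_normal((H, D)) * 0.1 for _ in range(3))
Ur, Uz, Uh = (rng.standard_normal((H, H)) * 0.1 for _ in range(3))
h = np.zeros(H)
for x in rng.standard_normal((4, D)):
    h = gru_step(x, h, Wr, Ur, Wz, Uz, Wh, Uh)
```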
Summary
- RNNs allow a lot of flexibility in architecture design
- Vanilla RNNs are simple but don’t work very well
- It is common to use LSTM or GRU: their additive interactions improve gradient flow
- The backward flow of gradients in an RNN can explode or vanish: exploding gradients are controlled with gradient clipping, vanishing gradients with additive interactions (LSTM)
- Better/simpler architectures are a hot topic of current research
- Better understanding (both theoretical and empirical) is needed
Next time: Midterm! Then Detection and Segmentation