Lecture 10: Recurrent Neural Networks - CS231n

Lecture 10: Recurrent Neural Networks

Fei-Fei Li & Justin Johnson & Serena Yeung

May 4, 2017

Administrative

- A1 grades will go out soon
- A2 is due today (11:59pm)
- Midterm is in-class on Tuesday! We will send out details on where to go soon


Extra Credit: Train Game

More details on Piazza by early next week


Last Time: CNN Architectures

AlexNet

Figure copyright Kaiming He, 2016. Reproduced with permission.


Last Time: CNN Architectures

[Figure: layer diagrams of VGG16, VGG19, and GoogLeNet: repeated stages of 3x3 conv and pooling layers topped by FC 4096 / FC 1000 layers and a softmax. Figure copyright Kaiming He, 2016. Reproduced with permission.]

Last Time: CNN Architectures

[Figure: ResNet: a deep stack of 3x3 conv layers organized into residual blocks, each computing F(x) + x through an identity shortcut. Figure copyright Kaiming He, 2016. Reproduced with permission.]

DenseNet and FractalNet

[Figure: DenseNet dense blocks, in which each layer's output is concatenated with all earlier feature maps, and FractalNet's recursive arrangement of conv layers. Figures copyright Larsson et al., 2017. Reproduced with permission.]


Last Time: CNN Architectures

AlexNet and VGG have tons of parameters in the fully connected layers.

AlexNet: ~62M parameters
- FC6: 256x6x6 -> 4096: 38M params
- FC7: 4096 -> 4096: 17M params
- FC8: 4096 -> 1000: 4M params
- ~59M params in FC layers!


Today: Recurrent Neural Networks


“Vanilla” Neural Network

one-to-one: a single fixed-size input maps to a single fixed-size output

Recurrent Neural Networks: Process Sequences

one-to-many: e.g. Image Captioning (image -> sequence of words)

Recurrent Neural Networks: Process Sequences

many-to-one: e.g. Sentiment Classification (sequence of words -> sentiment)

Recurrent Neural Networks: Process Sequences

many-to-many: e.g. Machine Translation (sequence of words -> sequence of words)

Recurrent Neural Networks: Process Sequences

many-to-many: e.g. Video classification at the frame level

Sequential Processing of Non-Sequence Data

Classify images by taking a series of “glimpses”

Ba, Mnih, and Kavukcuoglu, “Multiple Object Recognition with Visual Attention”, ICLR 2015. Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015 Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission.


Sequential Processing of Non-Sequence Data

Generate images one piece at a time!

Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015 Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission.


Recurrent Neural Network

[Diagram: an input x feeds into an RNN block that has a recurrent loop back into itself]

Recurrent Neural Network

[Diagram: x -> RNN -> y] We usually want to predict an output vector y at some time steps.

Recurrent Neural Network

We can process a sequence of vectors x by applying a recurrence formula at every time step:

h_t = f_W(h_{t-1}, x_t)

where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at that time step, and f_W is some function with parameters W.

Recurrent Neural Network

We can process a sequence of vectors x by applying a recurrence formula at every time step:

h_t = f_W(h_{t-1}, x_t)

Notice: the same function and the same set of parameters are used at every time step.

(Vanilla) Recurrent Neural Network

The state consists of a single “hidden” vector h:

h_t = tanh(W_hh h_{t-1} + W_xh x_t)
y_t = W_hy h_t
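A minimal NumPy sketch of this update (the weight names and shapes here are illustrative assumptions, not the lecture's code):

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why):
    """One time step of a vanilla RNN.

    x:      input vector at this time step, shape (D,)
    h_prev: previous hidden state, shape (H,)
    Wxh:    input-to-hidden weights, shape (H, D)
    Whh:    hidden-to-hidden weights, shape (H, H)
    Why:    hidden-to-output weights, shape (V, H)
    """
    h = np.tanh(Whh @ h_prev + Wxh @ x)   # new hidden state
    y = Why @ h                           # output scores at this step
    return h, y
```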

RNN: Computational Graph

[Diagram, built up over several slides: an initial hidden state h0 and the first input x1 are fed through fW to produce h1; h1 and x2 produce h2; and so on through h3, ..., hT.]

Re-use the same weight matrix at every time-step: the single matrix W feeds every fW node in the unrolled graph.
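Unrolling this graph in code is just a loop that re-applies the same update with the same weights at every step; a rough sketch (names are assumptions):

```python
import numpy as np

def rnn_forward(xs, h0, Wxh, Whh, Why):
    """Unroll a vanilla RNN over a whole sequence, re-using W at every step."""
    h, ys = h0, []
    for x in xs:                           # one fW node per time step
        h = np.tanh(Whh @ h + Wxh @ x)     # same function, same parameters
        ys.append(Why @ h)                 # per-step output
    return h, ys
```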

RNN: Computational Graph: Many to Many

[Diagram: every hidden state h1, h2, ..., hT produces an output y1, y2, ..., yT; each output has its own loss L1, L2, ..., LT, and the total loss L is their sum.]


RNN: Computational Graph: Many to One

[Diagram: the inputs x1, x2, x3, ... are processed through h1, ..., hT, and only the final hidden state hT produces the single output y.]

RNN: Computational Graph: One to Many

[Diagram: a single input x initializes the recurrence, and an output is produced at every time step y1, y2, ..., yT.]

Sequence to Sequence: Many-to-one + One-to-many

Many to one: encode the input sequence in a single vector.
One to many: produce the output sequence from that single input vector.

[Diagram: an encoder RNN with weights W1 consumes x1, x2, x3, ... and summarizes them in its final hidden state; a decoder RNN with weights W2 then unrolls from that state, emitting y1, y2, ... .]
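A compact sketch of this encoder/decoder pattern with two vanilla RNNs (all names, the zero start input, and feeding raw scores back in are simplifying assumptions):

```python
import numpy as np

def seq2seq(xs, n_out, Wxh1, Whh1, Wxh2, Whh2, Why, hidden_size):
    """Encode the input sequence into one vector, then decode an output sequence.

    (Wxh1, Whh1) play the role of the encoder weights W1,
    (Wxh2, Whh2, Why) the decoder weights W2.
    """
    h = np.zeros(hidden_size)
    for x in xs:                              # many to one: encode
        h = np.tanh(Whh1 @ h + Wxh1 @ x)
    y = np.zeros(Wxh2.shape[1])               # dummy start input for the decoder
    ys = []
    for _ in range(n_out):                    # one to many: decode
        h = np.tanh(Whh2 @ h + Wxh2 @ y)
        y = Why @ h                           # output fed back as the next input
        ys.append(y)
    return ys
```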

Example: Character-level Language Model

Vocabulary: [h, e, l, o]
Example training sequence: “hello”

[Diagram, built up over several slides: each input character is encoded as a one-hot vector, the hidden layer is unrolled over the sequence, and at every step the output layer produces scores for the next character, which are compared against the true next character in “hello”.]
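For concreteness, a tiny sketch of how the inputs and targets for the training sequence “hello” could be set up (the helper names are illustrative):

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(vocab)}

def one_hot(ch):
    """Encode a character as a one-hot vector over the vocabulary [h, e, l, o]."""
    x = np.zeros(len(vocab))
    x[char_to_ix[ch]] = 1
    return x

inputs = [one_hot(c) for c in "hell"]        # inputs at each time step
targets = [char_to_ix[c] for c in "ello"]    # the next character to predict
```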

Example: Character-level Language Model Sampling

Vocabulary: [h, e, l, o]

At test time, sample characters one at a time and feed the sample back to the model as the next input.

[Diagram, repeated over several slides: starting from “h”, the softmax over the vocabulary at each step (e.g. .03 / .13 / .00 / .84) is sampled to produce “e”, then “l”, “l”, “o”, and each sampled character is fed back in as the next input.]
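A rough NumPy sketch of this sampling loop (function and weight names are assumptions in the spirit of min-char-rnn.py):

```python
import numpy as np

def sample_chars(h, seed_ix, n, Wxh, Whh, Why, vocab_size):
    """Sample n character indices, feeding each sample back in as the next input."""
    x = np.zeros(vocab_size)
    x[seed_ix] = 1                                  # one-hot seed character
    sampled = []
    for _ in range(n):
        h = np.tanh(Whh @ h + Wxh @ x)              # vanilla RNN update
        scores = Why @ h
        p = np.exp(scores) / np.sum(np.exp(scores)) # softmax over the vocabulary
        ix = np.random.choice(vocab_size, p=p)      # sample the next character
        x = np.zeros(vocab_size)
        x[ix] = 1                                   # feed the sample back in
        sampled.append(ix)
    return sampled
```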

Backpropagation through time

Forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.

[Diagram: a long unrolled RNN whose per-step outputs all feed into a single Loss node.]

Truncated Backpropagation through time

Run forward and backward through chunks of the sequence instead of the whole sequence.

Truncated Backpropagation through time

Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps.

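A self-contained sketch of a truncated-BPTT training loop for a character-level vanilla RNN (the toy data, sizes, and bias-free model are assumptions made for brevity; min-char-rnn.py below is the full reference):

```python
import numpy as np

# Toy setup (illustrative only)
data = list("hello world hello world")
chars = sorted(set(data))
V, H, T = len(chars), 16, 8               # vocab size, hidden size, chunk length
ix = {c: i for i, c in enumerate(chars)}
Wxh, Whh, Why = [np.random.randn(*s) * 0.01 for s in [(H, V), (H, H), (V, H)]]
lr = 1e-1

def chunk_loss_and_grads(inputs, targets, h_prev):
    """Forward + backward over one chunk; gradients never flow past h_prev."""
    xs, hs, ps, loss = {}, {-1: h_prev}, {}, 0.0
    for t, (i_in, i_tgt) in enumerate(zip(inputs, targets)):      # forward
        xs[t] = np.zeros(V); xs[t][i_in] = 1
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1])
        scores = Why @ hs[t]
        ps[t] = np.exp(scores) / np.sum(np.exp(scores))
        loss += -np.log(ps[t][i_tgt])
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dh_next = np.zeros(H)
    for t in reversed(range(len(inputs))):                        # backward
        dy = ps[t].copy(); dy[targets[t]] -= 1                    # softmax grad
        dWhy += np.outer(dy, hs[t])
        dh = Why.T @ dy + dh_next
        draw = (1 - hs[t] ** 2) * dh                              # through tanh
        dWxh += np.outer(draw, xs[t])
        dWhh += np.outer(draw, hs[t - 1])
        dh_next = Whh.T @ draw
    return loss, (dWxh, dWhh, dWhy), hs[len(inputs) - 1]

h = np.zeros(H)                            # hidden state carried across chunks
for start in range(0, len(data) - T - 1, T):
    inputs  = [ix[c] for c in data[start : start + T]]
    targets = [ix[c] for c in data[start + 1 : start + T + 1]]
    loss, grads, h = chunk_loss_and_grads(inputs, targets, h)     # backprop only
    for W, dW in zip((Wxh, Whh, Why), grads):                     # within the chunk
        W -= lr * dW
```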

min-char-rnn.py gist: 112 lines of Python

(https://gist.github.com/karpathy/d4dee566867f8291f086)


[Diagram: x -> RNN -> y; a character-level RNN language model trained on a large corpus of text]

[Figure: generated text samples at increasing amounts of training: “at first” the samples are nearly random characters; with more and more training they look increasingly like real text.]


The Stacks Project: open source algebraic geometry textbook

LaTeX source: http://stacks.math.columbia.edu/
The Stacks Project is licensed under the GNU Free Documentation License.

[Figures: samples of generated LaTeX source from the model trained on the Stacks Project, and the rendered output]

Generated C code



Searching for interpretable cells

Karpathy, Johnson, and Fei-Fei: Visualizing and Understanding Recurrent Networks, ICLR Workshop 2016


[Figures: individual hidden cells whose activations track interpretable patterns: a quote detection cell, a line length tracking cell, an if statement cell, a quote/comment cell, and a code depth cell. Figures copyright Karpathy, Johnson, and Fei-Fei, 2015; reproduced with permission.]

Image Captioning

[Figure from Karpathy et al, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015; figure copyright IEEE, 2015. Reproduced for educational purposes.]

- Mao et al, “Explain Images with Multimodal Recurrent Neural Networks”
- Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”
- Vinyals et al, “Show and Tell: A Neural Image Caption Generator”
- Donahue et al, “Long-term Recurrent Convolutional Networks for Visual Recognition and Description”
- Chen and Zitnick, “Learning a Recurrent Visual Representation for Image Caption Generation”

[Diagram: a Convolutional Neural Network encodes the image, and a Recurrent Neural Network generates the caption.]

Image Captioning: test-time walk-through

[Diagram, built up over several slides (test image is CC0 public domain): the test image is pushed through the CNN and its final fully-connected feature vector v is kept. The RNN starts from a special <START> token x0, and the recurrence is conditioned on the image:

before: h = tanh(Wxh * x + Whh * h)
now:    h = tanh(Wxh * x + Whh * h + Wih * v)

At each step we sample a word from the output distribution y0, y1, y2, ... (here “straw”, then “hat”), feed it back in as the next input, and finish once the special <END> token is sampled.]
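A minimal sketch of this image-conditioned step (NumPy, assumed shapes; v would be the CNN's fully-connected feature vector):

```python
import numpy as np

def caption_rnn_step(x, h_prev, v, Wxh, Whh, Wih, Why):
    """One step of the caption RNN, conditioned on image features v.

    x:      current word vector, shape (D,)
    h_prev: previous hidden state, shape (H,)
    v:      image feature vector from the CNN, shape (F,)
    """
    h = np.tanh(Wxh @ x + Whh @ h_prev + Wih @ v)   # image enters every step
    y = Why @ h                                      # scores over the vocabulary
    return h, y
```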

Image Captioning: Example Results

Captions generated using neuraltalk2. All images are CC0 public domain: cat suitcase, cat tree, dog, bear, surfers, tennis, giraffe, motorcycle.

- A cat sitting on a suitcase on the floor
- A cat is sitting on a tree branch
- A dog is running in the grass with a frisbee
- A white teddy bear sitting in the grass
- Two people walking on the beach with surfboards
- A tennis player in action on the court
- Two giraffes standing in a grassy field
- A man riding a dirt bike on a dirt track

Image Captioning: Failure Cases

Captions generated using neuraltalk2. All images are CC0 public domain: fur coat, handstand, spider web, baseball.

- A bird is perched on a tree branch
- A woman is holding a cat in her hand
- A man in a baseball uniform throwing a ball
- A woman standing on a beach holding a surfboard
- A person holding a computer mouse on a desk

Image Captioning with Attention

The RNN focuses its attention at a different spatial location when generating each word.

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

Image Captioning with Attention

[Diagram, built up over several slides: the CNN turns the H x W x 3 image into a grid of features of shape L x D (one D-dimensional feature vector per spatial location). At each step, the hidden state produces a distribution a1, a2, ... over the L locations; a weighted combination of the features with these weights gives a context vector z of dimension D. The context vector and the previously generated word are fed into the RNN, which produces both the next attention distribution and a distribution over the vocabulary d1, d2, ... from which the next word is drawn.]

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
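A sketch of one soft-attention step (the bilinear scoring function and all names are assumptions for illustration; the paper's attention network differs in detail):

```python
import numpy as np

def soft_attention_step(features, h, Watt):
    """Attend over L spatial locations and build the context vector z.

    features: CNN feature grid, shape (L, D)
    h:        current hidden state, shape (H,)
    Watt:     hypothetical scoring weights, shape (D, H)
    """
    scores = features @ (Watt @ h)                  # one score per location, (L,)
    a = np.exp(scores) / np.sum(np.exp(scores))     # distribution over L locations
    z = a @ features                                # weighted combination, (D,)
    return a, z
```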

Image Captioning with Attention

Soft attention: take a weighted combination over all image locations.
Hard attention: force the model to select exactly one location (not differentiable, so it needs something fancier than vanilla backpropagation to train).

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

Image Captioning with Attention

[Figure: example captions with the corresponding attention maps.]

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

Visual Question Answering

Agrawal et al, “VQA: Visual Question Answering”, ICCV 2015 Zhu et al, “Visual 7W: Grounded Question Answering in Images”, CVPR 2016 Figure from Zhu et al, copyright IEEE 2016. Reproduced for educational purposes.


Visual Question Answering: RNNs with Attention

Zhu et al, “Visual 7W: Grounded Question Answering in Images”, CVPR 2016 Figures from Zhu et al, copyright IEEE 2016. Reproduced for educational purposes.


Multilayer RNNs

[Diagram: hidden states arranged along a depth axis as well as unrolled along the time axis; the hidden states of one layer are the inputs to the layer above, and the slide also shows the corresponding multilayer LSTM update.]
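A sketch of a two-layer (stacked) vanilla RNN step, in which the second layer's input is the first layer's new hidden state (names and shapes are assumptions):

```python
import numpy as np

def stacked_rnn_step(x, h1_prev, h2_prev, Wxh1, Whh1, Wxh2, Whh2):
    """One time step of a depth-2 vanilla RNN."""
    h1 = np.tanh(Wxh1 @ x + Whh1 @ h1_prev)     # layer 1 sees the raw input
    h2 = np.tanh(Wxh2 @ h1 + Whh2 @ h2_prev)    # layer 2 sees layer 1's state
    return h1, h2
```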

Vanilla RNN Gradient Flow

h_t = tanh(W_hh h_{t-1} + W_xh x_t) = tanh( W [h_{t-1}; x_t] )

[Diagram: h_{t-1} and x_t are stacked, multiplied by W, and passed through tanh to give h_t.]

Bengio et al, “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”, ICML 2013

Vanilla RNN Gradient Flow

Backpropagation from h_t to h_{t-1} multiplies by W (actually W_hh^T).

[Diagram: the backward pass flows through the tanh and then through the multiplication by W.]

Vanilla RNN Gradient Flow

[Diagram: a chain h0 -> h1 -> h2 -> h3 -> h4 with inputs x1, ..., x4.]

Computing the gradient of h0 involves many factors of W (and repeated tanh).

Vanilla RNN Gradient Flow

Computing the gradient of h0 involves many factors of W (and repeated tanh):
- Largest singular value > 1: Exploding gradients
- Largest singular value < 1: Vanishing gradients

Vanilla RNN Gradient Flow

Computing the gradient of h0 involves many factors of W (and repeated tanh):
- Largest singular value > 1: Exploding gradients -> Gradient clipping: scale the gradient if its norm is too big
- Largest singular value < 1: Vanishing gradients
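A common way to implement this, sketched in NumPy (the threshold value is an arbitrary assumption):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale the whole gradient if its global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]   # keep direction, shrink magnitude
    return grads
```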

Vanilla RNN Gradient Flow

Computing the gradient of h0 involves many factors of W (and repeated tanh):
- Largest singular value > 1: Exploding gradients -> Gradient clipping
- Largest singular value < 1: Vanishing gradients -> Change RNN architecture

Long Short Term Memory (LSTM)

Vanilla RNN:
h_t = tanh( W [h_{t-1}; x_t] )

LSTM:
i = sigmoid( W_i [h_{t-1}; x_t] )
f = sigmoid( W_f [h_{t-1}; x_t] )
o = sigmoid( W_o [h_{t-1}; x_t] )
g = tanh( W_g [h_{t-1}; x_t] )
c_t = f * c_{t-1} + i * g
h_t = o * tanh(c_t)

Hochreiter and Schmidhuber, “Long Short Term Memory”, Neural Computation 1997

Long Short Term Memory (LSTM) [Hochreiter et al., 1997]

- f: Forget gate, whether to erase the cell
- i: Input gate, whether to write to the cell
- g: Gate gate (?), how much to write to the cell
- o: Output gate, how much to reveal the cell

[Diagram: the vector from below (x) and the vector from before (h) are stacked and multiplied by a 4h x 2h matrix W; the resulting 4h vector is split into i, f, o (passed through sigmoids) and g (passed through tanh).]
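A NumPy sketch of one LSTM step using that single stacked weight matrix (biases are omitted, and x is assumed to have the same size h as the hidden state, matching the 4h x 2h shape in the diagram):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM time step.

    x, h_prev, c_prev: vectors of size H
    W:                 stacked gate weights, shape (4H, 2H)
    """
    H = h_prev.shape[0]
    a = W @ np.concatenate([x, h_prev])   # 4h pre-activations
    i = sigmoid(a[0*H:1*H])               # input gate
    f = sigmoid(a[1*H:2*H])               # forget gate
    o = sigmoid(a[2*H:3*H])               # output gate
    g = np.tanh(a[3*H:4*H])               # candidate values ("gate gate")
    c = f * c_prev + i * g                # additive cell update
    h = o * np.tanh(c)                    # reveal part of the cell
    return h, c
```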

Long Short Term Memory (LSTM) [Hochreiter et al., 1997]

[Diagram: x_t and h_{t-1} are stacked and multiplied by W to produce f, i, g, o; the cell state is updated as c_t = f * c_{t-1} + i * g, and h_t = o * tanh(c_t).]

Long Short Term Memory (LSTM): Gradient Flow [Hochreiter et al., 1997]

Backpropagation from c_t to c_{t-1} involves only an elementwise multiplication by f; there is no matrix multiply by W.

[Diagram: the backward pass flows along the cell state through the + node and the elementwise multiply by f.]
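In code, the cell-state part of the backward pass is just elementwise products (a sketch):

```python
import numpy as np

def cell_backward(dc, f, i, g):
    """Backprop through c_t = f * c_{t-1} + i * g: elementwise only, no W."""
    dc_prev = f * dc     # gradient reaching c_{t-1} is only rescaled by the forget gate
    di = g * dc          # gradient for the input gate
    dg = i * dc          # gradient for the candidate values
    return dc_prev, di, dg
```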

Long Short Term Memory (LSTM): Gradient Flow [Hochreiter et al., 1997]

Uninterrupted gradient flow along the chain of cell states c0 -> c1 -> c2 -> c3 -> ...: similar to the identity shortcuts in ResNet! In between the two sit Highway Networks.

[Figure: the ResNet layer stack shown alongside the chain of LSTM cell states.]

Srivastava et al, “Highway Networks”, ICML DL Workshop 2015

Other RNN Variants

GRU: Cho et al, “Learning phrase representations using RNN encoder-decoder for statistical machine translation”, 2014

- Jozefowicz et al, “An Empirical Exploration of Recurrent Network Architectures”, 2015
- Greff et al, “LSTM: A Search Space Odyssey”, 2015

Summary

- RNNs allow a lot of flexibility in architecture design
- Vanilla RNNs are simple but don’t work very well
- Common to use LSTM or GRU: their additive interactions improve gradient flow
- Backward flow of gradients in RNN can explode or vanish. Exploding is controlled with gradient clipping. Vanishing is controlled with additive interactions (LSTM)
- Better/simpler architectures are a hot topic of current research
- Better understanding (both theoretical and empirical) is needed

Next time: Midterm! Then Detection and Segmentation
