Deep Neural Networks: Introduction, Architectures and Implementations
Najeeb Khan
[email protected] usask.ca/~najeeb.khan
November 29, 2016
Outline
1 Introduction: Hard Problems, Datasets, Neural Networks
2 Architectures: Unsupervised Learning, Convolutional Neural Networks, Recurrent Neural Networks, Overfitting
3 Implementations: TensorFlow, Keras, Hardware
This Presentation
Image Source: © Ralph A. Clevenger
Machine Learning Tribes
The five tribes of machine learning: Symbolists, Connectionists, Evolutionaries, Bayesians, and Analogizers.
Graphic inspired by the first few chapters of The Master Algorithm (Domingos, 2015)
Traditional Machine Learning
Image Source: http://scikit-learn.org
Hard Problems: Animal or Food?
Image Source: Karen Zack, twitter.com/teenybiscuit
Hard Problems: Autonomous Formula One?
Image Source: RoboRace: http://roborace.com
Hard Problems: Should I?
Image Source: OpenReview: ICLR 2017
Datasets
YouTube-8M Dataset: 8 million videos, 56 years in total duration; available for download at research.google.com/youtube8m
Yahoo News Feed Dataset: interactions between users and news items; 20M users, 1.5 TB of text data
Plant Pictures Dataset: 2 TB of drone images, 800 GB of time-lapse images, 1 GB of sensor data
Datasets: MNIST database of handwritten digits (LeCun et al., 1998)
Training set: 60,000 examples; test set: 10,000 examples
State-of-the-art classification error rate: 0.21% (Wan et al., 2013)
Image Source: (Deng, 2012)
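As a pointer forward to the Keras part of the Implementations section, a minimal sketch of loading MNIST; it assumes Keras is installed with a working backend, and the shapes in the comments are the standard ones.

from keras.datasets import mnist

# Download (on first use) and load the standard train/test split.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train.shape)   # (60000, 28, 28) images and their digit labels
print(x_test.shape, y_test.shape)     # (10000, 28, 28)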
Datasets: Canadian Institute for Advanced Research-10 Dataset (CIFAR-10) (Krizhevsky and Hinton, 2009)
60,000 32x32 color images in 10 classes; 6,000 images per class
State-of-the-art classification error rate: 3.47% (Graham, 2014)
Artificial Neuron
Inputs x1, x2, x3 are weighted by w1, w2, w3 and summed: v = wx
Activation: a = ϕ(v) = 1 / (1 + e^(−v))

import numpy as np
x = np.array([[1], [2], [3]])        # input vector, shape (3, 1)
w = np.random.rand(1, 3)             # weight vector
v = w.dot(x)                         # v = wx
a = 1 / (1 + np.exp(-v))             # a = ϕ(v), the sigmoid activation
Neural Networks: Representation
A network with inputs x1, x2, x3, one hidden layer of five Σϕ units, and two outputs ŷ1, ŷ2 (biases are ignored for simplicity):
v1 = W21 x,  a1 = ϕ(v1) = 1 / (1 + e^(−v1))
v2 = W32 a1, a2 = ϕ(v2) = 1 / (1 + e^(−v2))
ŷ = a2

import numpy as np
x = np.array([[1], [2], [3]])          # input vector, shape (3, 1)
w21 = np.random.rand(5, 3)             # input-to-hidden weights
w32 = np.random.rand(2, 5)             # hidden-to-output weights
v1 = w21.dot(x)
a1 = 1 / (1 + np.exp(-v1))             # hidden-layer activations
v2 = w32.dot(a1)
a2 = 1 / (1 + np.exp(-v2))             # network output ŷ
Neural Networks: Prediction
Logistic Regression:
p(y = 1|x) = 1 / (1 + e^(−wx))
p(y = 0|x) = 1 − p(y = 1|x)
y* = arg max_i p(y = i|x)

Softmax Regression:
p(y = i|x) = e^(vi) / Σ_{j=1..J} e^(vj)
y* = arg max_i p(y = i|x)
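A minimal numpy sketch of these two prediction rules; the weights are random placeholders, chosen only to make the shapes concrete.

import numpy as np

x = np.array([[1.], [2.], [3.]])           # input, shape (3, 1)

# Logistic regression: p(y = 1|x) = 1 / (1 + e^(-wx))
w = np.random.rand(1, 3)
p1 = 1 / (1 + np.exp(-w.dot(x)))
y_star = int(p1 > 0.5)                     # arg max over the two classes

# Softmax regression over J = 4 classes: p(y = i|x) = e^(v_i) / sum_j e^(v_j)
W = np.random.rand(4, 3)
v = W.dot(x)                               # class scores
p = np.exp(v) / np.sum(np.exp(v))
y_star_softmax = int(np.argmax(p))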
Neural Networks: Learning
We will discuss learning weights for a logistic regression classifier.
Training example (x, t); the classifier outputs y = 1 when a = ϕ(wx) > 0.5.
Goal: give me weights w so that if I see x the uncertainty in predicting t is as small as possible.
Max p(y = t|x) = Ber(y = t|ϕ(wx))
p(y = t|x) = ϕ(wx)^t (1 − ϕ(wx))^(1−t)
J = −t log(ϕ(wx)) − (1 − t) log(1 − ϕ(wx))
w* = arg min J
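A numpy sketch of the cost J for one training example (x, t); the gradient line uses the standard result ∇w J = (ϕ(wx) − t) xᵀ, which is not derived on the slide but follows from the chain rule.

import numpy as np

x = np.array([[1.], [2.], [3.]])   # input, shape (3, 1)
t = 1.0                            # target label
w = np.random.rand(1, 3)           # current weights

a = 1 / (1 + np.exp(-w.dot(x)))                 # ϕ(wx), predicted probability
J = -t * np.log(a) - (1 - t) * np.log(1 - a)    # cross-entropy cost
grad = (a - t) * x.T                            # ∇w J, same shape as w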
Optimization: Direct Method
First order derivative test: ∂J/∂wk = 0 for k = 1...K, i.e., ∇w J = 0
Closed form solution:
Not available for most cases
Expensive for large/sparse problems
Model specific, e.g., w* = (X^T X)^(−1) X^T y for linear regression
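For the one case where a closed form is given on the slide, a numpy sketch with toy data; np.linalg.solve is used instead of forming the inverse explicitly, which is the usual numerically safer choice.

import numpy as np

X = np.random.rand(100, 3)                       # 100 examples, 3 features
true_w = np.array([[1.0], [-2.0], [0.5]])
y = X.dot(true_w) + 0.01 * np.random.randn(100, 1)

# w* = (X^T X)^(-1) X^T y, computed by solving the normal equations
w_star = np.linalg.solve(X.T.dot(X), X.T.dot(y))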
Optimization: Iterative Method
Gradient Descent:
∆wn = η∇J
wn+1 = wn − ∆wn
Stochastic Gradient Descent: ∇J is computed using a single training example
Batch Gradient Descent: ∇J is computed using the whole training set
Mini-batch Gradient Descent: ∇J is based on subsets of the training set
Example step: ∇J = −2.25, n = 0, η = 0.5, w = −4.5
Note: single-layer logistic and softmax regression losses are convex (Rennie, 2005), but a single layer can't tell food and pet apart!
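A minimal gradient-descent loop for the logistic-regression cost, on toy data with an illustrative learning rate; computing grad from all rows gives batch gradient descent, from a single row stochastic gradient descent, and from a small subset mini-batch gradient descent.

import numpy as np

X = np.random.rand(200, 3)                         # training inputs, one example per row
t = (X.sum(axis=1, keepdims=True) > 1.5) * 1.0     # toy binary targets, shape (200, 1)
w = np.zeros((3, 1))
eta = 0.5                                          # learning rate

for n in range(100):
    a = 1 / (1 + np.exp(-X.dot(w)))                # predictions for the whole set (batch GD)
    grad = X.T.dot(a - t) / len(X)                 # average gradient of the cross-entropy cost
    w = w - eta * grad                             # wn+1 = wn - η∇J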
NN without a Hidden Layer: Linear boundary
Demo created with playground.tensorflow.org
NN without a Hidden Layer: Under-fitting a non-linear boundary
Demo created with playground.tensorflow.org
Multi-layer Neural Networks
Universal Approximation Theorem (Cybenko, 1989): a hidden-layer neural network can represent any decision boundary given enough neurons and proper weights.
How do we get proper weights? We don't have ∇wh J with respect to the hidden layer weights!
Backpropagation Algorithm:
Output of the hidden layer: ϕ(W21 x)
Output of the final layer: ϕ(W32 ϕ(W21 x))
Cost function: J(t, ϕ(W32 ϕ(W21 x)))
Backpropagation uses the chain rule to find the derivative of J with respect to any weight w in the network.
The objective function J is no longer guaranteed to be convex or concave.
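A sketch of backpropagation for the 3-5-2 network used earlier, with sigmoid units; a squared-error cost is used here purely for illustration (the slides use cross-entropy for the single-neuron case), and ϕ'(v) = ϕ(v)(1 − ϕ(v)) supplies the derivative terms.

import numpy as np

x = np.array([[1.], [2.], [3.]])       # input
t = np.array([[1.], [0.]])             # target
W21 = np.random.rand(5, 3)             # input-to-hidden weights
W32 = np.random.rand(2, 5)             # hidden-to-output weights

# Forward pass
v1 = W21.dot(x);  a1 = 1 / (1 + np.exp(-v1))
v2 = W32.dot(a1); a2 = 1 / (1 + np.exp(-v2))
J = 0.5 * np.sum((a2 - t) ** 2)        # squared-error cost, for illustration

# Backward pass: chain rule, layer by layer
delta2 = (a2 - t) * a2 * (1 - a2)              # error at the output layer
grad_W32 = delta2.dot(a1.T)                    # gradient of J with respect to W32
delta1 = W32.T.dot(delta2) * a1 * (1 - a1)     # error propagated back to the hidden layer
grad_W21 = delta1.dot(x.T)                     # gradient of J with respect to W21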
Gradient Descent: Low Learning Rate
Example step: ∇J = −18.54518, w = −3.3, n = 0, η = 0.005
Gradient Descent: High Learning Rate
Example step: ∇J = −18.54518, w = −3.3, n = 0, η = 0.1
Stochastic Gradient Descent Variants
Momentum:
∆wn = η∇J + m∆wn−1
wn+1 = wn − ∆wn
Adaptive learning rates (Senior et al., 2013): exponential adaptation η0 × 10^(−n/τ), AdaGrad, AdaDelta, RMSProp, Adam
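A sketch of the momentum update on top of plain gradient descent; grad_J here is a stand-in for whatever gradient computation is used (for instance the logistic-regression gradient above), and η and m are illustrative values.

import numpy as np

def grad_J(w):
    # Stand-in gradient; here the gradient of 0.5*||w||^2, purely for illustration.
    return w

w = np.random.rand(3, 1)
delta_w = np.zeros_like(w)     # ∆w from the previous step, starts at zero
eta, m = 0.1, 0.9              # learning rate and momentum coefficient

for n in range(100):
    delta_w = eta * grad_J(w) + m * delta_w    # ∆wn = η∇J + m∆wn−1
    w = w - delta_w                            # wn+1 = wn − ∆wn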
Stochastic Gradient Descent Variants
Image Source: Alec Radford http://imgur.com/a/Hqolp
Impact of ML Research
“... we have implemented similar gradient update rules adapted to our clusters and they successfully improved our baselines. The resulted models have been or will be launched online and one fifth of the world’s population would benefit from these improved models.” — A Reviewer of Deep learning with Elastic Averaging SGD
Deep Learning
To learn complicated functions one may need deep architectures, e.g., neural networks with a large number of hidden layers.
Successive layers learn increasingly abstract features: edge detector, shape detector, abstract features, recognition.
Adapted from (Dean, 2016)
Deep Learning Architectures
Deep learning spans several families of architectures: unsupervised learning, convolutional nets, recursive nets, and reinforcement learning.
Unsupervised Learning
A good first reference is (Bengio, 2009).
Problems with deep architectures:
We don't have enough labeled data.
Gradient vanishing problem: with the update rule ∆wn = η∇J, wn+1 = wn − ∆wn, the gradient ∇W21 J(t, ϕ(W54 ϕ(W43 ϕ(W32 ϕ(W21 x))))) becomes very small for the early layers.
Unsupervised learning:
Utilizes unlabeled data
Uses layerwise pre-training
Unsupervised Pre-training
There is more information available in the form of data than in the form of labels.
Learn a good initialization of the weights using the unlabeled data.
Fine-tune the weights with labeled data.
Two popular models used for pre-initialization of weights are the autoencoder and the restricted Boltzmann machine (RBM).
Autoencoder: input layer x1..x6 feeds a hidden layer, whose output layer reconstructs the input as x̂1..x̂6.
RBM: a visible layer x1..x6 connected to a hidden layer h1..h4.
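A minimal sketch of a one-hidden-layer autoencoder in Keras (covered later under Implementations); the 784/64 layer sizes and the random training data are placeholders, and a working Keras installation is assumed.

import numpy as np
from keras.models import Model
from keras.layers import Input, Dense

inputs = Input(shape=(784,))                          # flattened input pixels
code = Dense(64, activation='sigmoid')(inputs)        # hidden features (the learned code)
outputs = Dense(784, activation='sigmoid')(code)      # reconstructed input
autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer='sgd', loss='mse')

x = np.random.rand(1000, 784)                         # unlabeled data (placeholder)
autoencoder.fit(x, x)                                 # the target is the input itself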
Layer-wise Pre-training
Training a deep network: input layer (x1..x8), hidden layers 1-3, output layer (ŷ1, ŷ2), with weights W21, W32, W43, W54.
Step 1: train an autoencoder on the raw input pixels to learn hidden features α (reconstructing the input).
Step 2: train an autoencoder on the features α to learn hidden features β (reconstructing α).
Step 3: train an autoencoder on the features β to learn hidden features γ (reconstructing β).
Step 4: train an output layer on the features γ to produce ŷ1, ŷ2.
Finally, fine-tune the whole network with labeled samples.
Variants of Autoencoders (Bengio et al., 2013)
Regularized Autoencoders, Sparse Autoencoders, Stacked Denoising Autoencoders, Contractive Autoencoders
Convolutional Neural Networks (Karpathy, 2016)
Developed for solving problems in computer vision
Inspired by the animal visual cortex
Parameter sharing makes them easier to train
Input size 64x64x3; suppose we train a hidden layer with 512 neurons
How many parameters do we need to train in a fully connected layer? 64x64x3x512 ≈ 6 million
If we use 10 kernels of size 5x5x3, we have 750 parameters to learn
Convolutional Neural Networks
Typical pipeline: Input Layer → Conv Layer → Pooling Layer → Fully Connected Layer → Softmax / SVM / ...
Convolutional Neural Networks
Assumes local connectivity
Generates activation maps using the convolution operation instead of dot products
1D convolution: (w ∗ x)n = Σm wm xn−m
Activation map: activ(n) = ReLU((w ∗ x)n), where ReLU(z) = max(0, z)
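A numpy sketch of the 1D convolution and the resulting activation map; np.convolve carries out the sum over m, and mode='same' keeps the output the same length as the input.

import numpy as np

x = np.random.randn(16)            # 1D input signal
w = np.array([0.25, 0.5, 0.25])    # 1D kernel

conv = np.convolve(x, w, mode='same')    # (w ∗ x)n = Σm wm xn−m
activ = np.maximum(0, conv)              # ReLU(z) = max(0, z): the activation map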
Convolutional Neural Networks
Raw pixels → ReLU(w ∗ x) → activation maps → max(activ) pooling → features, then repeat.
GoogLeNet
Figure – GoogLeNet network with all the bells and whistles (Szegedy et al., 2015): repeated blocks of 1x1, 3x3, and 5x5 convolutions with max pooling and depth concatenation, plus auxiliary softmax classifiers.
Residual Neural Networks
Residual block: the input x passes through weight layer → relu → weight layer to produce F(x); the identity x is added, giving F(x) + x, followed by a relu.
Figure – Residual Neural Networks (He et al., 2015)
Residual Neural Networks
Figure – 34-layer plain vs. 34-layer residual architectures: stacks of 3x3 convolutions (64, 128, 256, 512 filters) followed by average pooling and a 1000-way fully connected layer (He et al., 2015)
Residual Neural Networks
Figure – Training curves (error % vs. iterations) for plain networks (plain-20/32/44/56) and residual networks (ResNet-20/32/44/56/110) (He et al., 2015)
Recurrent Neural Networks (Lipton et al., 2015)
Modeling sequential and variable length data, e.g., what is happening in a video?
Extend feedforward neural networks with recursive edges.
Training is performed using Back-propagation Through Time (BPTT).
RNN Unfolding
The recurrent network is unrolled in time: inputs xt0, xt1, xt2, xt3 produce outputs yt0, yt1, yt2, yt3, with the same weights applied at every time step.
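A numpy sketch of the unrolled forward pass; the weight names Wxh, Whh, Why and the tanh non-linearity are illustrative choices, and the point is that the same weights are reused at every time step.

import numpy as np

T, n_in, n_h, n_out = 4, 3, 5, 2
x = np.random.randn(T, n_in, 1)          # inputs x_t0 .. x_t3
Wxh = np.random.randn(n_h, n_in)         # input-to-hidden weights
Whh = np.random.randn(n_h, n_h)          # recurrent (hidden-to-hidden) weights
Why = np.random.randn(n_out, n_h)        # hidden-to-output weights

h = np.zeros((n_h, 1))                   # initial hidden state
ys = []
for t in range(T):
    h = np.tanh(Wxh.dot(x[t]) + Whh.dot(h))   # shared weights at every time step
    ys.append(Why.dot(h))                     # output y_t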
Long Short Term Memory (LSTM)
Plain RNNs cannot model long term dependencies; LSTMs address this with a gated memory cell that controls what is stored, forgotten, and output at each step.
Source: Christopher Olah, http://colah.github.io
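A minimal Keras sketch of an LSTM applied to sequences; the sequence length, feature count, unit count, and the random data are all placeholders, and a working Keras installation is assumed.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, input_shape=(10, 8)))      # sequences of 10 steps, 8 features per step
model.add(Dense(1, activation='sigmoid'))     # one prediction for the whole sequence
model.compile(optimizer='adam', loss='binary_crossentropy')

x = np.random.rand(100, 10, 8)                # placeholder sequence data
y = np.random.randint(0, 2, size=(100, 1))    # placeholder binary labels
model.fit(x, y)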
Overfitting
Demo created with playground.tensorflow.org
Overfitting: Do we have enough data?
Consider we are classifying binary images of size 10 × 10.
We have a dataset containing 1 million labeled images.
Number of possible images? 2^100
Fraction of possible images for which we have labels? 10^6 / 2^100 ∼ 10^6 / 10^30 ∼ 10^−24, i.e., about 0.000,000,000,000,000,000,000,1%
Overfitting
Overcoming over-fitting:
Get more data: use crowd-sourcing, e.g., Amazon MTurk; use data augmentation.
Don't train too much: stop when the validation error starts ascending.
Use some form of regularization: penalize weights, e.g., add |W|^2 to J; damage neurons in innovative ways, e.g., DropOut (Srivastava et al., 2014), DropConnect (Wan et al., 2013), ShakeOut (Kang et al., 2016), etc. (see the sketch below); induce noise into your model.
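A numpy sketch of (inverted) dropout, the DropOut idea mentioned above: during training each hidden neuron is zeroed with probability 1 − p and the survivors are rescaled so that the activations can be used unchanged at test time; the layer size and keep probability are illustrative.

import numpy as np

a1 = np.random.rand(5, 1)     # hidden-layer activations
p = 0.5                       # keep probability

mask = (np.random.rand(*a1.shape) < p) / p   # zero out neurons at random, rescale survivors
a1_train = a1 * mask          # activations used during training
a1_test = a1                  # at test time the full activations are used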
Implementations
Image Source: http://imgur.com/ZfkhOt4
TensorFlow
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms (Abadi et al., 2015).
Represents computations as graphs: nodes in the graph are called ops; edges are tensors.
Represents data as tensors: a tensor is an n-dimensional array with a rank, shape and type, for example [batch, height, width, channels].
Executes graphs in the context of sessions: a session places the graph ops onto devices such as CPUs/GPUs.
Maintains state with variables: the parameters of a statistical model are typically represented as a set of variables.
Uses feeds and fetches to get data into and out of arbitrary operations.
TensorFlow Example I

import tensorflow as tf
# Define two constants
matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.], [2.]])
# Define a matmul operation
product = tf.matmul(matrix1, matrix2)
# Launch the default graph.
sess = tf.Session()
# Run the matmul operation
result = sess.run(product)
print(result)   # prints [[ 12.]]
# Close the Session.
sess.close()
TensorFlow Example II

state = tf.Variable(0.0, name="counter")
inc = tf.placeholder(tf.float32)
new_value = tf.add(state, inc)
update = tf.assign(state, new_value)
init_op = tf.initialize_all_variables()

# Launch the graph
with tf.Session() as sess:
    with tf.device("/gpu:0"):
        # Run the init op
        sess.run(init_op)
        # Run the op that updates state
        for _ in range(3):
            sess.run([update], feed_dict={inc: 0.5})
            print(sess.run(state))
Najeeb Khan (BIGLAB)
Deep Neural Networks
November 29, 2016
49 / 60
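To connect placeholders, variables, and an optimizer into a complete (if toy) training loop, the sketch below fits a single weight to y = 3x by gradient descent. The data and learning rate are illustrative, and it follows the same TensorFlow 0.x graph-mode API as the slides (later versions replace initialize_all_variables with global_variables_initializer):

import numpy as np
import tensorflow as tf

# Toy data for y = 3x (illustrative only, not from the slides).
x_data = np.linspace(0, 1, 100).astype(np.float32)
y_data = 3.0 * x_data

x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
w = tf.Variable(0.0)                              # single trainable weight
loss = tf.reduce_mean(tf.square(w * x - y))       # mean squared error
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for _ in range(200):
        sess.run(train_op, feed_dict={x: x_data, y: y_data})
    print(sess.run(w))                            # should approach 3.0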
Keras Example Feedforward Network

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(64, input_dim=20, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(10, init='uniform'))
model.add(Activation('softmax'))

Najeeb Khan (BIGLAB)
Deep Neural Networks
November 29, 2016
50 / 60
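The snippet above uses the Keras 1 argument names; assuming the later Keras 2 API, where init was renamed kernel_initializer and the activation can be passed to each layer directly, an equivalent sketch of the same network is:

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(64, input_dim=20, kernel_initializer='uniform', activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(64, kernel_initializer='uniform', activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(10, kernel_initializer='uniform', activation='softmax'))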
Keras Example Feedforward Network

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

model.fit(X_train, y_train, nb_epoch=20, batch_size=16)
score = model.evaluate(X_test, y_test, batch_size=16)
Najeeb Khan (BIGLAB)
Deep Neural Networks
November 29, 2016
51 / 60
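The slide leaves X_train, y_train, X_test, and y_test undefined; a sketch with random placeholder data (purely illustrative, sized to match the 20 inputs and 10 output classes above) makes the snippet runnable end to end:

import numpy as np
from keras.utils.np_utils import to_categorical

# Random placeholder data matching the model's 20 inputs and 10 classes
# (illustrative only; the slides do not specify a dataset).
X_train = np.random.random((1000, 20))
y_train = to_categorical(np.random.randint(10, size=1000), 10)
X_test = np.random.random((100, 20))
y_test = to_categorical(np.random.randint(10, size=100), 10)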
Hardware
NVIDIA GTX TITAN X
CUDA Cores: 3072
Performance: 11 teraflops
Core Clock: 1000 MHz
Price: ≈ $1000
source: http://www.geforce.com
Najeeb Khan (BIGLAB)
Deep Neural Networks
November 29, 2016
52 / 60
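To check whether TensorFlow actually places an operation on a GPU such as this one, the session can be asked to log device placement. The ConfigProto options below are standard graph-mode settings; allow_soft_placement is assumed so the sketch still runs on a CPU-only machine:

import tensorflow as tf

matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.], [2.]])

# Pin the operation to the first GPU, if one is available.
with tf.device("/gpu:0"):
    product = tf.matmul(matrix1, matrix2)

# log_device_placement prints which device each op actually runs on;
# allow_soft_placement falls back to the CPU when no GPU is present.
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(product))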
Hardware
NVIDIA DGX-1
CUDA Cores: 8 × 3584
Performance: 170 teraflops
Price: ≈ $130,000
source: http://www.geforce.com
Najeeb Khan (BIGLAB)
Deep Neural Networks
November 29, 2016
53 / 60
Hardware
Department of CS: Skorpio
University of Saskatchewan HPC: Plato, Zeno, Meton
Najeeb Khan (BIGLAB)
Deep Neural Networks
November 29, 2016
54 / 60
Want to learn more?
The Deep Learning textbook (Goodfellow, Bengio, and Courville) is available free of charge at deeplearningbook.org
Najeeb Khan (BIGLAB)
Deep Neural Networks
November 29, 2016
55 / 60
Questions?
Acknowledgment
Dr. Kevin Stanley
Dr. Ian Stavness
Dr. Jung Lee
Dr. Jawad Shah
Najeeb Khan (BIGLAB)
Deep Neural Networks
November 29, 2016
57 / 60
Citation
@unpublished{najeeb2016dnn,
  Author      = {Khan, Najeeb},
  Institution = {University of Saskatchewan},
  Year        = {2016},
  Title       = {Deep Neural Networks: Introduction, Architectures and Implementations},
  URL         = {usask.ca/~najeeb.khan/docs/dnn2016.pdf}
}
Najeeb Khan (BIGLAB)
Deep Neural Networks
November 29, 2016
58 / 60
References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314.

Dean, J. (2016). Large-scale deep learning for intelligent computer systems. Web Search and Data Mining.

Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142.

Domingos, P. (2015). The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books.

Graham, B. (2014). Fractional max-pooling. arXiv preprint arXiv:1412.6071.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
Najeeb Khan (BIGLAB)
Deep Neural Networks
November 29, 2016
59 / 60
References (cont.)

Kang, G., Li, J., and Tao, D. (2016). Shakeout: A new regularized deep neural network training scheme. In Thirtieth AAAI Conference on Artificial Intelligence.

Karpathy, A. (2016). CS231n: Convolutional Neural Networks for Visual Recognition. http://cs231n.stanford.edu/ [Accessed: October 20, 2016].

Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.

LeCun, Y., Cortes, C., and Burges, C. J. (1998). The MNIST database of handwritten digits.

Lipton, Z. C., Berkowitz, J., and Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019.

Rennie, J. D. (2005). Regularized logistic regression is strictly convex. Unpublished manuscript. URL: people.csail.mit.edu/jrennie/writing/convexLR.pdf.

Senior, A., Heigold, G., Yang, K., et al. (2013). An empirical study of learning rates in deep neural networks for speech recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6724–6728. IEEE.

Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9.

Wan, L., Zeiler, M., Zhang, S., Cun, Y. L., and Fergus, R. (2013). Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066.

Najeeb Khan (BIGLAB)
Deep Neural Networks
November 29, 2016
60 / 60