Qualitatively Characterizing Neural Network Optimization Problems
Ian Goodfellow
Oriol Vinyals
Andrew Saxe
Traditional view of NN training
Factored linear view (Cartoon of Saxe et al 2013’s worldview)
Attractive saddle point view (Cartoon of Dauphin et al 2014’s worldview)
Questions
• Does SGD get stuck in local minima?
• Does SGD get stuck on saddle points?
• Does SGD wind around numerous bumpy obstacles?
• Does SGD thread a twisting canyon?
History written by the winners
• Visualize trajectories of (near) SOTA results
• Selection bias: looking at success
• Failure is interesting, but hard to attribute to optimization
• Careful with interpretation
  • SGD never encounters X?
  • SGD fails if it encounters X?
2-D subspace visualization
(Figure: 3-D surface of the cost over a 2-D parameter subspace; axes: Projection 0, Projection 1, Cost)
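A surface like this can be generated by fixing two directions in parameter space and evaluating the objective on a grid around a reference point. The sketch below is a minimal illustration of that idea, not the authors' exact code; loss_fn, theta0, d1, and d2 are hypothetical placeholders (a cost function on a flat parameter vector, a center point such as the SGD solution, and two direction vectors).

```python
import numpy as np

def subspace_surface(loss_fn, theta0, d1, d2, radius=1.0, n=25):
    """Evaluate loss_fn on a grid spanning the plane theta0 + a*d1 + b*d2.

    loss_fn: maps a flat parameter vector to a scalar cost (hypothetical).
    theta0:  center of the plot, e.g. the SGD solution.
    d1, d2:  two direction vectors in parameter space.
    """
    alphas = np.linspace(-radius, radius, n)
    betas = np.linspace(-radius, radius, n)
    cost = np.empty((n, n))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            cost[i, j] = loss_fn(theta0 + a * d1 + b * d2)
    return alphas, betas, cost  # plot with e.g. matplotlib's plot_surface
```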
A special 1-D subspace
(Figure panels: interpolation plot, learning curve)
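The special 1-D subspace is the line segment from the initial parameters to the SGD solution; the interpolation plot evaluates the cost along that segment, shown next to the ordinary learning curve. A minimal sketch of the interpolation curve, assuming hypothetical loss_fn, theta_init, and theta_final inputs:

```python
import numpy as np

def linear_interpolation_curve(loss_fn, theta_init, theta_final, n=50):
    """Cost along theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final."""
    alphas = np.linspace(0.0, 1.0, n)
    costs = [loss_fn((1.0 - a) * theta_init + a * theta_final) for a in alphas]
    return alphas, np.array(costs)
```

The range of alpha can also be extended slightly beyond [0, 1] to examine the behavior just past the endpoints.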
2-parameter deep linear model
(Figure: cost surface over (w1, w2) on [-2, 2] x [-2, 2], showing the gradient field, the manifold of global minima, the saddle point, the gradient descent path, the linear path, and the interpolation between minima)
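The surface in this figure comes from a tiny factored (deep linear) model. As a hedged reconstruction, the sketch below assumes the scalar two-layer model y_hat = w2 * w1 * x fit to the identity target, so the cost is proportional to (w1*w2 - 1)^2; the exact target and scaling behind the slide's figure may differ, but this form reproduces the qualitative picture of a hyperbola of global minima and a saddle point at the origin.

```python
import numpy as np

def deep_linear_cost(w1, w2):
    """Assumed cost of the two-parameter deep linear model y_hat = w2 * w1 * x
    trained to reproduce y = x: global minima lie on w1 * w2 = 1,
    and (0, 0) is a saddle point."""
    return (w1 * w2 - 1.0) ** 2

w1, w2 = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
cost = deep_linear_cost(w1, w2)  # contour-plot this grid to recover the figure
```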
Maxout / MNIST experiment
Other activation functions
The linear interpolation curves for fully connected networks with different activation functions. Left) Sigmoid activation function. Right) ReLU activation function.
Here we use linear interpolation to search for local minima. Left) By interpolating between different SGD solutions, we show that each solution is a different local minimum within this subspace. Right) If we interpolate between a random point in space and an SGD solution, we find no local minima besides the SGD solution, suggesting that local minima are rare.
Convolutional network
A small barrier
The linear interpolation experiment for a convolutional maxout network on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009). Left) At a global scale, the curve looks very well-behaved. Right) Zoomed in near the initial point, we see there is a shallow barrier that SGD must navigate near this anomaly.
LSTM
MP-DBM
3-D Visualization
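One way to build such a 3-D trajectory view (an assumed construction, consistent with the axes named in the random-walk figure below) is to decompose each saved parameter vector into its projection along the initialization-to-solution axis and the norm of the orthogonal residual, then plot the cost above that plane. A minimal sketch, with loss_fn and checkpoints as hypothetical inputs:

```python
import numpy as np

def trajectory_coordinates(loss_fn, checkpoints, theta_init, theta_final):
    """For each saved parameter vector, return (projection along the
    init->solution axis, norm of the orthogonal residual, cost)."""
    axis = theta_final - theta_init
    axis = axis / np.linalg.norm(axis)
    coords = []
    for theta in checkpoints:
        delta = theta - theta_init
        along = float(delta @ axis)                       # progress toward the solution
        residual = np.linalg.norm(delta - along * axis)   # deviation off the straight line
        coords.append((along, residual, loss_fn(theta)))
    return np.array(coords)  # scatter/line-plot in 3-D
```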
3-D MP-DBM visualization
Random walk control
(Figure 11: plots of the projection along the axis from initialization to solution versus the norm …)
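The control replaces the SGD trajectory with an untrained random walk in parameter space and applies the same projection. A minimal sketch, with a hypothetical step-size parameter:

```python
import numpy as np

def random_walk(theta_init, n_steps, step_size=0.01, seed=0):
    """Generate an untrained random-walk trajectory in parameter space,
    to contrast with the SGD trajectory under the same 3-D projection."""
    rng = np.random.default_rng(seed)
    theta = theta_init.copy()
    path = [theta.copy()]
    for _ in range(n_steps):
        theta = theta + step_size * rng.standard_normal(theta.shape)
        path.append(theta.copy())
    return path
```

The resulting path can be fed through the same trajectory_coordinates decomposition sketched above for the SGD runs.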
3-D Plots Without Obstacles
(Figure panels: Adversarial, ReLUs, LSTM, Factored Linear)
3-D Plot of Adversarial Maxout
SGD naturally exploits negative curvature!
Obstacles!
Conclusion
• For most problems, there exists a linear subspace of monotonically decreasing objective values
• For some problems, there are obstacles between this subspace and the SGD path
• Factored linear models capture many qualitative aspects of deep network training
• See more visualizations at our poster / demo / paper