Qualitatively Characterizing Neural Network Optimization Problems

Ian Goodfellow, Oriol Vinyals, Andrew Saxe

Traditional view of NN training


Factored linear view (cartoon of Saxe et al. 2013's worldview)


Attractive saddle point view (cartoon of Dauphin et al. 2014's worldview)


Questions

• Does SGD get stuck in local minima?
• Does SGD get stuck on saddle points?
• Does SGD wind around numerous bumpy obstacles?
• Does SGD thread a twisting canyon?


History written by the winners
• Visualize trajectories of (near) SOTA results
• Selection bias: looking at success
• Failure is interesting, but hard to attribute to optimization
• Careful with interpretation:
  • SGD never encounters X?
  • SGD fails if it encounters X?


2-D subspace visualization

[Figure: cost surface plotted over a 2-D subspace of parameter space; axes: Projection 0, Projection 1, Cost]
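One way to reproduce a plot like this is to pick two directions in parameter space, evaluate the cost on a grid of their linear combinations, and surface-plot the result. A minimal NumPy sketch, assuming a `loss_fn` that maps a flat parameter vector to a scalar loss; the names `loss_fn`, `origin`, `dir0`, and `dir1` are placeholders, not from the paper:

```python
import numpy as np

def cost_over_2d_subspace(loss_fn, origin, dir0, dir1, extent=1.5, resolution=25):
    """Evaluate loss_fn on a grid spanned by two directions in parameter space.

    Returns (A0, A1, Z): coordinates along "Projection 0" / "Projection 1"
    and the cost at origin + a0 * u0 + a1 * u1.
    """
    # Orthonormalize the two directions so the two axes are comparable.
    u0 = dir0 / np.linalg.norm(dir0)
    u1 = dir1 - np.dot(dir1, u0) * u0
    u1 /= np.linalg.norm(u1)
    coords = np.linspace(-extent, extent, resolution)
    A0, A1 = np.meshgrid(coords, coords)
    Z = np.array([[loss_fn(origin + a0 * u0 + a1 * u1)
                   for a0, a1 in zip(row0, row1)]
                  for row0, row1 in zip(A0, A1)])
    return A0, A1, Z
```

A natural choice for `dir0` is the axis from initialization to solution, with `dir1` taken from a second run or a random direction and orthogonalized against it.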

A special 1-D subspace

[Figure panels: interpolation plot; learning curve]

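The special subspace is the straight line between the initial parameters and the SGD solution: theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final. A minimal sketch of this interpolation experiment, again assuming a `loss_fn` over flat parameter vectors:

```python
import numpy as np

def linear_interpolation_curve(loss_fn, theta_init, theta_final, num_points=101):
    """Evaluate the loss along the line segment between two parameter vectors.

    theta(alpha) = (1 - alpha) * theta_init + alpha * theta_final,
    with loss_fn evaluated on a fixed batch or dataset at each alpha.
    """
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = [loss_fn((1.0 - a) * theta_init + a * theta_final) for a in alphas]
    return alphas, np.array(losses)
```

Evaluating alpha slightly below 0 and above 1 is also informative, since the minimum along the line need not sit exactly at the SGD solution.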

2-parameter deep linear model

[Figure: cost surface over (w1, w2); legend: gradient, manifold of global minima, saddle point, gradient descent path, linear path, interpolation between minima]
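This figure can be reproduced with the simplest possible "deep" model: two scalar weights composed as y_hat = w2 * w1 * x. Fitting an identity target gives, up to constants, J(w1, w2) = (w1 * w2 - 1)^2, whose global minima form the hyperbola w1 * w2 = 1 and which has a saddle point at the origin. A sketch (the target value 1 is an illustrative assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

# Cost surface of the two-parameter deep linear model y_hat = w2 * w1 * x
# fitting the identity target: up to constants, J(w1, w2) = (w1*w2 - 1)^2.
w1, w2 = np.meshgrid(np.linspace(-2, 2, 201), np.linspace(-2, 2, 201))
J = (w1 * w2 - 1.0) ** 2

plt.contourf(w1, w2, J, levels=30)
xs = np.linspace(0.5, 2.0, 50)
# One branch of the hyperbola w1*w2 = 1 (the other branch lies in w1, w2 < 0).
plt.plot(xs, 1.0 / xs, "w--", label="manifold of global minima")
plt.scatter([0.0], [0.0], c="r", label="saddle point")
plt.xlabel("w1")
plt.ylabel("w2")
plt.legend()
plt.show()
```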

Maxout / MNIST experiment


Other activation functions

The linear interpolation curves for fully connected networks with different activation functions. Left) Sigmoid activation function. Right) ReLU activation function.


Figure 5: Here we use linear interpolation to search for local minima. Left) By interpolating between different SGD solutions, we show that each solution is a different local minimum within this subspace. Right) If we interpolate between a random point in space and an SGD solution, we find no local minima besides the SGD solution, suggesting that local minima are rare.
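The left panel is the same interpolation machinery pointed at a different pair of endpoints: two independently trained SGD solutions rather than an initialization and a solution. A barrier in the curve between them shows the two solutions are not connected by a monotonically decreasing linear path. A hypothetical reuse of the `linear_interpolation_curve` sketch above (`theta_a` and `theta_b` stand for parameter vectors from two independent runs):

```python
# theta_a, theta_b: flat parameter vectors from two independent SGD runs (assumed).
alphas, losses = linear_interpolation_curve(loss_fn, theta_a, theta_b)
# Height of the bump relative to the higher endpoint; a value > 0 suggests the
# two solutions are distinct local minima, at least within this 1-D subspace.
barrier = losses.max() - max(losses[0], losses[-1])
```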

Convolutional network

A small barrier

Figure 6: The linear interpolation experiment for a convolutional maxout network on the CIFAR-10 dataset (Krizhevsky & Hinton, 2009). Left) At a global scale, the curve looks very well-behaved. Right) Zoomed in near the initial point, we see there is a shallow barrier that SGD must navigate near this anomaly.

LSTM


MP-DBM


3-D Visualization


3-D MP-DBM visualization



Random walk control

[Figure 11: plots of the projection along the axis from initialization to solution versus the norm …]
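As a control, one can check whether the structure seen along SGD trajectories also appears in a trajectory that uses no gradient information at all. A minimal sketch of such a baseline, applying the same projection analysis to a Gaussian random walk (the step count and step scale are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def random_walk_projection(dim, num_steps=1000, step_scale=0.01, seed=0):
    """Project a Gaussian random walk onto the unit axis running from its
    start (the origin) to its final point, mimicking the SGD-path analysis."""
    rng = np.random.default_rng(seed)
    steps = step_scale * rng.standard_normal((num_steps, dim))
    path = np.vstack([np.zeros(dim), np.cumsum(steps, axis=0)])
    axis = path[-1] - path[0]
    axis /= np.linalg.norm(axis)
    on_axis = path @ axis  # coordinate of each visited point along the axis
    off_axis = np.linalg.norm(path - np.outer(on_axis, axis), axis=1)
    return on_axis, off_axis
```

Comparing the (on-axis, off-axis) scatter of the random walk against the SGD path makes clear which features of the trajectory are specific to gradient-driven optimization.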

3-D Plots Without Obstacles

[Figure panels: Adversarial ReLUs; LSTM; Factored Linear]

3-D Plot of Adversarial Maxout

SGD naturally exploits negative curvature!

Obstacles!


Conclusion
• For most problems, there exists a linear subspace of monotonically decreasing cost values
• For some problems, there are obstacles between this subspace and the SGD path
• Factored linear models capture many qualitative aspects of deep network training
• See more visualizations at our poster / demo / paper

