A PAC-Bayesian Approach to Generalization in Deep Learning
Behnam Neyshabur, Institute for Advanced Study
Joint work with Srinadh Bhojanapalli, David McAllester, Nati Srebro
Observations about Neural Nets
• Deep networks are over-parametrized: #parameters >> #samples
• Many global optima
• Some of them do not generalize well!
• Choice of optimization ⇒ different global minimum ⇒ different generalization
Requirements for a complexity measure that explains generalization
w: the parameter vector. R(w): a complexity measure, e.g. R(w) = ‖w‖₂.
1. {w : R(w) is small} has small capacity, i.e. small R(w) is sufficient for generalization.
2. Natural problems can be predicted by some w with small R(w).
3. The optimization algorithm biases us towards solutions in {w : R(w) is small}.
Outline
• From PAC-Bayes to Margin
• From PAC-Bayes to Sharpness
• Empirical investigation of three phenomena:
  • Fitting random labels (Zhang et al., 2016)
  • Different global minima (Keskar et al., 2016)
  • Large networks generalize better (Neyshabur et al., 2015)
Preliminaries
• Feedforward nets: f_w(x) = W_d φ(W_{d−1} φ(… φ(W_1 x)))
• d layers
• h hidden units in each layer
• ReLU activations φ(x) = max{0, x}
• B: bound on the ℓ₂-norm of the input x
• Margin loss: L_γ(f_w) = P_{(x,y)}[ score of y − max score of other labels ≤ γ ]
• Misclassification error: L_0(f_w)
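As a concrete illustration, here is a minimal NumPy sketch of these definitions; the network, the weights, and the data are placeholders introduced for illustration, not objects from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def f_w(weights, x):
    """Feedforward ReLU net: f_w(x) = W_d φ(W_{d-1} φ(... φ(W_1 x)))."""
    z = x
    for W in weights[:-1]:
        z = relu(W @ z)
    return weights[-1] @ z  # last layer outputs class scores, no activation

def margin_loss(weights, X, y, gamma):
    """Empirical margin loss L̂_γ: fraction of examples whose margin is ≤ γ.
    The misclassification error L_0 is the γ = 0 case."""
    errors = 0
    for x_i, y_i in zip(X, y):
        scores = f_w(weights, x_i)
        margin = scores[y_i] - np.max(np.delete(scores, y_i))
        errors += margin <= gamma
    return errors / len(y)
```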
Capacity Control
• Network size:
  • The capacity is too high.
  • Can’t explain any of the phenomena.
• Scale-sensitive capacity control:
  • Scale of the predictor, i.e. the weights
  • Scale of the predictions (margin or sharpness)
Margin
γ = score of the correct label − maximum score of other labels

Margin-based measures (capacities up to constants and log factors; a computational sketch follows below):
• ℓ₂-norm, with capacity ∝ (1/γ²) ∏_{i=1}^d 4‖W_i‖_F²  (Neyshabur et al. 2015)
• ℓ₁-path norm, with capacity ∝ (1/γ²) ‖w‖²_{path,1}  (Bartlett and Mendelson 2002)
• ℓ₂-path norm, with capacity ∝ (1/γ²) 4^d h^{d−1} ‖w‖²_{path,2}  (Neyshabur et al. 2015)
• spectral norm, with capacity ∝ (1/γ²) h^d ∏_{i=1}^d ‖W_i‖₂²  (Bartlett and Mendelson 2002), tightened by Bartlett et al. (2017), who replace the h^d factor with a sum of per-layer norm ratios

Notation: ‖·‖_F is the Frobenius norm, ‖·‖₂ the spectral norm, ‖·‖_p the ℓ_p norm of a vector, and ‖·‖_{path,p} the ℓ_p-path norm.
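As a rough sketch of how these measures could be computed for a list of layer matrices W_1, …, W_d; the helper names and the ones-vector trick for the path norms are my own choices, not from the slides.

```python
import numpy as np

def frobenius_product(weights):
    """∏_i ||W_i||_F^2 : product of squared Frobenius norms."""
    return np.prod([np.linalg.norm(W, 'fro') ** 2 for W in weights])

def spectral_product(weights):
    """∏_i ||W_i||_2^2 : product of squared spectral norms (largest singular values)."""
    return np.prod([np.linalg.norm(W, 2) ** 2 for W in weights])

def path_norm(weights, p=2):
    """ℓ_p path norm: (sum over all input-output paths of ∏_i |W_i[path]|^p)^(1/p),
    computed by pushing the all-ones vector through the entrywise |W|^p matrices."""
    v = np.ones(weights[0].shape[1])          # one entry per input unit
    for W in weights:
        v = (np.abs(W) ** p) @ v
    return np.sum(v) ** (1.0 / p)
```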
PAC-Bayes Theorem (McAllester 98): For any prior P and any δ ∈ (0,1), w.p. 1 − δ over the choice of the training set S, for any posterior Q:

E_{w∼Q}[L_0(f_w)] ≤ E_{w∼Q}[L̂_0(f_w)] + √( (KL(Q ‖ P) + ln(m/δ)) / (2(m−1)) )

What if we want to get generalization for a given weight w?
• Consider the distribution over w + u, where u is a random perturbation.
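For concreteness, a tiny Python helper evaluating the right-hand side of this bound once the empirical loss and the KL term are known; the function name and the example numbers are purely illustrative.

```python
import math

def pac_bayes_bound(empirical_loss, kl, m, delta):
    """Right-hand side of the PAC-Bayes bound:
    E_Q[L̂_0] + sqrt( (KL(Q||P) + ln(m/δ)) / (2(m-1)) )."""
    slack = math.sqrt((kl + math.log(m / delta)) / (2 * (m - 1)))
    return empirical_loss + slack

# e.g. pac_bayes_bound(empirical_loss=0.02, kl=5e3, m=50_000, delta=0.05)
```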
PAC-Bayes (2) Theorem: For any prior P and any δ ∈ (0,1), w.p. 1 − δ over the choice of the training set S, for any w and any Q over perturbations u:

E_{u∼Q}[L_0(f_{w+u})] ≤ E_{u∼Q}[L̂_0(f_{w+u})] + √( (KL(w+u ‖ P) + ln(m/δ)) / (2(m−1)) )
From Margin to PAC-Bayes
Large margin: a small perturbation of the parameters will not change the loss.
From PAC-Bayes to Margin
Lemma 1: For any P and any γ > 0, δ ∈ (0,1), w.p. 1 − δ over the choice of the training set S, for any w and any Q over u such that

P_{u∼Q}[ max_{x∈X} |f_{w+u}(x) − f_w(x)|_∞ < γ/4 ] ≥ 1/2,

we have:

L_0(f_w) ≤ L̂_γ(f_w) + 4 √( (KL(w+u ‖ P) + ln(6m/δ)) / (m−1) )

[Figure: the network with each layer perturbed, W₁ + U₁, W₂ + U₂, …]
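A Monte-Carlo sketch of how the lemma's perturbation condition might be checked in practice, assuming a Gaussian Q with scale σ and approximating the max over the domain X by a max over the available data; both assumptions and the function names are mine, not from the slides.

```python
import numpy as np

def perturbation_ok(f, weights, X, gamma, sigma, n_samples=100):
    """Monte-Carlo check of the lemma's condition: with probability >= 1/2 over
    u ~ N(0, sigma^2 I), the outputs move by less than gamma/4 in sup norm.
    `f(weights, x)` is the network's score function."""
    hits = 0
    for _ in range(n_samples):
        perturbed = [W + sigma * np.random.randn(*W.shape) for W in weights]
        max_change = max(
            np.max(np.abs(f(perturbed, x) - f(weights, x))) for x in X
        )
        hits += max_change < gamma / 4
    return hits / n_samples >= 0.5
```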
Generalization Bound for Neural Nets
Theorem: For any γ > 0, δ ∈ (0,1), w.p. 1 − δ over the choice of the training set, for any w:

L_0(f_w) ≤ L̂_γ(f_w) + O( √( ( B² d² h ln(dh) ∏_{i=1}^d ‖W_i‖₂² ∑_{i=1}^d (‖W_i‖_F² / ‖W_i‖₂²) + ln(dm/δ) ) / (γ² m) ) )

Proof idea: choose the prior and the posterior to both be independent Gaussian distributions.
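As an illustration, a sketch of how the complexity term inside the O(·) could be evaluated from the weight matrices; the function signature is assumed for illustration.

```python
import numpy as np

def spectral_pac_bayes_term(weights, B, gamma, m, delta, h):
    """Complexity term of the theorem:
    sqrt( (B^2 d^2 h ln(dh) * prod_i ||W_i||_2^2 * sum_i ||W_i||_F^2/||W_i||_2^2
           + ln(d m / delta)) / (gamma^2 m) )."""
    d = len(weights)
    spec = [np.linalg.norm(W, 2) for W in weights]
    frob = [np.linalg.norm(W, 'fro') for W in weights]
    prod_spec_sq = np.prod([s ** 2 for s in spec])
    ratio_sum = sum((f / s) ** 2 for f, s in zip(frob, spec))
    numerator = (B ** 2 * d ** 2 * h * np.log(d * h) * prod_spec_sq * ratio_sum
                 + np.log(d * m / delta))
    return np.sqrt(numerator / (gamma ** 2 * m))
```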
Experiments on True and Random Labels
[Figure: ℓ₂-norm, ℓ₂-path norm, ℓ₁-path norm, and spectral-norm measures as a function of the size of the training set (10K–50K), for networks trained on true labels vs. random labels.]
Sharpness
sharpness(α) = max_{|ν| ≤ α} L(w + ν) − L(w)   [Keskar et al. 2017]

Similar to margin, controlling sharpness alone is meaningless.

[Figure: sharpness as a function of the size of the training set (10K–50K), for networks trained on true labels vs. random labels.]
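A crude Python sketch of how sharpness could be estimated; Keskar et al. solve a constrained maximization, whereas this sketch only takes the best of a few random perturbations, so it is just an illustrative lower bound.

```python
import numpy as np

def sharpness_estimate(loss_fn, w, alpha, n_trials=50):
    """Rough estimate of sharpness(alpha) = max_{||v|| <= alpha} L(w+v) - L(w),
    using random directions of norm alpha (a lower bound on the true max)."""
    base = loss_fn(w)
    best = 0.0
    for _ in range(n_trials):
        v = np.random.randn(*w.shape)
        v *= alpha / np.linalg.norm(v)   # scale to the sphere of radius alpha
        best = max(best, loss_fn(w + v) - base)
    return best
```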
From PAC-Bayes to Sharpness
• Sharpness can be understood as one of the two terms in the PAC-Bayes bound (Dziugaite and Roy 2017).

E_ν[L(w + ν)] ≤ L̂(w) + ( E_ν[L̂(w + ν)] − L̂(w) ) + √( (KL(w+ν ‖ P) + ln(m/δ)) / (2(m−1)) )

The middle term is the expected sharpness; if P = N(0, σ²I) and ν ∼ N(0, σ²I), then KL(w+ν ‖ P) = ‖w‖₂² / (2σ²).

[Figure: expected sharpness vs. KL (×10⁸) for networks trained on true labels and on random labels, with training sets of size 5K, 10K, 30K, and 50K.]
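A small sketch of the two quantities plotted here, assuming a flattened weight vector w and a loss function loss_fn (both placeholders); the Monte-Carlo estimate of expected sharpness and the closed-form Gaussian KL term follow directly from the formulas above.

```python
import numpy as np

def expected_sharpness(loss_fn, w, sigma, n_samples=100):
    """Monte-Carlo estimate of E_ν[L̂(w+ν)] - L̂(w) with ν ~ N(0, σ² I)."""
    base = loss_fn(w)
    perturbed = [loss_fn(w + sigma * np.random.randn(*w.shape))
                 for _ in range(n_samples)]
    return np.mean(perturbed) - base

def gaussian_kl_term(w, sigma):
    """KL( N(w, σ² I) || N(0, σ² I) ) = ||w||_2^2 / (2 σ²)."""
    return np.dot(w.ravel(), w.ravel()) / (2 * sigma ** 2)
```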
Generating Different Global Minima
• We construct different global minima of the training loss for the same data, intentionally with different generalization properties. How?

Input: training set = CIFAR-10 with true labels, size 10,000.
Optimization: find a zero-error predictor on the training set ∪ a confusion set (CIFAR-10 with random labels, of varying sizes).
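A minimal sketch of this construction, assuming the clean and confusion images are already loaded as arrays; the helper name and seed handling are illustrative.

```python
import numpy as np

def build_training_plus_confusion(X_true, y_true, X_conf, num_classes=10, seed=0):
    """Combine the clean training set with a 'confusion set' whose labels are random.
    Training to zero error on the union steers optimization to different solutions,
    each of which is still a global minimum on the fixed clean training set."""
    rng = np.random.default_rng(seed)
    y_conf = rng.integers(0, num_classes, size=len(X_conf))   # random labels
    X = np.concatenate([X_true, X_conf])
    y = np.concatenate([y_true, y_conf])
    return X, y

# Varying len(X_conf), the confusion-set size, yields the different global minima.
```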
Different global minima
[Figure: comparing the global minima obtained with confusion sets of different sizes.]
Increasing the Network Size (Number of Hidden Units)
[Neyshabur, Tomioka, Srebro. ICLR’15]
[Figure: error as the number of hidden units grows from 32 to 8K.]

Experiments with varying number of hidden units
[Figure: expected sharpness vs. KL (×10⁶) for networks with 32, 128, 512, and 2048 hidden units.]
What we learned
• A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks.
• PAC-Bayesian theory can partly capture the generalization behavior in deep learning.
• How can we use this understanding in practice?
Optimization is Tied to Choice of Geometry
Steepest descent w.r.t. a geometry:

w^(t+1) = arg min_w  η ⟨∇L(w^(t)), w⟩ + δ(w, w^(t))

where δ(·,·) is the distance measure defined by the geometry:
✓ improve the objective as much as possible
✓ only a small change in the model

Examples (see the sketch below):
• Gradient Descent: steepest descent w.r.t. the ℓ₂ norm
• Coordinate Descent: steepest descent w.r.t. the ℓ₁ norm
• Path-SGD: steepest descent w.r.t. the path-ℓ₂ norm

What’s the geometry appropriate for deep networks?
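A minimal sketch contrasting two of these geometries; the closed-form ℓ₂ update (plain gradient descent) and the greedy single-coordinate ℓ₁ update are standard, and the interface is illustrative.

```python
import numpy as np

def steepest_descent_step_l2(w, grad, eta):
    """Steepest descent w.r.t. the ℓ₂ norm: minimizing η⟨grad, w'⟩ + ½‖w' − w‖₂²
    gives the usual gradient-descent update."""
    return w - eta * grad

def steepest_descent_step_l1(w, grad, eta):
    """Steepest descent w.r.t. the ℓ₁ geometry moves only along the coordinate
    with the largest gradient magnitude (greedy coordinate descent)."""
    step = np.zeros_like(w)
    i = np.argmax(np.abs(grad))
    step[i] = eta * np.sign(grad[i])
    return w - step
```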
Studying the landscape in search of a flat minimum in Alaska…