A PAC-Bayesian Approach to Generalization in Deep Learning
Behnam Neyshabur, Institute for Advanced Study
Joint work with Srinadh Bhojanapalli, David McAllester, Nati Srebro
Observations about Neural Nets
• Deep networks are over-parametrized: #parameters >> #samples
• Many global optima
• Some of them do not generalize well!
• Choice of optimization ⇒ different global minimum ⇒ different generalization
Requirements for a complexity measure that explains generalization
w: the parameter vector. R(w): a complexity measure, e.g. R(w) = ‖w‖₂.
1. {w : R(w) is small} has small capacity, i.e. small R(w) is sufficient for generalization.
2. Natural problems can be predicted by some w with small R(w).
3. The optimization algorithm biases us towards solutions in {w : R(w) is small}.
Outline
• From PAC-Bayes to Margin
• From PAC-Bayes to Sharpness
• Empirical investigation of three phenomena:
  • Fitting random labels (Zhang et al., 2016)
  • Different global minima (Keskar et al., 2016)
  • Large networks generalize better (Neyshabur et al., 2015)
Preliminaries
• Feedforward nets: f_w(x) = W_d φ(W_{d−1} φ(… φ(W_1 x)))
• d layers
• h hidden units in each layer
• ReLU activations φ(x) = max{0, x}
• B: bound on the ℓ₂-norm of the input x
• Margin loss: L_γ(f_w) = P_{(x,y)}[ score of y − max score of other labels ≤ γ ]
• Misclassification error: L_0(f_w)
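As a concrete illustration, here is a minimal NumPy sketch of these definitions; the network, the weights, and the data are placeholders introduced for illustration, not objects from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def f_w(weights, x):
    """Feedforward ReLU net: f_w(x) = W_d φ(W_{d-1} φ(... φ(W_1 x)))."""
    z = x
    for W in weights[:-1]:
        z = relu(W @ z)
    return weights[-1] @ z  # last layer outputs class scores, no activation

def margin_loss(weights, X, y, gamma):
    """Empirical margin loss L̂_γ: fraction of examples whose margin is ≤ γ.
    The misclassification error L_0 is the γ = 0 case."""
    errors = 0
    for x_i, y_i in zip(X, y):
        scores = f_w(weights, x_i)
        margin = scores[y_i] - np.max(np.delete(scores, y_i))
        errors += margin <= gamma
    return errors / len(y)
```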
Capacity Control
• Network size:
  • The capacity is too high.
  • Can’t explain any of the phenomena.
• Scale-sensitive capacity control:
  • Scale of the predictor, i.e. the weights
  • Scale of the predictions (margin or sharpness)
Margin
γ = score of the correct label − maximum score of other labels

Margin-based measures (capacities up to constants and log factors; a computational sketch follows below):
• ℓ₂-norm, with capacity ∝ (1/γ²) ∏_{i=1}^d 4‖W_i‖_F²  (Neyshabur et al. 2015)
• ℓ₁-path norm, with capacity ∝ (1/γ²) ‖w‖²_{path,1}  (Bartlett and Mendelson 2002)
• ℓ₂-path norm, with capacity ∝ (1/γ²) 4^d h^{d−1} ‖w‖²_{path,2}  (Neyshabur et al. 2015)
• spectral norm, with capacity ∝ (1/γ²) h^d ∏_{i=1}^d ‖W_i‖₂²  (Bartlett and Mendelson 2002), tightened by Bartlett et al. (2017), who replace the h^d factor with a sum of per-layer norm ratios

Notation: ‖·‖_F is the Frobenius norm, ‖·‖₂ the spectral norm, ‖·‖_p the ℓ_p norm of a vector, and ‖·‖_{path,p} the ℓ_p-path norm.
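As a rough sketch of how these measures could be computed for a list of layer matrices W_1, …, W_d; the helper names and the ones-vector trick for the path norms are my own choices, not from the slides.

```python
import numpy as np

def frobenius_product(weights):
    """∏_i ||W_i||_F^2 : product of squared Frobenius norms."""
    return np.prod([np.linalg.norm(W, 'fro') ** 2 for W in weights])

def spectral_product(weights):
    """∏_i ||W_i||_2^2 : product of squared spectral norms (largest singular values)."""
    return np.prod([np.linalg.norm(W, 2) ** 2 for W in weights])

def path_norm(weights, p=2):
    """ℓ_p path norm: (sum over all input-output paths of ∏_i |W_i[path]|^p)^(1/p),
    computed by pushing the all-ones vector through the entrywise |W|^p matrices."""
    v = np.ones(weights[0].shape[1])          # one entry per input unit
    for W in weights:
        v = (np.abs(W) ** p) @ v
    return np.sum(v) ** (1.0 / p)
```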
PAC-Bayes Theorem (McAllester 98): For any prior P and any δ ∈ (0,1), w.p. 1 − δ over the choice of the training set S, for any posterior Q:

E_{w∼Q}[L_0(f_w)] ≤ E_{w∼Q}[L̂_0(f_w)] + √( (KL(Q ‖ P) + ln(m/δ)) / (2(m−1)) )

What if we want to get generalization for a given weight w?
• Consider the distribution over w + u, where u is a random perturbation.
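For concreteness, a tiny Python helper evaluating the right-hand side of this bound once the empirical loss and the KL term are known; the function name and the example numbers are purely illustrative.

```python
import math

def pac_bayes_bound(empirical_loss, kl, m, delta):
    """Right-hand side of the PAC-Bayes bound:
    E_Q[L̂_0] + sqrt( (KL(Q||P) + ln(m/δ)) / (2(m-1)) )."""
    slack = math.sqrt((kl + math.log(m / delta)) / (2 * (m - 1)))
    return empirical_loss + slack

# e.g. pac_bayes_bound(empirical_loss=0.02, kl=5e3, m=50_000, delta=0.05)
```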
PAC-Bayes (2) Theorem: For any prior P and any δ ∈ (0,1), w.p. 1 − δ over the choice of the training set S, for any w and any Q over perturbations u:

E_{u∼Q}[L_0(f_{w+u})] ≤ E_{u∼Q}[L̂_0(f_{w+u})] + √( (KL(w+u ‖ P) + ln(m/δ)) / (2(m−1)) )
From Margin to PAC-Bayes
Large margin: a small perturbation of the parameters will not change the loss.
From PAC-Bayes to Margin
Lemma 1: For any P and any γ > 0, δ ∈ (0,1), w.p. 1 − δ over the choice of the training set S, for any w and any Q over u such that

P_{u∼Q}[ max_{x∈X} |f_{w+u}(x) − f_w(x)|_∞ < γ/4 ] ≥ 1/2,

we have:

L_0(f_w) ≤ L̂_γ(f_w) + 4 √( (KL(w+u ‖ P) + ln(6m/δ)) / (m−1) )

[Figure: the network with each layer perturbed, W₁ + U₁, W₂ + U₂, …]
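A Monte-Carlo sketch of how the lemma's perturbation condition might be checked in practice, assuming a Gaussian Q with scale σ and approximating the max over the domain X by a max over the available data; both assumptions and the function names are mine, not from the slides.

```python
import numpy as np

def perturbation_ok(f, weights, X, gamma, sigma, n_samples=100):
    """Monte-Carlo check of the lemma's condition: with probability >= 1/2 over
    u ~ N(0, sigma^2 I), the outputs move by less than gamma/4 in sup norm.
    `f(weights, x)` is the network's score function."""
    hits = 0
    for _ in range(n_samples):
        perturbed = [W + sigma * np.random.randn(*W.shape) for W in weights]
        max_change = max(
            np.max(np.abs(f(perturbed, x) - f(weights, x))) for x in X
        )
        hits += max_change < gamma / 4
    return hits / n_samples >= 0.5
```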
Generalization Bound for Neural Nets
Theorem: For any γ > 0, δ ∈ (0,1), w.p. 1 − δ over the choice of the training set, for any w:

L_0(f_w) ≤ L̂_γ(f_w) + O( √( ( B² d² h ln(dh) ∏_{i=1}^d ‖W_i‖₂² ∑_{i=1}^d (‖W_i‖_F² / ‖W_i‖₂²) + ln(dm/δ) ) / (γ² m) ) )

Proof idea: choose the prior and the posterior to both be independent Gaussian distributions.
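As an illustration, a sketch of how the complexity term inside the O(·) could be evaluated from the weight matrices; the function signature is assumed for illustration.

```python
import numpy as np

def spectral_pac_bayes_term(weights, B, gamma, m, delta, h):
    """Complexity term of the theorem:
    sqrt( (B^2 d^2 h ln(dh) * prod_i ||W_i||_2^2 * sum_i ||W_i||_F^2/||W_i||_2^2
           + ln(d m / delta)) / (gamma^2 m) )."""
    d = len(weights)
    spec = [np.linalg.norm(W, 2) for W in weights]
    frob = [np.linalg.norm(W, 'fro') for W in weights]
    prod_spec_sq = np.prod([s ** 2 for s in spec])
    ratio_sum = sum((f / s) ** 2 for f, s in zip(frob, spec))
    numerator = (B ** 2 * d ** 2 * h * np.log(d * h) * prod_spec_sq * ratio_sum
                 + np.log(d * m / delta))
    return np.sqrt(numerator / (gamma ** 2 * m))
```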
Experiments on True and Random Labels
[Figure: ℓ₂-norm, ℓ₂-path norm, ℓ₁-path norm, and spectral-norm measures as a function of the size of the training set (10K–50K), for networks trained on true labels vs. random labels.]
Sharpness
sharpness(α) = max_{|ν| ≤ α} L(w + ν) − L(w)   [Keskar et al. 2017]

Similar to margin, controlling sharpness alone is meaningless.

[Figure: sharpness as a function of the size of the training set (10K–50K), for networks trained on true labels vs. random labels.]
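A crude Python sketch of how sharpness could be estimated; Keskar et al. solve a constrained maximization, whereas this sketch only takes the best of a few random perturbations, so it is just an illustrative lower bound.

```python
import numpy as np

def sharpness_estimate(loss_fn, w, alpha, n_trials=50):
    """Rough estimate of sharpness(alpha) = max_{||v|| <= alpha} L(w+v) - L(w),
    using random directions of norm alpha (a lower bound on the true max)."""
    base = loss_fn(w)
    best = 0.0
    for _ in range(n_trials):
        v = np.random.randn(*w.shape)
        v *= alpha / np.linalg.norm(v)   # scale to the sphere of radius alpha
        best = max(best, loss_fn(w + v) - base)
    return best
```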
From PAC-Bayes to Sharpness
• Sharpness can be understood as one of the two terms in the PAC-Bayes bound (Dziugaite and Roy 2017).

E_ν[L(w + ν)] ≤ L̂(w) + ( E_ν[L̂(w + ν)] − L̂(w) ) + √( (KL(w+ν ‖ P) + ln(m/δ)) / (2(m−1)) )

The middle term is the expected sharpness; if P = N(0, σ²I) and ν ∼ N(0, σ²I), then KL(w+ν ‖ P) = ‖w‖₂² / (2σ²).

[Figure: expected sharpness vs. KL (×10⁸) for networks trained on true labels and on random labels, with training sets of size 5K, 10K, 30K, and 50K.]
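A small sketch of the two quantities plotted here, assuming a flattened weight vector w and a loss function loss_fn (both placeholders); the Monte-Carlo estimate of expected sharpness and the closed-form Gaussian KL term follow directly from the formulas above.

```python
import numpy as np

def expected_sharpness(loss_fn, w, sigma, n_samples=100):
    """Monte-Carlo estimate of E_ν[L̂(w+ν)] - L̂(w) with ν ~ N(0, σ² I)."""
    base = loss_fn(w)
    perturbed = [loss_fn(w + sigma * np.random.randn(*w.shape))
                 for _ in range(n_samples)]
    return np.mean(perturbed) - base

def gaussian_kl_term(w, sigma):
    """KL( N(w, σ² I) || N(0, σ² I) ) = ||w||_2^2 / (2 σ²)."""
    return np.dot(w.ravel(), w.ravel()) / (2 * sigma ** 2)
```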
Generating Different Global Minima
• We construct different global minima of the training loss for the same data, intentionally with different generalization properties. How?

Input: training set = CIFAR-10 with true labels, size 10,000.
Optimization: find a zero-error predictor on the training set ∪ a confusion set (CIFAR-10 with random labels, of varying sizes).
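A minimal sketch of this construction, assuming the clean and confusion images are already loaded as arrays; the helper name and seed handling are illustrative.

```python
import numpy as np

def build_training_plus_confusion(X_true, y_true, X_conf, num_classes=10, seed=0):
    """Combine the clean training set with a 'confusion set' whose labels are random.
    Training to zero error on the union steers optimization to different solutions,
    each of which is still a global minimum on the fixed clean training set."""
    rng = np.random.default_rng(seed)
    y_conf = rng.integers(0, num_classes, size=len(X_conf))   # random labels
    X = np.concatenate([X_true, X_conf])
    y = np.concatenate([y_true, y_conf])
    return X, y

# Varying len(X_conf), the confusion-set size, yields the different global minima.
```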
Different global minima
[Figure: comparing the global minima obtained with confusion sets of different sizes.]
Increasing the Network Size (Number of Hidden Units)
[Neyshabur, Tomioka, Srebro. ICLR’15]
[Figure: error as the number of hidden units grows from 32 to 8K.]

Experiments with varying number of hidden units
[Figure: expected sharpness vs. KL (×10⁶) for networks with 32, 128, 512, and 2048 hidden units.]
What we learned
• A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks.
• PAC-Bayesian theory can partly capture the generalization behavior in deep learning.
• How can we use this understanding in practice?
Optimization is Tied to Choice of Geometry
Steepest descent w.r.t. a geometry:

w^(t+1) = arg min_w  η ⟨∇L(w^(t)), w⟩ + δ(w, w^(t))

where δ(·,·) is the distance measure defined by the geometry:
✓ improve the objective as much as possible
✓ only a small change in the model

Examples (see the sketch below):
• Gradient Descent: steepest descent w.r.t. the ℓ₂ norm
• Coordinate Descent: steepest descent w.r.t. the ℓ₁ norm
• Path-SGD: steepest descent w.r.t. the path-ℓ₂ norm

What’s the geometry appropriate for deep networks?
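A minimal sketch contrasting two of these geometries; the closed-form ℓ₂ update (plain gradient descent) and the greedy single-coordinate ℓ₁ update are standard, and the interface is illustrative.

```python
import numpy as np

def steepest_descent_step_l2(w, grad, eta):
    """Steepest descent w.r.t. the ℓ₂ norm: minimizing η⟨grad, w'⟩ + ½‖w' − w‖₂²
    gives the usual gradient-descent update."""
    return w - eta * grad

def steepest_descent_step_l1(w, grad, eta):
    """Steepest descent w.r.t. the ℓ₁ geometry moves only along the coordinate
    with the largest gradient magnitude (greedy coordinate descent)."""
    step = np.zeros_like(w)
    i = np.argmax(np.abs(grad))
    step[i] = eta * np.sign(grad[i])
    return w - step
```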
Studying the landscape in search of a flat minimum in Alaska…