A PAC-Bayesian Approach to Generalization in Deep Learning

Behnam Neyshabur, Institute for Advanced Study
Joint work with Srinadh Bhojanapalli, David McAllester, and Nati Srebro

Observations about Neural Nets
• Deep networks are over-parametrized: #parameters >> #samples
• Many global optima
• Some of them do not generalize well!
• Choice of optimization ⇒ different global minimum ⇒ different generalization


Requirements for a complexity measure that explains generalization

w: the parameter vector. R(w): complexity measure, e.g. R(w) = ‖w‖₂.

1. {w : R(w) is small} has small capacity, i.e. small R(w) is sufficient for generalization.
2. Natural problems can be predicted by some w in {w : R(w) is small}.
3. The optimization algorithm biases us towards solutions in {w : R(w) is small}.


Outline
• From PAC-Bayes to Margin
• From PAC-Bayes to Sharpness
• Empirical investigation of three phenomena:
  • Fitting random labels (Zhang et al., 2016)
  • Different global minima (Keskar et al., 2016)
  • Large networks generalize better (Neyshabur et al., 2015)

Preliminaries
• Feedforward nets: $f_w(x) = W_d\,\phi(W_{d-1}\,\phi(\dots\phi(W_1 x)))$
• d layers
• h hidden units in each layer
• ReLU activations φ(x) = max{0, x}
• B: bound on the ℓ2-norm of the input x
• Margin loss: $L_\gamma(f_w) = P_{(x,y)}\big[\,\text{score of } y \;-\; \max_{y'\neq y}\text{score of } y' \;\le\; \gamma\,\big]$
• Misclassification error: $L_0(f_w)$
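To make the setup concrete, here is a minimal NumPy sketch of the d-layer ReLU network and the empirical margin loss defined above; it is not from the slides, and the helper names (`forward`, `empirical_margin_loss`) and the list-of-matrices weight representation are assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(weights, x):
    """f_w(x) = W_d phi(W_{d-1} phi(... phi(W_1 x))): ReLU between layers, none on the output."""
    a = x
    for W in weights[:-1]:
        a = relu(W @ a)
    return weights[-1] @ a

def empirical_margin_loss(weights, X, y, gamma):
    """Fraction of samples whose correct-class score does not beat every other score by more than gamma."""
    errors = 0
    for xi, yi in zip(X, y):
        scores = forward(weights, xi)
        margin = scores[yi] - np.max(np.delete(scores, yi))
        errors += margin <= gamma
    return errors / len(y)

# gamma = 0 recovers the misclassification error L_0(f_w).
```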


Capacity Control
• Network size:
  • The capacity is too high.
  • Can't explain any of the phenomena.
• Scale-sensitive capacity control:
  • Scale of the predictor, i.e. the weights
  • Scale of the predictions (margin or sharpness)


Margin

γ = score of the correct label − maximum score of other labels

Margin-based measures:
• ℓ2-norm with capacity ∝ $\dfrac{\prod_{i=1}^d \|W_i\|_F^2}{\gamma^2}$  (Neyshabur et al. 2015)
• ℓ1-path norm with capacity ∝ $\dfrac{\|w\|_{path,1}^2}{\gamma^2}$  (Bartlett and Mendelson 2002)
• ℓ2-path norm with capacity ∝ $\dfrac{h^d\,\|w\|_{path,2}^2}{\gamma^2}$  (Neyshabur et al. 2015)
• spectral norm with capacity ∝ $\dfrac{\prod_{i=1}^d \|W_i\|_2^2\,\big(\sum_{i=1}^d (\|W_i\|_{2,1}/\|W_i\|_2)^{2/3}\big)^3}{\gamma^2}$  (Bartlett et al. 2017)

Notation: ‖·‖_F is the Frobenius norm, ‖·‖_2 (of a matrix) is the spectral norm, ‖·‖_p is the ℓp norm of a vector, and ‖·‖_{path,p} is the ℓp-path norm.
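For illustration (not from the slides), the norm quantities appearing in these measures can be computed directly from the weight matrices of a plain fully connected ReLU net without biases; the ℓp-path norm is evaluated by propagating element-wise |W|^p through the layers, and the function names below are assumptions.

```python
import numpy as np

def frobenius_product(weights):
    """prod_i ||W_i||_F^2 -- the l2-norm based measure, up to the 1/gamma^2 factor."""
    return np.prod([np.linalg.norm(W, 'fro') ** 2 for W in weights])

def spectral_product(weights):
    """prod_i ||W_i||_2^2 using the largest singular value of each layer."""
    return np.prod([np.linalg.norm(W, 2) ** 2 for W in weights])

def path_norm(weights, p=2):
    """l_p path norm: (sum over all input-output paths of |prod of weights on the path|^p)^(1/p)."""
    M = np.abs(weights[0]) ** p
    for W in weights[1:]:
        M = (np.abs(W) ** p) @ M
    return float(np.sum(M) ** (1.0 / p))
```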


PAC-Bayes

Theorem (McAllester 98): For any P and any δ ∈ (0,1), w.p. 1−δ over the choice of the training set S, for any Q:

$$\mathbb{E}_{w\sim Q}[L_0(f_w)] \;\le\; \mathbb{E}_{w\sim Q}[\hat{L}_0(f_w)] + \sqrt{\frac{KL(Q\|P) + \ln\frac{m}{\delta}}{2(m-1)}}$$

What if we want to get generalization for a given weight w?
• Consider the distribution over w + u, where u is a random perturbation.
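The complexity term of the bound is easy to evaluate once KL(Q‖P), m, and δ are known; the helper below is purely illustrative.

```python
import numpy as np

def pac_bayes_slack(kl, m, delta):
    """sqrt((KL(Q||P) + ln(m/delta)) / (2(m-1))) -- the term added to the expected empirical loss."""
    return np.sqrt((kl + np.log(m / delta)) / (2.0 * (m - 1)))

# Example: with m = 50_000 samples, delta = 0.05 and KL = 10_000 nats,
# pac_bayes_slack(1e4, 50_000, 0.05) is roughly 0.32.
```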

PAC-Bayes (2)

Theorem: For any P and any δ ∈ (0,1), w.p. 1−δ over the choice of the training set S, for any w and any Q over u:

$$\mathbb{E}_{u\sim Q}[L_0(f_{w+u})] \;\le\; \mathbb{E}_{u\sim Q}[\hat{L}_0(f_{w+u})] + \sqrt{\frac{KL(w+u\,\|\,P) + \ln\frac{m}{\delta}}{2(m-1)}}$$


From Margin to PAC-Bayes

Large margin: a small perturbation of the parameters will not change the loss.


From PAC-Bayes to Margin

Lemma 1: For any P and any γ > 0, δ ∈ (0,1), w.p. 1−δ over the choice of the training set S, for any w and any Q over u such that

$$P_{u\sim Q}\Big[\max_{x\in X} |f_{w+u}(x) - f_w(x)|_\infty < \tfrac{\gamma}{4}\Big] \ge \tfrac{1}{2},$$

we have:

$$L_0(f_w) \;\le\; \hat{L}_\gamma(f_w) + 4\sqrt{\frac{KL(w+u\,\|\,P) + \ln\frac{6m}{\delta}}{m-1}}$$
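The precondition of the lemma can be checked empirically for a candidate Q, e.g. Gaussian perturbations of the weights; the sketch below estimates $P_{u\sim Q}[\max_x |f_{w+u}(x)-f_w(x)|_\infty < \gamma/4]$ by sampling, reusing the `forward` helper from the earlier sketch, with the perturbation scale `sigma` and the sample count as assumed parameters.

```python
import numpy as np

def perturbation_condition_holds(weights, X, gamma, sigma, n_samples=100, rng=None):
    """Estimate P_{u~Q}[ max_x |f_{w+u}(x) - f_w(x)|_inf < gamma/4 ] and compare it to 1/2."""
    rng = np.random.default_rng() if rng is None else rng
    base = np.array([forward(weights, x) for x in X])
    hits = 0
    for _ in range(n_samples):
        perturbed = [W + sigma * rng.standard_normal(W.shape) for W in weights]
        out = np.array([forward(perturbed, x) for x in X])
        hits += np.max(np.abs(out - base)) < gamma / 4
    return hits / n_samples >= 0.5
```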
[Figure: a feedforward network with each layer's weights perturbed, W_1 + U_1, W_2 + U_2, …]

Generalization Bound for Neural Nets

Theorem: For any γ > 0, δ ∈ (0,1), w.p. 1−δ over the choice of the training set:

$$L_0(f_w) \;\le\; \hat{L}_\gamma(f_w) + O\left(\sqrt{\frac{B^2 d^2 h \ln(dh)\,\prod_{i=1}^d \|W_i\|_2^2\,\sum_{i=1}^d \frac{\|W_i\|_F^2}{\|W_i\|_2^2} + \ln\frac{dm}{\delta}}{\gamma^2 m}}\right)$$

Proof idea: Choose the prior and the posterior both to be independent Gaussian distributions.
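Under the stated assumptions (depth d, width h, inputs of ℓ2 norm at most B), the quantity inside the O(·) can be evaluated numerically from the weight matrices; the sketch below drops the unspecified constants and is only meant to show how the norms enter the bound.

```python
import numpy as np

def spectral_bound_term(weights, B, h, gamma, m, delta):
    """sqrt of (d^2 h ln(dh) B^2 prod_i ||W_i||_2^2 sum_i ||W_i||_F^2/||W_i||_2^2 + ln(dm/delta)) / (gamma^2 m)."""
    d = len(weights)
    spec = [np.linalg.norm(W, 2) ** 2 for W in weights]
    frob = [np.linalg.norm(W, 'fro') ** 2 for W in weights]
    complexity = d**2 * h * np.log(d * h) * B**2 * np.prod(spec) * np.sum([f / s for f, s in zip(frob, spec)])
    return np.sqrt((complexity + np.log(d * m / delta)) / (gamma**2 * m))
```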

Experiments on True and Random Labels

[Figure: the ℓ2-norm, ℓ2-path norm, ℓ1-path norm, and spectral-norm measures plotted against the size of the training set (10K–50K), for networks trained on true labels vs. random labels.]

Sharpness

$$\mathrm{sharpness}(\alpha) = \max_{|\nu| \le \alpha}\, L(w+\nu) - L(w) \qquad \text{[Keskar et al. '17]}$$

Similar to margin, controlling sharpness alone is meaningless.

[Figure: sharpness plotted against the size of the training set (10K–50K) for true labels vs. random labels.]
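A crude way to approximate sharpness(α) is random search over perturbations of bounded norm (Keskar et al. solve a constrained maximization instead); `loss_fn`, the number of trials, and the uniform-in-a-ball sampling below are illustrative assumptions.

```python
import numpy as np

def sharpness(loss_fn, w, alpha, n_trials=200, rng=None):
    """Approximate max_{||nu|| <= alpha} L(w + nu) - L(w) by random search over perturbation directions."""
    rng = np.random.default_rng() if rng is None else rng
    base = loss_fn(w)
    worst = 0.0
    for _ in range(n_trials):
        direction = rng.standard_normal(w.shape)
        nu = alpha * rng.uniform() * direction / np.linalg.norm(direction)
        worst = max(worst, loss_fn(w + nu) - base)
    return worst
```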

From PAC-Bayesian to Sharpness

• Sharpness can be understood as one of the two terms in the PAC-Bayes bound (Dziugaite and Roy 2017).

$$\mathbb{E}_\nu[L(w+\nu)] \;\le\; \hat{L}(w) + \underbrace{\Big(\mathbb{E}_\nu[\hat{L}(w+\nu)] - \hat{L}(w)\Big)}_{\text{expected sharpness}} + \sqrt{\frac{KL(w+\nu\,\|\,P) + \ln\frac{m}{\delta}}{2(m-1)}}$$

If P = N(0, σ²I) and ν ∼ N(0, σ²I), then $KL(w+\nu\,\|\,P) = \frac{\|w\|_2^2}{2\sigma^2}$.

[Figure: expected sharpness vs. KL for networks trained on true labels and on random labels, with training set sizes 5K, 10K, 30K, and 50K.]
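With a Gaussian prior and posterior of variance σ², both terms are straightforward to estimate: average the perturbed empirical loss for the expected sharpness, and use KL = ‖w‖²/(2σ²) for the second term. The sketch below assumes a generic `loss_fn` over a flattened parameter vector.

```python
import numpy as np

def expected_sharpness(loss_fn, w, sigma, n_samples=100, rng=None):
    """E_nu[L_hat(w + nu)] - L_hat(w) for nu ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng() if rng is None else rng
    base = loss_fn(w)
    perturbed = [loss_fn(w + sigma * rng.standard_normal(w.shape)) for _ in range(n_samples)]
    return float(np.mean(perturbed) - base)

def gaussian_kl_term(w, sigma, m, delta):
    """sqrt((||w||^2/(2 sigma^2) + ln(m/delta)) / (2(m-1))) when P = N(0, sigma^2 I) and nu ~ N(0, sigma^2 I)."""
    kl = np.sum(w ** 2) / (2.0 * sigma ** 2)
    return np.sqrt((kl + np.log(m / delta)) / (2.0 * (m - 1)))
```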

Generating Different Global Minima

• We construct different global minima of the training loss for the same data, intentionally with different generalization properties. How?

Input: training set = CIFAR10 with true labels, size 10,000.

Optimization: find a zero-error predictor on: training set ∪ confusion set, where the confusion set is CIFAR10 with random labels and of varying size.
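A minimal sketch of the construction, assuming the images are already loaded as arrays (names such as `train_x`, `extra_x`, and the 10-class assumption are illustrative): keep true labels on the training set, append a confusion set of the desired size with uniformly random labels, and train the same architecture to zero error on the union.

```python
import numpy as np

def build_confused_dataset(train_x, train_y, extra_x, confusion_size, n_classes=10, rng=None):
    """Return (X, y): the true-label training set plus `confusion_size` extra examples with random labels."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(extra_x), size=confusion_size, replace=False)
    random_labels = rng.integers(0, n_classes, size=confusion_size)
    X = np.concatenate([train_x, extra_x[idx]], axis=0)
    y = np.concatenate([train_y, random_labels], axis=0)
    return X, y

# Training to zero error on datasets built with increasing confusion_size
# yields different global minima of the original training loss.
```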

Different Global Minima

[Figure: plots comparing the different global minima; detailed axis labels are not recoverable from the extraction.]

Increasing the Network Size (Number of Hidden Units)

[Neyshabur, Tomioka, Srebro. ICLR'15]


Experiments with Varying Number of Hidden Units

[Figure: expected sharpness and KL for networks with varying numbers of hidden units (32, 128, 512, 2048, 8K); expected sharpness is plotted against KL for each network width.]

What we learned
• A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks.
• PAC-Bayesian theory can partly capture the generalization behavior in deep learning.
• How can we use this understanding in practice?


Optimization is Tied to Choice of Geometry

Steepest descent w.r.t. a geometry:

$$w^{(t+1)} = \arg\min_w \; \eta\,\langle \nabla L(w^{(t)}),\, w\rangle + \delta(w, w^{(t)})$$

✓ improve the objective as much as possible
✓ only a small change in the model

Examples:
• Gradient Descent: steepest descent w.r.t. the ℓ2 norm
• Coordinate Descent: steepest descent w.r.t. the ℓ1 norm
• Path-SGD: steepest descent w.r.t. the path-ℓ2 norm

What's the geometry appropriate for deep networks?

Studying the landscape in search of a flat minimum in Alaska…

