How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift)
Santurkar, Shibani, et al., 2018
Presented by Agustinus Kristiadi
SDA Reading Group, 29 June 2018
Outline
1 Introduction
  BatchNorm
2 Debunking BatchNorm's claim
  Distributional Stability
  Internal Covariate Shift
3 Why BatchNorm really works
  Theoretical Results
  Is BatchNorm the best (and only) way to smoothen the loss landscape?
4 Conclusion
5 References
Introduction
BatchNorm
BatchNorm whitens the activations of a neural network, making them approximately distributed as N(0, I).
The activation statistics are computed over a mini-batch of data.
It is one of the breakthroughs in Deep Learning, allowing more effective and robust training.
Ioffe and Szegedy argued that this works because BN reduces "Internal Covariate Shift" (ICS).
They defined ICS loosely as the change in the distribution of network activations due to the change in network parameters during training.
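For concreteness, here is a minimal sketch of the per-feature BN transform over a mini-batch (training-time version only; gamma and beta are the usual learnable scale and shift, and eps is for numerical stability):

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    """Whiten activations x (shape: batch x features) using mini-batch statistics,
    then apply the learnable scale (gamma) and shift (beta)."""
    mu = x.mean(dim=0)                        # per-feature mean over the mini-batch
    var = x.var(dim=0, unbiased=False)        # per-feature variance over the mini-batch
    x_hat = (x - mu) / torch.sqrt(var + eps)  # approximately N(0, I) per feature
    return gamma * x_hat + beta
```

At test time, BN replaces the batch statistics with running averages accumulated during training; gamma and beta are learned together with the rest of the network.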
BatchNorm: Theoretical Foundation?
How and why BatchNorm works is still not well understood; few (if any) works try to analyze BN theoretically. In this paper, we ask two questions:
1 Is the effectiveness of BatchNorm indeed related to internal covariate shift (ICS)?
2 Is BN's stabilization of layer input distributions even effective in reducing ICS?
Debunking BatchNorm's claim
Experiment setup
1 VGGNet, on CIFAR10.
2 A 25-layer deep linear network, on synthetic Gaussian data.
Distributional Stability
Does stability of the activations' distributions really help?
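The paper tests this by injecting noise with a randomly drawn (non-zero) mean and (non-unit) variance after the BatchNorm layers, re-drawn at every training step, so the post-BN distributions are deliberately unstable; networks trained this way still perform comparably to standard BatchNorm. A minimal sketch of the idea, where the noise magnitudes are illustrative rather than the paper's exact settings:

```python
import torch

def noisy_batchnorm(x, bn):
    """Apply a BatchNorm layer, then deliberately destabilize the resulting
    distribution with a random scale and shift that are re-drawn at every call
    (i.e., at every training step)."""
    out = bn(x)
    scale = 1.0 + 0.5 * torch.rand(1, out.shape[1])   # random, non-unit scale
    shift = torch.randn(1, out.shape[1])              # random, non-zero-mean shift
    return out * scale + shift

# usage: bn = torch.nn.BatchNorm1d(128); h = noisy_batchnorm(torch.randn(64, 128), bn)
```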
Internal Covariate Shift (ICS)
Definition (Internal Covariate Shift)
Let L be the loss, W_1^{(t)}, ..., W_k^{(t)} be the parameters, and (x^{(t)}, y^{(t)}) be the batch of input-label pairs used to train the network at time t. We define the Internal Covariate Shift (ICS) of activation i at time t to be the difference ‖G_{t,i} − G'_{t,i}‖_2, where

$$G_{t,i} = \nabla_{W_i^{(t)}} L\big(W_1^{(t)}, \ldots, W_k^{(t)}; x^{(t)}, y^{(t)}\big)$$

$$G'_{t,i} = \nabla_{W_i^{(t)}} L\big(W_1^{(t+1)}, \ldots, W_{i-1}^{(t+1)}, W_i^{(t)}, W_{i+1}^{(t)}, \ldots, W_k^{(t)}; x^{(t)}, y^{(t)}\big)$$
Internal Covariate Shift (ICS)
That is, G_{t,i} is the gradient of layer i's parameters as usually computed in backprop at time t, while G'_{t,i} is the gradient of the same parameters after the previous layers' parameters have already been updated.
The difference between them is then the change in the optimization landscape of W_i caused by the change in its input. If ICS is reduced, then by definition G_{t,i} and G'_{t,i} have a small difference, i.e., they are more correlated.
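To make the definition concrete, here is a minimal sketch of how one could measure ICS for one layer of a deep linear network, using a single plain SGD step as the update to the preceding layers (the network shapes, loss, and learning rate are illustrative stand-ins, not the paper's exact training setup):

```python
import torch

def ics_of_layer(weights, i, x, y, lr=0.1):
    """ICS of layer i on one batch (x, y): the norm of the difference between
    layer i's gradient before (G_{t,i}) and after (G'_{t,i}) the preceding
    layers W_1, ..., W_{i-1} take their update, holding W_i, ..., W_k fixed."""
    def loss(ws):
        h = x
        for w in ws:            # deep *linear* network: plain matrix products
            h = h @ w
        return torch.nn.functional.mse_loss(h, y)

    ws = [w.detach().clone().requires_grad_(True) for w in weights]
    grads = torch.autograd.grad(loss(ws), ws)
    g = grads[i]                                               # G_{t,i}

    # One SGD step on the preceding layers only; layers i, ..., k stay at time t.
    ws_updated = [w - lr * gw for w, gw in zip(ws[:i], grads[:i])] + ws[i:]
    g_prime = torch.autograd.grad(loss(ws_updated), ws[i])[0]  # G'_{t,i}

    return (g - g_prime).norm().item()                         # ||G_{t,i} - G'_{t,i}||_2
```

For the first layer (i = 0) the two gradients coincide and the measured ICS is zero; the paper additionally reports the cosine angle between G_{t,i} and G'_{t,i} as a scale-invariant version of the same comparison.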
Why BatchNorm really works
Smoothing effect of BatchNorm
This paper shows [1] that BN improves the Lipschitz-ness [2] of:
1 the loss landscape
2 the gradients (β-smoothness [3])
A smoother landscape enables us to use larger step sizes, prevents exploding and vanishing gradients, and makes training robust to initialization.
Additional remark: it would be interesting to see what the curvature looks like with BN. Is there a connection to natural gradient?

[1] Assumption: the analysis is done only for a deep linear network with a single BatchNorm layer inserted at some point.
[2] f is K-Lipschitz if |f(x_1) − f(x_2)| ≤ K ‖x_1 − x_2‖.
[3] f is β-smooth if its gradient is β-Lipschitz.
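To see what these Lipschitz and β-smoothness statements mean empirically, the paper probes the landscape along the gradient direction at each training step and records how much the loss and the gradient change. A hedged sketch of such a probe; model, loss_fn, and the step sizes are placeholders, not the paper's exact configuration:

```python
import torch

def probe_smoothness(model, loss_fn, x, y, steps=(0.05, 0.1, 0.2, 0.4)):
    """At the current parameters, step along the negative gradient direction with
    several step sizes and record (step size, loss there, change in gradient).
    Small variation across steps indicates a more Lipschitz / beta-smooth landscape."""
    params = list(model.parameters())
    grads = torch.autograd.grad(loss_fn(model(x), y), params)

    originals = [p.detach().clone() for p in params]
    records = []
    for eta in steps:
        with torch.no_grad():
            for p, p0, g in zip(params, originals, grads):
                p.copy_(p0 - eta * g)              # move along the gradient direction
        new_loss = loss_fn(model(x), y)
        new_grads = torch.autograd.grad(new_loss, params)
        grad_change = sum((g - ng).norm() ** 2 for g, ng in zip(grads, new_grads)) ** 0.5
        records.append((eta, new_loss.item(), float(grad_change)))
    with torch.no_grad():
        for p, p0 in zip(params, originals):
            p.copy_(p0)                            # restore the original parameters
    return records
```

In the paper's measurements, the loss and gradient along this direction vary far less for networks with BN than without, which is the "smoother landscape" claim above.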
Smoothness of loss landscape with BN
Is BatchNorm the best (and only) way to smoothen the loss landscape?
Experiment: subtract the mini-batch mean as in BN, but use a different scaling: the ℓ_p norm instead of the standard deviation. This way, there is no distributional argument as in BN (the normalized activations are not pushed toward N(0, I)).
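A hedged sketch of such an ℓ_p-based variant for finite p; the exact normalization constant used in the paper may differ (this version scales by the mean-normalized ℓ_p norm of the centered activations):

```python
import torch

def lp_normalize(x, p=2, eps=1e-5):
    """Shift activations by the mini-batch mean (as in BN), but scale each feature
    by the (mean-normalized) l_p norm of its centered values instead of the
    standard deviation, so the output has no Gaussian distributional interpretation."""
    centered = x - x.mean(dim=0, keepdim=True)
    scale = centered.abs().pow(p).mean(dim=0, keepdim=True).pow(1.0 / p)
    return centered / (scale + eps)
```

With p = 2 this recovers the usual standard-deviation scaling; p = 1 gives an ℓ_1-style variant, while an ℓ_∞ version would use the per-feature maximum instead of the p-th root of the mean.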
Smoothing loss landscape without BN
Conclusion
BatchNorm works very well empirically, but we still do not fully understand why.
This paper examined (and debunked) the original BN claims: distributional stability and internal covariate shift are not why BN works.
Under some assumptions, this paper showed that BN improves the Lipschitz-ness of the loss landscape and of the gradients.
BN is not the only way to smoothen the loss landscape.
It is interesting to think about the connection with natural gradient.
References and Further Readings
Santurkar, Shibani, et al., “How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift)”, 2018 (pre-print) https://arxiv.org/abs/1805.11604