How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift)
Santurkar, Shibani, et al., 2018
Presented by Agustinus Kristiadi
SDA Reading Group, 29 June 2018
Outline
1 Introduction
  BatchNorm
2 Debunking BatchNorm's claim
  Distributional Stability
  Internal Covariate Shift
3 Why BatchNorm really works
  Theoretical Results
  Is BatchNorm the best (and only) way to smoothen the loss landscape?
4 Conclusion
5 References
Introduction
BatchNorm
BatchNorm whitens the activations of a neural network, making them approximately distributed as N(0, I).
The activation statistics are computed over a mini-batch of data.
It is one of the breakthroughs in Deep Learning, allowing more effective and robust training.
Ioffe and Szegedy argued that this works because BN reduces "Internal Covariate Shift" (ICS).
They defined ICS loosely as the change in the distribution of network activations due to the change in network parameters during training.
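For concreteness, here is a minimal sketch of the per-feature BN transform over a mini-batch (training-time version only; gamma and beta are the usual learnable scale and shift, and eps is for numerical stability):

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    """Whiten activations x (shape: batch x features) using mini-batch statistics,
    then apply the learnable scale (gamma) and shift (beta)."""
    mu = x.mean(dim=0)                        # per-feature mean over the mini-batch
    var = x.var(dim=0, unbiased=False)        # per-feature variance over the mini-batch
    x_hat = (x - mu) / torch.sqrt(var + eps)  # approximately N(0, I) per feature
    return gamma * x_hat + beta
```

At test time, BN replaces the batch statistics with running averages accumulated during training; gamma and beta are learned together with the rest of the network.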
BatchNorm: Theoretical Foundation?
How and why BatchNorm works is still not well understood; few (if any) works try to analyze BN theoretically. In this paper, we ask two questions:
1 Is the effectiveness of BatchNorm indeed related to internal covariate shift (ICS)?
2 Is BN's stabilization of layer input distributions even effective in reducing ICS?
Debunking BatchNorm's claim
Experiment setup
1 VGGNet, on CIFAR10.
2 A 25-layer deep linear network, on synthetic Gaussian data.
Distributional Stability
Does stability of the activations' distributions really help?
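The paper tests this by injecting noise with a randomly drawn (non-zero) mean and (non-unit) variance after the BatchNorm layers, re-drawn at every training step, so the post-BN distributions are deliberately unstable; networks trained this way still perform comparably to standard BatchNorm. A minimal sketch of the idea, where the noise magnitudes are illustrative rather than the paper's exact settings:

```python
import torch

def noisy_batchnorm(x, bn):
    """Apply a BatchNorm layer, then deliberately destabilize the resulting
    distribution with a random scale and shift that are re-drawn at every call
    (i.e., at every training step)."""
    out = bn(x)
    scale = 1.0 + 0.5 * torch.rand(1, out.shape[1])   # random, non-unit scale
    shift = torch.randn(1, out.shape[1])              # random, non-zero-mean shift
    return out * scale + shift

# usage: bn = torch.nn.BatchNorm1d(128); h = noisy_batchnorm(torch.randn(64, 128), bn)
```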
Internal Covariate Shift (ICS)
Definition (Internal Covariate Shift)
Let L be the loss, W_1^{(t)}, ..., W_k^{(t)} be the parameters, and (x^{(t)}, y^{(t)}) be the batch of input-label pairs used to train the network at time t. We define the Internal Covariate Shift (ICS) of activation i at time t to be the difference ‖G_{t,i} − G'_{t,i}‖_2, where

$$G_{t,i} = \nabla_{W_i^{(t)}} L\big(W_1^{(t)}, \ldots, W_k^{(t)}; x^{(t)}, y^{(t)}\big)$$

$$G'_{t,i} = \nabla_{W_i^{(t)}} L\big(W_1^{(t+1)}, \ldots, W_{i-1}^{(t+1)}, W_i^{(t)}, W_{i+1}^{(t)}, \ldots, W_k^{(t)}; x^{(t)}, y^{(t)}\big)$$
Internal Covariate Shift (ICS)
That is, G_{t,i} is the gradient of layer i's parameters as usually computed in backprop at time t, while G'_{t,i} is the gradient of the same parameters after the previous layers' parameters have already been updated.
The difference between them is then the change in the optimization landscape of W_i caused by the change in its input. If ICS is reduced, then by definition G_{t,i} and G'_{t,i} have a small difference, i.e., they are more correlated.
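To make the definition concrete, here is a minimal sketch of how one could measure ICS for one layer of a deep linear network, using a single plain SGD step as the update to the preceding layers (the network shapes, loss, and learning rate are illustrative stand-ins, not the paper's exact training setup):

```python
import torch

def ics_of_layer(weights, i, x, y, lr=0.1):
    """ICS of layer i on one batch (x, y): the norm of the difference between
    layer i's gradient before (G_{t,i}) and after (G'_{t,i}) the preceding
    layers W_1, ..., W_{i-1} take their update, holding W_i, ..., W_k fixed."""
    def loss(ws):
        h = x
        for w in ws:            # deep *linear* network: plain matrix products
            h = h @ w
        return torch.nn.functional.mse_loss(h, y)

    ws = [w.detach().clone().requires_grad_(True) for w in weights]
    grads = torch.autograd.grad(loss(ws), ws)
    g = grads[i]                                               # G_{t,i}

    # One SGD step on the preceding layers only; layers i, ..., k stay at time t.
    ws_updated = [w - lr * gw for w, gw in zip(ws[:i], grads[:i])] + ws[i:]
    g_prime = torch.autograd.grad(loss(ws_updated), ws[i])[0]  # G'_{t,i}

    return (g - g_prime).norm().item()                         # ||G_{t,i} - G'_{t,i}||_2
```

For the first layer (i = 0) the two gradients coincide and the measured ICS is zero; the paper additionally reports the cosine angle between G_{t,i} and G'_{t,i} as a scale-invariant version of the same comparison.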
Why BatchNorm really works
Smoothing effect of BatchNorm
This paper shows [1] that BN improves the Lipschitz-ness [2] of:
1 the loss landscape
2 the gradients (β-smoothness [3])
A smoother landscape enables us to use larger step sizes, prevents exploding and vanishing gradients, and makes training robust to initialization.
Additional remark: it would be interesting to see what the curvature looks like with BN. Is there a connection to natural gradient?

[1] Assumption: the analysis is done only for a deep linear network with a single BatchNorm layer inserted at some point.
[2] f is K-Lipschitz if |f(x_1) − f(x_2)| ≤ K ‖x_1 − x_2‖.
[3] f is β-smooth if its gradient is β-Lipschitz.
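To see what these Lipschitz and β-smoothness statements mean empirically, the paper probes the landscape along the gradient direction at each training step and records how much the loss and the gradient change. A hedged sketch of such a probe; model, loss_fn, and the step sizes are placeholders, not the paper's exact configuration:

```python
import torch

def probe_smoothness(model, loss_fn, x, y, steps=(0.05, 0.1, 0.2, 0.4)):
    """At the current parameters, step along the negative gradient direction with
    several step sizes and record (step size, loss there, change in gradient).
    Small variation across steps indicates a more Lipschitz / beta-smooth landscape."""
    params = list(model.parameters())
    grads = torch.autograd.grad(loss_fn(model(x), y), params)

    originals = [p.detach().clone() for p in params]
    records = []
    for eta in steps:
        with torch.no_grad():
            for p, p0, g in zip(params, originals, grads):
                p.copy_(p0 - eta * g)              # move along the gradient direction
        new_loss = loss_fn(model(x), y)
        new_grads = torch.autograd.grad(new_loss, params)
        grad_change = sum((g - ng).norm() ** 2 for g, ng in zip(grads, new_grads)) ** 0.5
        records.append((eta, new_loss.item(), float(grad_change)))
    with torch.no_grad():
        for p, p0 in zip(params, originals):
            p.copy_(p0)                            # restore the original parameters
    return records
```

In the paper's measurements, the loss and gradient along this direction vary far less for networks with BN than without, which is the "smoother landscape" claim above.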
Smoothness of loss landscape with BN
Is BatchNorm the best (and only) way to smoothen the loss landscape?
Experiment: subtract the mini-batch mean as in BN, but use a different scaling: the ℓ_p norm instead of the standard deviation. This way, there is no distributional argument as in BN (the normalized activations are not pushed toward N(0, I)).
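A hedged sketch of such an ℓ_p-based variant for finite p; the exact normalization constant used in the paper may differ (this version scales by the mean-normalized ℓ_p norm of the centered activations):

```python
import torch

def lp_normalize(x, p=2, eps=1e-5):
    """Shift activations by the mini-batch mean (as in BN), but scale each feature
    by the (mean-normalized) l_p norm of its centered values instead of the
    standard deviation, so the output has no Gaussian distributional interpretation."""
    centered = x - x.mean(dim=0, keepdim=True)
    scale = centered.abs().pow(p).mean(dim=0, keepdim=True).pow(1.0 / p)
    return centered / (scale + eps)
```

With p = 2 this recovers the usual standard-deviation scaling; p = 1 gives an ℓ_1-style variant, while an ℓ_∞ version would use the per-feature maximum instead of the p-th root of the mean.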
Smoothing loss landscape without BN
Conclusion
BatchNorm works very well empirically, but we still do not fully understand why.
This paper examined (and debunked) the original BN claims: distributional stability and internal covariate shift are not why BN works.
Under some assumptions, this paper showed that BN improves the Lipschitz-ness of the loss landscape and of the gradients.
BN is not the only way to smoothen the loss landscape.
It is interesting to think about the connection with natural gradient.
References and Further Readings
Santurkar, Shibani, et al., “How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift)”, 2018 (pre-print) https://arxiv.org/abs/1805.11604