Optimizing Neural Networks with Kronecker-factored Approximate Curvature
Martens and Grosse, ICML 2015
Agustinus Kristiadi
SDA Reading Group, 25 May 2018
Outline
- Introduction
  - Gradient Descent
  - Natural Gradient
  - Kronecker Product
- Kronecker Factored Approximate Curvature (K-FAC)
  - Method
  - Cost
  - Results
- Conclusion
- References
Gradient Descent
- Neural nets are mostly trained using gradient descent.
- Gradient descent is a first-order method, so it uses no curvature information.
Gradient Descent: Parametrization
- Gradient descent operates in Euclidean space with the Euclidean metric.
- This is like projecting a complicated manifold onto a flat representation, which distorts shapes, distances, etc.
- Training suffers from badly conditioned curvature.
Fisher Information Matrix
- Covariance of the gradient of the log-likelihood (checked numerically below):
  F := E[∇_θ log p_θ (∇_θ log p_θ)ᵀ] = Cov(∇_θ log p_θ)
- Equivalent to the Hessian of the KL-divergence between two distributions realized by parameters θ and θ′:
  F = ∇²_θ′ D_KL[p_θ ‖ p_θ′], evaluated at θ′ = θ
- Thus F gives curvature information in distribution space, with the KL-divergence as the metric.
- Second-order Taylor series expansion of the KL-divergence:
  D_KL[p_θ ‖ p_θ′] ≈ ½ (θ′ − θ)ᵀ F (θ′ − θ)
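A minimal numerical sketch of the covariance definition above, using a toy univariate Gaussian (the model, sample size, and variable names are my own choices, not from the slides): the empirical covariance of the score ∇_θ log p_θ recovers the analytic Fisher information 1/σ² for the mean parameter.

```python
import numpy as np

# Toy check: for x ~ N(mu, sigma^2), the Fisher information of mu is 1/sigma^2.
rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0

x = rng.normal(mu, sigma, size=1_000_000)  # samples from p_theta
score = (x - mu) / sigma**2                # d/dmu log N(x | mu, sigma^2)

print(score.var())   # ~0.25: covariance of the score
print(1 / sigma**2)  # 0.25: analytic Fisher information
```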
Natural Gradient
- Second-order method, similar to Newton's method.
- Use F instead of the Hessian in the update (sketched below):
  θ ← θ − α F⁻¹ ∇_θ L
- Can be seen as an "adaptive learning rate" that depends on the local curvature.
- Invariant to the parameterization of θ; only the distribution p_θ itself matters.
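A minimal sketch of the update rule above (the function name, default step size, and damping term are my additions; damping is a common practical safeguard, not something stated on this slide):

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, lr=0.1, damping=1e-3):
    """One natural-gradient update: theta <- theta - lr * F^{-1} grad."""
    # Damping keeps the linear solve well-posed when F is near-singular.
    F = fisher + damping * np.eye(theta.shape[0])
    return theta - lr * np.linalg.solve(F, grad)
```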
Natural Gradient: Problem
- Like the Hessian, F is expensive:
  - O(n²) storage.
  - O(n³) inversion.
  - n can be as high as 20M in VGGNet!
- Needs an approximation!
- AdaGrad, RMSProp, Adam: diagonal Fisher approximation (sketched below)!
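For contrast, a rough sketch of the diagonal idea behind RMSProp-style optimizers (names and constants are mine; real Adam additionally uses momentum and bias correction): keep a running average of squared gradients, i.e. a diagonal empirical-Fisher estimate, and precondition element-wise.

```python
import numpy as np

def rmsprop_like_step(theta, grad, avg_sq, lr=1e-3, beta=0.9, eps=1e-8):
    # Running average of squared gradients ~ diagonal empirical-Fisher estimate.
    avg_sq = beta * avg_sq + (1 - beta) * grad**2
    # Element-wise preconditioning: no O(n^2) storage, no O(n^3) inversion.
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)
    return theta, avg_sq
```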
Kronecker Product
- Let A ∈ ℝ^{m×n} and B ∈ ℝ^{p×q}.
- The Kronecker product is defined blockwise as:
  A ⊗ B = [ a_11 B  ⋯  a_1n B
              ⋮     ⋱    ⋮
            a_m1 B  ⋯  a_mn B ]
- A ⊗ B ∈ ℝ^{mp×nq}.
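A quick sketch of the definition and the resulting size, using NumPy's `np.kron` (the example matrices are arbitrary):

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)   # A in R^{2x3}
B = np.ones((4, 5))                # B in R^{4x5}

K = np.kron(A, B)                  # each entry a_ij becomes the block a_ij * B
print(K.shape)                     # (8, 15) = (2*4, 3*5), i.e. R^{mp x nq}
```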
K-FAC
- Consider the l-th layer of a neural net with input a_l and pre-activation output s_l := W_lᵀ a_l.
- The weight gradient is given by ∇_{W_l} L = a_l (∇_{s_l} L)ᵀ = a_l g_lᵀ.
- We have the identity vec(u vᵀ) = v ⊗ u.
- Approximate the FIM of this layer's weights with (see the sketch below):
  F_l := E[vec(∇_{W_l} L) vec(∇_{W_l} L)ᵀ] = E[g_l g_lᵀ ⊗ a_l a_lᵀ] ≈ E[g_l g_lᵀ] ⊗ E[a_l a_lᵀ] =: S_l ⊗ A_l
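A minimal sketch of estimating the two factors from a minibatch (function and variable names are mine; `acts` holds the layer inputs a_l and `grads` the back-propagated pre-activation gradients g_l, one row per example):

```python
import numpy as np

def kfac_factors(acts, grads):
    """Estimate A_l ~ E[a a^T] and S_l ~ E[g g^T] from a minibatch."""
    batch = acts.shape[0]
    A = acts.T @ acts / batch    # (n, n) input-covariance factor
    S = grads.T @ grads / batch  # (m, m) gradient-covariance factor
    return A, S
```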
K-FAC: Assumption
- K-FAC: F_l ≈ E[g_l g_lᵀ] ⊗ E[a_l a_lᵀ]
- K-FAC assumes independence between a_l and g_l.
- If we further assume that any two layers l and k with l ≠ k are independent, then F is block-diagonal, built from the per-layer blocks F_l.
Figure: Exact F, K-FAC approximation, and their difference.
K-FAC: Inversion
- Remember, for the natural gradient we need to compute F_l⁻¹ ∇_{W_l} L.
- In K-FAC, F_l ≈ S_l ⊗ A_l, so (verified numerically below):
  F_l⁻¹ vec(∇_{W_l} L) = (S_l ⊗ A_l)⁻¹ vec(∇_{W_l} L)
                       = (S_l⁻¹ ⊗ A_l⁻¹) vec(∇_{W_l} L)
                       = vec(A_l⁻¹ ∇_{W_l} L S_l⁻¹)
- Thus we only need to store and invert the two small factors, instead of the full Fisher matrix!
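A small numerical check of the identity above, under the column-stacking vec convention that makes vec(u vᵀ) = v ⊗ u hold (toy sizes and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                                # toy layer sizes

# Symmetric positive-definite Kronecker factors A_l (n x n) and S_l (m x m).
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)
S = rng.standard_normal((m, m)); S = S @ S.T + m * np.eye(m)

G = rng.standard_normal((n, m))            # stand-in for the gradient of L w.r.t. W_l
vec = lambda X: X.reshape(-1, order="F")   # column-stacking vec

# Naive route: build S (x) A explicitly and solve against vec(G).
naive = np.linalg.solve(np.kron(S, A), vec(G))

# K-FAC route: only the small factors are ever inverted.
kfac = vec(np.linalg.solve(A, G) @ np.linalg.inv(S))

print(np.allclose(naive, kfac))            # True
```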
K-FAC: Cost
- Assume a_l ∈ ℝ^n, s_l ∈ ℝ^m, and W_l ∈ ℝ^{n×m}:
  - A_l ∈ ℝ^{n×n}
  - S_l ∈ ℝ^{m×m}
  - Space cost is thus O(n² + m²), with inversion cost O(n³ + m³).
- Compare this to the cost of the exact F:
  - F ∈ ℝ^{nm×nm}
  - Space cost O(n²m²), inversion cost O(n³m³).
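A back-of-the-envelope comparison for a single hypothetical 4096×4096 layer (the sizes are my own example, counting float32 entries):

```python
n = m = 4096
exact_entries = (n * m) ** 2          # F in R^{nm x nm}: ~2.8e14 entries (~1 PB in float32)
kfac_entries = n**2 + m**2            # A_l and S_l:      ~3.4e7 entries (~134 MB)
print(exact_entries // kfac_entries)  # ~8.4 million times less storage
```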
K-FAC: Results
Conclusion
- Second-order methods have big benefits.
- Natural gradient takes steps in distribution space instead of parameter space.
- Natural gradient is expensive, so it needs an approximation.
- K-FAC assumes independence between the layers of a neural net and uses Kronecker products to factorize the Fisher matrix.
- Much cheaper than the exact natural gradient, but still retains much of the benefit.
References and Further Readings
- Martens, James, and Roger Grosse. "Optimizing Neural Networks with Kronecker-factored Approximate Curvature." International Conference on Machine Learning (ICML), 2015.
- Intro to natural gradient: https://wiseodd.github.io/techblog/2018/03/14/natural-gradient/