Deep Learning: Review & Discussion
Chiyuan Zhang, CSAIL, CBMM

2015.07.22

Overview

• What has been done?
  • Applications
  • Main Challenges
  • Empirical Analysis
  • Theoretical Analysis
• What is to be done?

Deep Learning

What has been done?

Applications

• Computer Vision
  • ConvNets, dominating
• Speech Recognition
  • Deep Nets, Recurrent Neural Networks (RNNs), dominating; industrial deployment
• Natural Language Processing
  • Matched the previous state of the art, but no revolutionary results yet
• Reinforcement Learning, Structured Prediction, Graphical Models, Unsupervised Learning, …
  • “Unrolling” iterations as NN layers

Image Classification

• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
  http://image-net.org/challenges/LSVRC/
• Tasks
  • Classification: 1000-way multiclass learning
  • Detection: classify and locate (bounding box)
• State of the art
  • ConvNets since 2012



Olga Russakovsky, …, Andrej Karpathy, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575 [cs.CV].

Surpassing “Human Level” Performance

• Try it yourself: http://cs.stanford.edu/people/karpathy/ilsvrc/
• For humans, this is a difficult and painful task (1000 classes)
• One person trained himself on 500 images and then tested on 1,500 (!!) images
  • ~1 minute to classify 1 image: ~25 hours…
  • ~5% error, the so-called “human level” performance
• Humans and machines make different kinds of errors; for details see
  http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/

ConvNets on ImageNet

• e.g. Google “Inception” (GoogLeNet): 27 layers, ~7M parameters; VGG: ~100M parameters (Table 2, arXiv:1409.1556)
• ImageNet challenge training set: ~1.2M images (number of parameters p > number of training samples N)
• Typically takes ~1 week to train on a decent GPU node
• Models pre-trained on ImageNet turn out to be very good feature extractors or initializations for many other vision-related tasks, even on different datasets; popular in both academia and industry (startups)
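As a concrete illustration of the feature-extractor use case, here is a minimal Python/NumPy sketch. The function `extract_features` is a hypothetical handle to the forward pass of a pre-trained ConvNet up to its last hidden layer (not a specific library API); a linear softmax classifier is then fit on the fixed features.

```python
import numpy as np

def transfer_learning_sketch(extract_features, train_images, train_labels, num_classes):
    """Use a fixed pre-trained net as a feature extractor, then fit a linear
    softmax classifier on top (no fine-tuning of the conv layers).
    `extract_features` is a placeholder for the pre-trained net's forward pass."""
    X = extract_features(train_images)            # (N, D) fixed features
    W = 0.01 * np.random.randn(X.shape[1], num_classes)
    b = np.zeros(num_classes)
    for _ in range(100):                          # plain full-batch gradient descent
        scores = X @ W + b
        scores -= scores.max(axis=1, keepdims=True)
        probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
        probs[np.arange(len(X)), train_labels] -= 1.0   # now d(loss)/d(scores)
        W -= 0.1 * (X.T @ probs / len(X))
        b -= 0.1 * probs.mean(axis=0)
    return W, b
```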

[Figure 3 (from the GoogLeNet paper): “GoogLeNet network with all the bells and whistles”; the full architecture diagram (stacked Conv / MaxPool / LocalRespNorm / DepthConcat layers with two auxiliary softmax heads and a final softmax) is omitted here.]
Fancier Applications: Image Captioning

Andrej Karpathy and Li Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. CVPR 2015.
Kelvin Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015.
Remi Lebret et al. Phrase-based Image Captioning. ICML 2015. …

Unrolling Iterative Algorithms as Layers of Deep Nets

Shuai Zheng et al. “Conditional Random Fields as Recurrent Neural Networks.” arXiv cs.CV (2015).

Unrolling Multiplicative NMF Iterations

Jonathan Le Roux et al. Deep NMF for Speech Separation. ICASSP 2015.
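To make the unrolling idea concrete, here is a minimal NumPy sketch (my own illustration under the unrolling scheme, not the exact Deep NMF architecture): each "layer" applies one multiplicative NMF update to the activations H, but with its own untied basis matrix W_k, which could then be trained discriminatively by backpropagation.

```python
import numpy as np

def unrolled_nmf_forward(V, W_layers, H0, eps=1e-8):
    """Forward pass of an unrolled NMF 'network': each layer runs one
    multiplicative update H <- H * (W^T V) / (W^T W H), with its own
    (learnable, untied) basis W_k.  V: (F, T) nonnegative input."""
    H = H0
    for W in W_layers:
        H = H * (W.T @ V) / (W.T @ (W @ H) + eps)   # one update per layer
    return H

# Toy usage with random nonnegative data
F, K, T, depth = 64, 10, 100, 5
V = np.abs(np.random.randn(F, T))
W_layers = [np.abs(np.random.randn(F, K)) for _ in range(depth)]
H = unrolled_nmf_forward(V, W_layers, np.abs(np.random.randn(K, T)))
```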

Speech Recognition

• RNNs: handle variable-length input, using context / memory for the current prediction
• A very deep neural network when unfolded in time, hence hard to train (a sketch follows below)
  (Image source: Li Deng and Dong Yu. Deep Learning: Methods and Applications.)
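A minimal NumPy sketch of a vanilla RNN forward pass makes the "very deep when unfolded" point explicit: the same weight matrices are applied at every time step, so a length-T sequence behaves like a T-layer network with tied weights.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h, h0):
    """Vanilla RNN forward pass. Unrolled over T time steps it behaves like a
    T-layer network with tied weights; repeated multiplication by W_hh is what
    makes gradients shrink or blow up. x_seq: (T, D); returns (T, H) hidden states."""
    h, hs = h0, []
    for x_t in x_seq:
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)   # same weights at every step
        hs.append(h)
    return np.stack(hs)

# Toy usage
T, D, H = 20, 8, 16
hs = rnn_forward(np.random.randn(T, D), 0.1 * np.random.randn(D, H),
                 0.1 * np.random.randn(H, H), np.zeros(H), np.zeros(H))
```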

Real-time conversation translation

Reinforcement Learning & More

Google DeepMind. Human-level control through deep reinforcement learning. Nature, Feb. 2015.
Google DeepMind. Neural Turing Machines. arXiv 2014.
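For intuition, here is a minimal sketch of the regression targets used in DQN for a sampled mini-batch of transitions; `target_network` is a placeholder for a frozen copy of the Q-network (mapping a batch of states to per-action Q-value estimates), not a specific library API.

```python
import numpy as np

def dqn_targets(rewards, next_states, dones, target_network, gamma=0.99):
    """DQN regression targets for a mini-batch:
    y = r                                      if the episode ended,
    y = r + gamma * max_a' Q_target(s', a')    otherwise."""
    q_next = target_network(next_states)       # (B, num_actions)
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)

# Toy usage with a dummy "network" returning random Q-values for 4 actions
B = 32
dummy_target = lambda s: np.random.randn(len(s), 4)
y = dqn_targets(np.random.randn(B), np.random.randn(B, 8),
                np.random.randint(0, 2, B).astype(float), dummy_target)
```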

[Figure: normalized score of DQN vs. the best linear learner on 49 Atari 2600 games, sorted from Video Pinball (~4,500%) down to Montezuma's Revenge, with games marked as at human level or above vs. below human level.]

Deep Learning

What are the challenges?

Convergence of Optimization

• Gradients explode or vanish; lower layers are hard to train
• Remedies:
  • ReLU: empirically faster convergence
  • Clever initialization (preserve variance / scale in each layer); see the sketch after this list
    • Xavier and variants: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv 2015.
    • Identity: Q. V. Le, N. Jaitly, G. E. Hinton. A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. arXiv 2015.
  • Memory gates: LSTM, Highway Networks (Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber. Highway Networks. arXiv 2015), etc.
  • Batch normalization: Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015.
• Many more tricks out there…
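A minimal sketch of the variance-preserving initializations mentioned above, assuming fully-connected layers (convolutional layers use the fan-in/fan-out of their filters).

```python
import numpy as np

def he_normal_init(fan_in, fan_out):
    """'He' initialization for layers followed by ReLU: variance 2 / fan_in,
    chosen so the variance of activations is roughly preserved across layers."""
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

def xavier_uniform_init(fan_in, fan_out):
    """Xavier/Glorot uniform initialization: U(-a, a) with a chosen so that
    Var(w) = 2 / (fan_in + fan_out), suited to symmetric activations like tanh."""
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-a, a, size=(fan_in, fan_out))
```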

Regularization: overfitting problems do exist in deep learning

• “Baidu overfitting ImageNet”: http://www.image-net.org/challenges/LSVRC/announcement-June-2-2015
• Data augmentation is commonly used in
  • computer vision: random translation, rotation, cropping, mirroring… (see the sketch after this list)
  • speech recognition: e.g. Andrew Y. Ng et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv 2015; ~100,000 hours (~11 years) of augmented speech data
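A minimal sketch of two label-preserving augmentations commonly used in vision (random cropping and horizontal mirroring); random translation and rotation follow the same pattern.

```python
import numpy as np

def augment_image(img, crop_size):
    """Return a randomly cropped and possibly mirrored view of an image.
    img: (H, W, C) array; crop_size: side length of the square crop."""
    h, w, _ = img.shape
    top = np.random.randint(0, h - crop_size + 1)
    left = np.random.randint(0, w - crop_size + 1)
    patch = img[top:top + crop_size, left:left + crop_size, :]
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1, :]          # mirror left-right
    return patch
```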

Regularization: overfitting problems do exist in deep learning

• Dropout (a minimal sketch follows below)
  • Intuition: units are forced to be robust; approximate model averaging.
  • Justification:
    • Stefan Wager, Sida Wang, and Percy S. Liang. “Dropout Training as Adaptive Regularization.” NIPS 2013.
    • David McAllester. A PAC-Bayesian Tutorial with A Dropout Bound. arXiv 2013.
  • Variations: DropConnect, DropLabel…

[Figures: dropout results on MNIST and TIMIT; figure source: http://winsty.net/talks/dropout.pptx]
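A minimal sketch of (inverted) dropout at training time; at test time the layer is left unchanged because the surviving activations were already rescaled by 1/(1-p).

```python
import numpy as np

def dropout_forward(x, p_drop=0.5, train=True):
    """'Inverted' dropout: at training time, zero each unit with probability
    p_drop and rescale the survivors by 1/(1-p_drop) so the expected
    activation matches test time; at test time, do nothing."""
    if not train or p_drop == 0.0:
        return x
    mask = (np.random.rand(*x.shape) >= p_drop) / (1.0 - p_drop)
    return x * mask
```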

Regularization: overfitting problems do exist in deep learning

• (Structured) sparsity comes into play
  • Computer vision: ConvNets (sparse connections with weight sharing)
  • Speech recognition: RNNs (time-index correspondence, weight sharing)
  • Unrolling: structure inherited from the unrolled algorithm
• Behnam Neyshabur, Ryota Tomioka, Nathan Srebro. Norm-Based Capacity Control in Neural Networks. COLT 2015.
• Q: is the sparsity pattern learnable?

Computation

• Hashing
  • e.g. K. Q. Weinberger et al. Compressing Neural Networks with the Hashing Trick. ICML 2015.
• Limited-precision numerical computing with stochastic rounding
  • Suyog Gupta et al. Deep Learning with Limited Numerical Precision. ICML 2015.
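A minimal sketch of stochastic rounding to a fixed-point grid, the key ingredient in the limited-precision training result above: the rounding is unbiased in expectation, so small gradient contributions are not systematically lost.

```python
import numpy as np

def stochastic_round(x, step):
    """Round values to a fixed-point grid of spacing `step`: round up with
    probability equal to the fractional distance from the lower grid point,
    so that E[stochastic_round(x, step)] == x (unlike round-to-nearest)."""
    scaled = np.asarray(x) / step
    lower = np.floor(scaled)
    prob_up = scaled - lower                      # fractional part in [0, 1)
    return (lower + (np.random.rand(*scaled.shape) < prob_up)) * step
```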

Deep Learning

Existing Empirical Analysis

Network Visualization

• Visualizing the learned filters
• Visualizing high-response input images
• Adversarial images
• Reconstruction (what kind of information is preserved)

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
Matthew D. Zeiler, Rob Fergus. Visualizing and Understanding Convolutional Networks. ECCV 2014.

[Figure caption (Zeiler & Fergus): “the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach.”]

Adversarial images for a trained CNN (or any classifier)

Christian Szegedy, …, Rob Fergus. Intriguing properties of neural networks. ICLR 2014.

• 1st column: original images.
• 2nd column: perturbations.
• 3rd column: perturbed images, all classified as “ostrich, Struthio camelus”.
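For intuition, here is a minimal sketch of constructing a small adversarial perturbation for a linear softmax classifier using the "fast gradient sign" idea from later work (Goodfellow et al., 2014); the paper cited above instead uses a box-constrained L-BFGS search, and for a deep net the gradient with respect to the input is obtained by backpropagation.

```python
import numpy as np

def adversarial_example(x, y_true, W, b, epsilon=0.01):
    """Perturb x slightly in the direction that increases the cross-entropy
    loss of the true class, for a linear softmax classifier with scores Wx+b."""
    scores = W @ x + b
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    grad_scores = probs.copy()
    grad_scores[y_true] -= 1.0             # d(cross-entropy)/d(scores)
    grad_x = W.T @ grad_scores             # gradient w.r.t. the input
    return x + epsilon * np.sign(grad_x)   # small, nearly imperceptible change
```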

Anh Nguyen, Jason Yosinski, Jeff Clune. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. CVPR 2015. http://www.evolvingai.org/fooling

See also supernormal stimuli for humans and animals: https://imgur.com/a/ibMUn

Reconstruction from each layer of a CNN

• Aravindh Mahendran, Andrea Vedaldi. Understanding Deep Image Representations by Inverting Them. CVPR 2015.
• Jonathan Long, Ning Zhang, Trevor Darrell. Do Convnets Learn Correspondence? NIPS 2014.

Learning to reconstruct (from a trained CNN)

• Alexey Dosovitskiy, Thomas Brox. Inverting Convolutional Networks with Convolutional Networks. arXiv:1506.02753, 2015.
• Learn a CNN to map from a layer's representation back into image space. Unlike auto-encoders, the existing CNN is trained discriminatively and then held fixed.
• Note that spatial information is still preserved to a fair extent even in the fully-connected (fc) layers.

Are deep nets easy to train?

• Ian J. Goodfellow, Oriol Vinyals, Andrew M. Saxe. Qualitatively characterizing neural network optimization problems. ICLR 2015.
• Anna Choromanska, …, Yann LeCun. The Loss Surfaces of Multilayer Networks. AISTATS 2015. Abstract (quoted): “We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and they are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits similar behavior as the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks where for the latter poor quality local minima have non-zero probability of being recovered. Finally, we prove that recovering the global minimum becomes harder as the network size increases and that it is in practice irrelevant as global minimum often leads to overfitting.”

See also https://charlesmartin14.wordpress.com/2015/03/25/why-does-deep-learning-work/

Deep vs Shallow (empirical performance)

• Lei Jimmy Ba, Rich Caruana. Do Deep Nets Really Need to be Deep? NIPS 2014. Train a shallow net to mimic a deeper one, i.e. train it on the soft labels produced by the deeper net.
• Does this imply that deep nets have a capacity similar to shallow nets, but are easier to train (from discriminative labels)?
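A minimal sketch of the mimic objective: the shallow "student" regresses the logits (pre-softmax values) produced by the deep "teacher", as in Ba & Caruana; matching softened probabilities is a closely related variant.

```python
import numpy as np

def mimic_loss(student_logits, teacher_logits):
    """L2 regression of the student's logits onto the teacher's logits,
    averaged over a mini-batch of shape (batch, num_classes)."""
    return 0.5 * np.mean(np.sum((student_logits - teacher_logits) ** 2, axis=1))
```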

Deep vs Shallow (capacity of the hypothesis space)

• Olivier Delalleau and Yoshua Bengio. Shallow vs. Deep Sum-Product Networks. NIPS 2011.
• Monica Bianchini and Franco Scarselli. On the complexity of shallow and deep neural network classifiers. ESANN 2014. Here B(·) is the sum of Betti numbers, a topological complexity measure of a set (applied to the decision region {x : f(x) ≥ 0}).

Deep Nets vs Kernel Methods

• Zhiyun Lu et al. How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets. arXiv:1411.4000, 2014.
  • Random Fourier features, MKL, parallel optimization… seems to require a lot of tuning, tricks, and manpower (as measured by the number of authors)
• Results on CIFAR-10. Note that ConvNets achieve much lower error (18%) than densely connected DNNs on this dataset.
• See also: Po-Sen Huang et al. Kernel Methods Match Deep Neural Networks On TIMIT. ICASSP 2014.
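A minimal sketch of the random Fourier feature construction (Rahimi & Recht) underlying this line of work: a linear model trained on z(X) approximates a kernel machine with the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).

```python
import numpy as np

def random_fourier_features(X, num_features, gamma):
    """Map X (n, d) to z(X) (n, num_features) so that z(x)^T z(y) ~= k(x, y)
    for the RBF kernel with bandwidth parameter gamma."""
    d = X.shape[1]
    W = np.sqrt(2.0 * gamma) * np.random.randn(d, num_features)   # frequencies
    b = np.random.uniform(0.0, 2.0 * np.pi, num_features)         # random phases
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
```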

Deep Learning

Existing Theoretical Analysis

Provable learning of random sparse networks

• Sanjeev Arora, Aditya Bhaskara, Rong Ge, Tengyu Ma. Provable Bounds for Learning Some Deep Representations. ICML 2014.
• In practice: layer-wise pretraining used to be popular, but was gradually abandoned as larger amounts of training data became available.

Learning 2- or 3-layer Nets with Quadratic Nonlinearity

• Roi Livni, Shai Shalev-Shwartz, Ohad Shamir. On the Computational Efficiency of Training Neural Networks. NIPS 2014.
• Notation in their results: t is the depth, n the number of nodes, L the constraint on the weight magnitudes, and σ(z) = z² the square activation function.

Learning Networks with 1 Hidden Layer

• Francis Bach. Breaking the Curse of Dimensionality with Convex Neural Networks. arXiv:1412.8690, 2014.
• Generalization bounds (approximation and estimation errors)
• Formulated as learning from a continuum of (infinitely many) basis functions

Learning Boolean Networks

• Dustin G. Mixon, Jesse Peterson. Learning Boolean functions with concentrated spectra. arXiv:1507.04319, 2015.
• Learning a 1-hidden-layer Boolean network whose Fourier spectrum is highly concentrated.
• See also https://dustingmixon.wordpress.com/2015/07/17/a-relaxation-of-deep-learning/

Deep Learning

Open Problems (?)

Depth?

• Do deeper networks give a richer function space than shallow networks?
  • Both empirical and theoretical analyses exist (see the previous sections)
• Are deeper networks easier to learn than shallow networks?
  • Statistically
  • Computationally
• Is there a trade-off between depth and other factors?
  • Empirically, people have started to explore training networks with hundreds of layers, although the current state-of-the-art networks typically have 10–30 layers.

Structure?

• Structure (sparse connections) seems to be a very important factor in many successful networks
• If the structure is unknown, can we learn it?
  • e.g. given samples from a statistical model defined by a sparsely connected deep network, can we estimate the sparsity pattern and the parameter values?
  • Learning unknown invariances

Regularization? Dropout?

• Can dropout help to discover the underlying sparsity structure?
• Other regularization techniques?

SGD?

• Why does SGD work on non-convex objective functions?
  • cf. the empirical and theoretical analyses of deep learning objective surfaces referenced in the previous sections.
• Is there an alternative algorithm that
  • has theoretical guarantees / justification?
  • has other nice properties? e.g.
    • easier to parallelize (SGD is sequential between mini-batches; see the sketch after this list)
    • biologically plausible (neuroscientists would like it)
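For reference, a minimal mini-batch SGD loop; `grad_fn` is a placeholder for the back-propagated mini-batch gradient, not a specific library API. The sequential dependence between updates is visible in the inner loop, which is what the parallelization question above is about.

```python
import numpy as np

def sgd(params, grad_fn, data, batch_size=128, lr=0.01, epochs=10):
    """Plain mini-batch SGD: each update uses the parameters produced by the
    previous one, so naive parallelization across batches is not possible."""
    n = len(data)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            params = params - lr * grad_fn(params, batch)
        # common variants add momentum, learning-rate decay, etc.
    return params
```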

Rectified Linear Unit (ReLU)

• ReLU is usually found to be better than sigmoid activation functions (converges faster, and to better solutions)
• Intuitions exist, but is there a rigorous justification?
• Can we characterize the properties of a “nice” activation function?
  • Other possible “good” activation functions?

Local minima or equivalent solutions

• Empirically, different random initializations lead to different solutions that are nevertheless almost equally good as measured by classification performance
• A huge number of equivalent solutions exist
  • For ReLU, rescaling two adjacent layers reciprocally does not change the final output (see the numerical check below)
  • Permuting the filter indices consistently throughout the network does not change the final output
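A small numerical check of the ReLU rescaling symmetry mentioned above: because relu(c*z) = c*relu(z) for any c > 0, scaling one layer's weights by c and the next layer's by 1/c leaves the network output unchanged.

```python
import numpy as np

def two_layer_relu(x, W1, W2):
    """Output of a two-layer network with a ReLU hidden layer (no biases)."""
    return np.maximum(0.0, x @ W1) @ W2

x = np.random.randn(5, 10)
W1 = np.random.randn(10, 20)
W2 = np.random.randn(20, 3)
c = 3.7
# Rescaling W1 by c and W2 by 1/c gives the same output (up to float rounding).
assert np.allclose(two_layer_relu(x, W1, W2),
                   two_layer_relu(x, c * W1, W2 / c))
```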

Other problems?

• Unsupervised learning
• Structured prediction
• Weakly supervised learning