Visualizing and Understanding Convolutional Networks

Visualizing and Understanding Convolutional Networks

Brian Huang

Outline – intro – approach – training details – visualization of convnet – experiment results – conclusion

Visualizing and Understanding Convolutional Networks - intro – CNNs are flourishing – but progress has been driven by benchmark results, not by the features themselves – there is no clear understanding of why they perform so well, or how they might be improved – this paper introduces a novel visualization technique that gives insight into the function of intermediate feature layers

Visualizing and Understanding Convolutional Networks - approach – visualization with a deconvnet – a method from "Adaptive deconvolutional networks for mid and high level feature learning (2011)" – originally an unsupervised learning method, used in this paper purely for visualization, with no learning involved – improves on the method in "Visualizing higher-layer features of a deep network (2009)", which computes the gradient of a given neuron's activation with respect to the input

Visualizing and Understanding Convolutional Networks - approach – visualization with a deconvnet – Unpooling – Rectification – Convolutional filtering – the whole procedure is similar to backpropagating an error signal with SGD, as sketched below
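A minimal sketch of one layer's reverse pass (assuming PyTorch; the function and argument names are illustrative, not from the paper's code), composing the three steps detailed on the following slides:

import torch.nn.functional as F

def deconv_layer(feat, switches, weight, pool_k=2, pool_s=2):
    """Reverse one conv layer: unpool -> rectify -> transposed filtering."""
    feat = F.max_unpool2d(feat, switches, kernel_size=pool_k, stride=pool_s)
    feat = F.relu(feat)                      # keep feature maps positive
    return F.conv_transpose2d(feat, weight)  # the layer's filters, transposed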

Unpooling – max-pooling is non-invertible – the switches (recorded locations of the maxima) are used for an approximate inverse – the larger the pooling size, the less faithful the reconstruction – all other positions are filled with 0
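As a concrete sketch (assuming PyTorch; the dummy tensor is illustrative), max-pooling with return_indices=True records the switches, and MaxUnpool2d restores each max to its recorded location, filling everything else with zeros:

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 3, 8, 8)               # dummy feature map
pooled, switches = pool(x)                 # switches: locations of the maxima
reconstructed = unpool(pooled, switches)   # maxima restored, zeros elsewhere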

Rectification – in AlexNet, the ReLU ensures that each layer's output activations are positive – so in the reverse pass the feature maps must likewise be kept positive – the "de-activation" is therefore no different from the activation: both simply apply ReLU

Convolutional filtering – the deconvnet uses transposed versions of the learned filters, i.e., each filter flipped vertically and horizontally – the descent step in backpropagation acts just like a convolution – the previous layer's values can be reconstructed from each unit's contribution to the next layer – the weights used in this reverse pass could also be initialized and learned, but no learning of the deconvnet is needed here – in this sense, deconvolution is just like convolution, as the sketch below shows
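A small sketch of the filtering step (PyTorch; shapes are illustrative): conv_transpose2d applies the transpose of a conv layer's weights, which is equivalent to convolving with each filter flipped in both spatial axes:

import torch
import torch.nn.functional as F

weight = torch.randn(16, 3, 5, 5)  # 16 filters of a conv layer (3 -> 16 channels)
feat = torch.randn(1, 16, 12, 12)  # rectified, unpooled feature map

# Deconvnet filtering: apply the same weights transposed (16 -> 3 channels),
# i.e. convolve with the filters flipped vertically and horizontally.
approx_input = F.conv_transpose2d(feat, weight)
print(approx_input.shape)          # torch.Size([1, 3, 16, 16])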

Visualizing and Understanding Convolutional Networks - training – an AlexNet-style model is trained – with some settings different from AlexNet: – 10 different 224x224 sub-crops per image – trained on a single GPU – each filter in the first convolutional layers whose RMS value exceeds a fixed radius is renormalized to that radius, to avoid a few features dominating due to the larger input
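A hedged sketch of the renormalization step (PyTorch; the helper name is mine; the paper fixes the radius at 10^-1):

import torch

def renormalize_filters(weight, radius=1e-1):
    """Rescale, in place, any filter whose RMS exceeds `radius` back to `radius`."""
    with torch.no_grad():
        # weight: (out_channels, in_channels, kH, kW); compute RMS per filter
        rms = weight.pow(2).mean(dim=(1, 2, 3)).sqrt()
        scale = torch.where(rms > radius, radius / rms, torch.ones_like(rms))
        weight.mul_(scale.view(-1, 1, 1, 1))

# called after each parameter update, e.g.:
# renormalize_filters(model.conv1.weight)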

Visualizing and Understanding Convolutional Networks - visualization – the features learned by the network are hierarchical: higher layers learn combinational features composed of the features learned in lower layers – feature evolution during training: higher layers need more epochs to converge – feature invariance: higher-layer features are more invariant to input transformations

hierarchical features – corners and edge/color conjunctions (lower layers) – similar textures (middle layers) – object parts, then entire objects (higher layers)

feature evolution during training – epochs [1, 2, 5, 10, 20, 30, 40, 64] – higher layers converge later – sudden jumps result from a different image producing the strongest activation (an effect of SGD)

feature invariance – input transformations have a significant influence on lower layers – higher layers show greater feature invariance

Visualizing and Understanding Convolutional Networks - visualization – visualization can help improve the network architecture – occlusion sensitivity of high-activation features – lower layers show stronger correspondence to the same specific object parts across different images

improve network architecture – smaller stride (2 vs 4) and smaller filter size (7x7 vs 11x11) in layer 1 – layer 1: more mid-frequency filters – layer 2: no aliasing artifacts, no dead features – an illustrative layer definition follows
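For concreteness, a sketch of the layer-1 change (PyTorch; the padding values are illustrative assumptions):

import torch.nn as nn

# AlexNet layer 1: large filters, large stride
conv1_alexnet = nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2)

# This paper's layer 1: smaller filters and stride, giving more
# mid-frequency filters and removing layer-2 aliasing
conv1_modified = nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1)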

occlusion sensitivity – does the model really use the information at the feature's own location, or does it exploit the surrounding context? – occlude different portions of the input with a gray patch and monitor the classifier output, as sketched below
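A minimal sketch of the occlusion experiment (PyTorch; `model`, the patch size, and the gray value are assumptions): slide a gray square over the image and record the probability of the true class at each position; a drop marks a region the model relies on:

import torch

def occlusion_map(model, image, label, patch=50, stride=25, gray=0.5):
    """True-class probability as a gray patch slides over the image."""
    model.eval()
    _, H, W = image.shape
    heat = []
    with torch.no_grad():
        for y in range(0, H - patch + 1, stride):
            row = []
            for x in range(0, W - patch + 1, stride):
                occluded = image.clone()
                occluded[:, y:y + patch, x:x + patch] = gray
                prob = model(occluded.unsqueeze(0)).softmax(dim=1)[0, label]
                row.append(prob.item())
            heat.append(row)
    return torch.tensor(heat)  # low values = occluded region was important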

correspondence analysis – unlike other recognition approaches, deep models have no explicit mechanism for establishing correspondence between object parts (e.g., the nose and eyes within a face) – but a deep model may implicitly build such a mechanism – to verify this, 5 images of frontal dog faces are randomly selected and the same part is occluded in each – method: compute the features before and after occlusion, take the difference of the two feature vectors and its sign, then sum the pairwise Hamming distances of these signs over all pairs of the 5 images

$$\Delta_l = \sum_{i,j=1,\; i \neq j}^{5} \mathcal{H}\big(\operatorname{sign}(\epsilon_i^l), \operatorname{sign}(\epsilon_j^l)\big), \qquad \epsilon_i^l = x_i^l - \tilde{x}_i^l$$

where $x_i^l$ and $\tilde{x}_i^l$ are the feature vectors at layer $l$ for the original and occluded images respectively, and $\mathcal{H}$ is the Hamming distance.

correspondence analysis – lower values indicate greater consistency in the change produced by the occlusion – in layer 5, the scores for the eyes and nose are lower than for other parts, showing that the eyes and nose correspond more strongly across images than other regions – this analysis demonstrates that the deep model does implicitly establish a correspondence mechanism between features – in layer 7, however, the scores for eyes, nose, and random regions are similar, perhaps because the higher layers discriminate using more complex feature combinations – this again supports the earlier observation that "higher layers have higher feature invariance"
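A sketch of the consistency score (feature extraction omitted; taking the Hamming distance as the fraction of differing sign entries is my normalization assumption):

import torch

def correspondence_score(orig_feats, occl_feats):
    """Sum of pairwise Hamming distances between sign(x_i - x~_i).

    orig_feats, occl_feats: (5, D) feature vectors at layer l for the
    5 original images and the 5 identically occluded images.
    """
    eps = torch.sign(orig_feats - occl_feats)  # (5, D), entries in {-1, 0, +1}
    n = eps.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += (eps[i] != eps[j]).float().mean().item()
    return total  # lower = changes are more consistent across the images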

Visualizing and Understanding Convolutional Networks - experiment – improves on AlexNet – learns generic features from ImageNet that transfer well to Caltech-101 and Caltech-256, with reasonable results on PASCAL VOC 2012, a multi-object classification task that differs from ILSVRC – Caltech-101 and Caltech-256 can thus be trained with significantly less data
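A sketch of the transfer setup (using torchvision for convenience; data loading and the training loop are omitted): keep the ImageNet-trained features fixed and train only a new classifier on the small dataset:

import torch.nn as nn
from torchvision import models

model = models.alexnet(weights="IMAGENET1K_V1")  # ImageNet-pretrained
for p in model.parameters():
    p.requires_grad = False                      # freeze the generic features

# New final layer (trainable by default) for Caltech-101's 101 classes
model.classifier[-1] = nn.Linear(4096, 101)
# ...then train only this classifier on the small dataset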

Visualizing and Understanding Convolutional Networks - conclusion – a novel visualization approach – reveals properties of the features learned by CNNs: hierarchy, invariance, evolution over epochs, etc. – an additional way to fine-tune the network architecture

Some thoughts – using feature visualization to judge model quality may not be rigorous enough: a few excellent features might be obtained while the overall feature quality does not improve, so an impartial quantitative evaluation method is needed – training on a very large dataset yields good performance on other, smaller datasets

Reference – https://arxiv.org/pdf/1311.2901v3.pdf – https://courses.cs.washington.edu/courses/cse590v/14au/cse590v_dec5_DeepVis.pdf – implementation: http://kvfrans.com/visualizing-features-from-a-convolutional-neural-network/ – Stanford CS231n: http://cs231n.stanford.edu/slides/winter1516_lecture9.pdf

Thanks for listening

Any Questions?