Generating the Features for Category Representation using a Deep Convolutional Neural Network

Chen-Ping Yu¹ ([email protected]), Justin Maxfield², and Gregory J. Zelinsky¹,²
¹Dept. of Computer Science, ²Dept. of Psychology, Stony Brook University (2016)
Experiments and Results

Exp #1: Previous work [16] modeled categorical guidance and verification over three levels of a hierarchy (subordinate, basic, and superordinate) using category-consistent features built from SIFT and Bag-of-Words. How would fixation prediction using these biologically uninspired features compare to prediction using features learned by a CNN modeled after the ventral visual stream?

Exp #2: How do fixation predictions from our Ventral-Stream Network (VsNet) model compare to those from the AlexNet and Deep-HMAX DCNN models?

CNN-CCFs: CNNs learn hierarchical features directly from images. Can the CCFs extracted from CNNs do better at predicting guidance and verification performance? Yes, even at the level of individual object categories!

Deep Convolutional Neural Network Models (DCNNs)
[Architecture diagrams: per-layer convolution specifications (number of kernels, kernel size, stride, BN+ReLU, pooling) and receptive-field sizes in pixels and degrees for AlexNet, Deep-HMAX, and VsNet.]
ImageNet validation set accuracies

Model     | Top1 Acc | Top5 Acc
VsNet     | 62.73%   | 84.66%
Deep-HMAX | 60.01%   | 82.6%
AlexNet   | 57.26%   | 80.61%
Fine-tuning validation set accuracies (68-Cat Top1 Acc)

Level    | VsNet  | Deep-HMAX | AlexNet
Sub 48   | 93.79% | 92.96%    | 92.92%
Basic 16 | 96.54% | 96.33%    | 96.25%
Super 4  | 98.92% | 99.00%    | 99.17%
Multi-task fine-tuning was successful: all three levels achieved near-perfect, human-comparable classification accuracies.
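As a minimal sketch of this setup (assuming PyTorch; the backbone, feature dimension, and loss weights are illustrative assumptions, not the exact training configuration), multi-task fine-tuning attaches one classification head per category level to a shared pre-trained network:

```python
import torch
import torch.nn as nn

class MultiLevelClassifier(nn.Module):
    """Shared pre-trained backbone with one classifier per category level."""
    def __init__(self, backbone, feat_dim, n_sub=48, n_basic=16, n_super=4):
        super().__init__()
        self.backbone = backbone                        # pre-trained trunk (ImageNet)
        self.head_sub = nn.Linear(feat_dim, n_sub)      # 48 subordinate labels
        self.head_basic = nn.Linear(feat_dim, n_basic)  # 16 basic labels
        self.head_super = nn.Linear(feat_dim, n_super)  # 4 superordinate labels

    def forward(self, x):
        feats = self.backbone(x)                        # shared features
        return self.head_sub(feats), self.head_basic(feats), self.head_super(feats)

def multitask_loss(outputs, targets, weights=(1.0, 1.0, 1.0)):
    """Sum of per-level cross-entropy losses; the weights are a hypothetical choice."""
    ce = nn.CrossEntropyLoss()
    return sum(w * ce(o, t) for w, o, t in zip(weights, outputs, targets))
```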
Negative Correlation Result: #CCF vs Time-to-target (guidance)

Level    | VsNet (ρ, p-val, Layers) | Deep-HMAX* (ρ, p-val, Layers) | AlexNet (ρ, p-val, Layers)
All 68   | -.3189, .008, 2          | -.2089, .0873, 2              | -.0364, .7679, 5
Sub 48   | -.3674, .0102, 2         | -.1862, .2052, 2              | -.0317, .8306, 1
Basic 16 | -.2336, .3838, 1         | -.2492, .352, 1               | n/a
Super 4  | -.6325, .5, 1            | -.4472, .6667, 1              | -.8, .3333, 2
Behavioral Data: Time-to-target (guidance) and subsequent verification times were measured using a categorical search task (N=26).
Results show that more biologically-inspired architectures significantly outperform AlexNet.
Deep-HMAX (inspired by [3]), ~76m parameters, 1,760 conv units
More CCFs = higher specificity = faster time-to-target [16].
[Figure of the ventral visual pathway, from [4].]
CNN-CCFs: for each object category, extract the units (convolutional filters) that are highly and reliably activated by the category's exemplars, using a thresholded signal-to-noise ratio (SNR).
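A minimal sketch of this selection step (assuming the per-exemplar unit activations have already been computed; the SNR threshold here is a placeholder, not the value used on the poster):

```python
import numpy as np

def select_ccf_units(activations, snr_threshold=2.0):
    """activations: (n_exemplars, n_units) responses of one layer's units to all
    exemplars of a single category. Returns indices of that category's CCF units."""
    mean_act = activations.mean(axis=0)        # how strongly each unit responds on average
    std_act = activations.std(axis=0) + 1e-8   # how variable that response is
    snr = mean_act / std_act                   # high SNR = responds highly AND reliably
    return np.where(snr > snr_threshold)[0]
```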
Categorical Search & Category-Consistent Features (CCFs)
Dataset: 500 training and 50 validation images per category, for a total of 26,400 images over the 48 subordinate categories (the basic- and superordinate-level categories are derived from the subordinates).
2. Fine-tune the networks on our dataset using multi-task learning.
AlexNet [1][2], ~61m parameters, 1,152 conv units
Hierarchical categories: we collected a 68-category hierarchical dataset, with 48, 16, and 4 subordinate, basic, and superordinate categories, respectively.
1. Pre-train the networks using ImageNet for classification.
Biological plausibility: most CNNs are trained to maximize classification accuracy, but they have also been used to predict behavioral and neural data [17]. We evaluated our VsNet model against AlexNet and a deeper, biologically-inspired variant of HMAX [3] (Deep-HMAX).
Introduction

Question: The search for object categories (categorical search) is characterized by a subordinate-level advantage in target guidance (e.g., time to target fixation) and a basic-level advantage in target verification. What are the underlying features that make these effects possible?
Positive Correlation Result: #CCF vs Verification time

Level    | VsNet (ρ, p-val, Layers) | Deep-HMAX* (ρ, p-val, Layers) | AlexNet (ρ, p-val, Layers)
All 68   | .3856, .0012, 5          | .2015, .0994, 5               | .3407, .0045, 4
Sub 48   | .4595, .001, 5           | .2672, .0664, 5               | .3593, .0121, 4
Basic 16 | .7251, .002, 3,5         | .5739, .0201, 5               | .6937, .0029, 4
Super 4  | 1.0, .083, 1,5           | .9487, .1667, 1               | 1.0, .083, 3
* Since Deep-HMAX is "deeper" than the other two models, only its complex-cell layers are used for CCF extraction. Correlations with p > 0.05 are not statistically significant.
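For reference, the analyses above are standard rank correlations between per-category CCF counts and per-category behavioral times; a minimal sketch of how such a test can be computed (assuming SciPy; the arrays below are placeholder values, not the poster's data):

```python
import numpy as np
from scipy.stats import spearmanr

# One entry per object category (hypothetical placeholder values):
n_ccfs = np.array([35, 52, 41, 60, 28, 47])                       # CCFs extracted from one layer
time_to_target = np.array([412., 365., 390., 340., 455., 378.])   # mean time-to-target (ms)

rho, p_val = spearmanr(n_ccfs, time_to_target)   # rank correlation across categories
print(f"Spearman rho = {rho:.4f}, p = {p_val:.4f}")
```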
Ventral-Stream Network (VsNet), ~72m parameters, 835 conv units
Category-Consistent Features (CCFs): visual features that represent an object category, learned by an unsupervised generative model and defined as the common features that occur frequently and reliably across a category's exemplars.

SIFT BoW-CCFs: previous work [16] used a Bag-of-Words model over SIFT features as the base feature pool (1064-d); features with a high signal-to-noise ratio (SNR) were identified as the CCFs. This model predicted guidance and verification across hierarchical levels, but not at the level of individual object categories.
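A simplified sketch of such a BoW pipeline (assuming OpenCV SIFT and scikit-learn k-means; this illustrates the general approach, not the exact pipeline of [16]):

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def bow_histograms(image_paths, n_words=1064):
    """SIFT descriptors -> k-means visual vocabulary -> per-image BoW histograms."""
    sift = cv2.SIFT_create()
    per_image_desc = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        per_image_desc.append(desc)
    vocab = MiniBatchKMeans(n_clusters=n_words).fit(np.vstack(per_image_desc))
    hists = [np.bincount(vocab.predict(d), minlength=n_words) for d in per_image_desc]
    return np.asarray(hists, dtype=float)       # (n_images, n_words)

# The SIFT BoW-CCFs are then the visual words whose histogram counts have a high
# signal-to-noise ratio across a category's exemplars (same idea as select_ccf_units above).
```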
Ventral-Stream Network Design Principles

RF size estimates per ventral-stream area (theoretical range | at 5.5º eccentricity):
  V1:          0.25º ~ 2º  |  < 2º
  V2:          0.5º ~ 4º   |  2º ~ 4º
  hV4+LO1+LO2: 1.5º ~ 8º   |  4º ~ 6º
  TEO:         ~ 7+º       |  > 7º
  TE:          2.5º ~ 70º  |  ?
Ventral-stream area RF sizes: we pooled the human RF-size estimates from [4,5,6,7,8,9,10,11] for each area, using a conversion of 1º = 5 pixels.

Variable RF sizes within an area: because neurons within each ventral-stream area span a range of RF sizes, we created parallel convolutional kernel populations of different sizes within each area (conv layer). The number of kernels in each parallel population depends on how closely that kernel's RF size matches the area's estimated RF sizes: closer matches receive a larger share of the units.

hV4+LO1+LO2: the literature suggests that object shape and boundary are also processed by LO1 and LO2 [14]; we therefore combine them with V4 into a single layer.

By-pass connections: V4 receives weak (foveal) bypass inputs from V1 [12], while TEO and TE receive strong bypass inputs from V2 and V4, respectively [13].
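A minimal PyTorch-style sketch of one such stage, with parallel kernel populations and an optional bypass input (the channel counts, kernel sizes, and the way the bypass is merged are illustrative assumptions, not the published VsNet configuration):

```python
import torch
import torch.nn as nn

class ParallelConvArea(nn.Module):
    """One 'visual area': parallel conv populations with different kernel sizes,
    concatenated along the channel dimension (DepthConcat)."""
    def __init__(self, in_ch, channels=(32, 52, 52), kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, ch, k, stride=1, padding=k // 2),
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
            )
            for ch, k in zip(channels, kernel_sizes)
        ])

    def forward(self, x, bypass=None):
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        if bypass is not None:                         # e.g., a weak V1->V4 or strong V2->TEO input
            out = torch.cat([out, bypass], dim=1)      # bypass must match the spatial size
        return out
```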
[Example: SIFT BoW features and the selected SIFT-CCFs for the "taxi" category.]
Number of neurons (conv kernels) per area: in biological visual systems, neurons with the same selectivity are tiled across the visual field, much like a convolution operation with a single filter, especially in the early areas. The number of kernels per conv layer is therefore estimated from cortical surface areas [14,15]:

#conv_kernels ∝ Surface_Area / Duplication_Factor, where Duplication_Factor = log₂(Visual_Area / RF_Size) and Visual_Area = 1000.
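A small worked sketch of this allocation rule (the surface-area and RF-size inputs here are placeholders, not the published values):

```python
import math

def n_conv_kernels(surface_area, rf_size, visual_area=1000.0, scale=1.0):
    """Kernels per area: proportional to cortical surface area, discounted by the
    duplication factor log2(visual_area / rf_size) implied by tiling small RFs."""
    duplication_factor = math.log2(visual_area / rf_size)
    return round(scale * surface_area / duplication_factor)

# Hypothetical comparison: a large early area with small RFs gets a big discount,
# while a smaller late area with large RFs gets a small one.
print(n_conv_kernels(surface_area=1100, rf_size=7))    # -> 154
print(n_conv_kernels(surface_area=300, rf_size=160))   # -> 113
```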
[Layer 1 feature visualizations for AlexNet, Deep-HMAX, and VsNet.]
Conclusion

Brain-inspired networks are better at object-category classification.
They also better predict the guidance of attention to targets than AlexNet does. Our VsNet is, to our knowledge, the first deep network whose architecture is designed directly after the ventral visual stream. Future DCNNs should continue to exploit neural constraints so as to better predict behavior.
References & Acknowledgments

[1] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012.
[2] A. Krizhevsky. One Weird Trick for Parallelizing Convolutional Neural Networks. arXiv, 2014.
[3] T. Serre, A. Oliva, and T. Poggio. A Feedforward Architecture Accounts for Rapid Categorization. PNAS, 2007.
[4] D. Kravitz, K. Saleem, C. Baker, L. Ungerleider, and M. Mishkin. The Ventral Visual Pathway: An Expanded Neural Framework for the Processing of Object Quality. Trends in Cognitive Sciences, 2013.
[5] D. Boussaoud, R. Desimone, and L. Ungerleider. Visual Topography of Area TEO in the Macaque. Journal of Comparative Neurology, 1991.
[6] S. Kastner, P. De Weerd, M. Pinsk, M. Elizondo, R. Desimone, and L. Ungerleider. Modulation of Sensory Suppression: Implications for Receptive Field Sizes in the Human Visual Cortex. Journal of Neurophysiology, 2001.
[7] A. Smith, A. Williams, and M. Greenlee. Estimating Receptive Field Size from fMRI Data in Human Striate and Extrastriate Visual Cortex. Cerebral Cortex, 2001.
[8] G. Rousselet, S. Thorpe, and M. Fabre-Thorpe. How Parallel is Visual Processing in the Ventral Pathway? Trends in Cognitive Sciences, 2004.
[9] B. Harvey and S. Dumoulin. The Relationship between Cortical Magnification Factor and Population Receptive Field Size in Human Visual Cortex: Constancies in Cortical Architecture. Journal of Neuroscience, 2011.
[10] J. Freeman and E. Simoncelli. Metamers of the Ventral Stream. Nature Neuroscience, 2011.
[11] M. Jenkin and L. Harris. Vision and Attention. Springer Science & Business Media, 2013. (Book)
[12] R. Greger and U. Windhorst. Comprehensive Human Physiology, Vol. 1: From Cellular Mechanisms to Integration. Springer, 1996. (Book)
[13] K. Tanaka. Mechanisms of Visual Object Recognition: Monkey and Human Studies. Current Opinion in Neurobiology, 1997.
[14] J. Larsson and D. Heeger. Two Retinotopic Visual Areas in Human Lateral Occipital Cortex. Journal of Neuroscience, 2006.
[15] D. Felleman and D. Van Essen. Distributed Hierarchical Processing in the Primate Cerebral Cortex. Cerebral Cortex, 1991.
[16] C.-P. Yu, J. Maxfield, and G. Zelinsky. Searching for Category-Consistent Features: A Computational Approach to Understanding Visual Category Representation. Psychological Science, 2016.
[17] C. Cadieu, H. Hong, D. Yamins, N. Pinto, D. Ardila, E. Solomon, N. Majaj, and J. DiCarlo. Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition. PLOS Computational Biology, 2014.
This work was supported by NIMH grant R01-MH063746 and NSF grants IIS1111047 and IIS1161876 to G.J.Z.