Generating the Features for Category Representation using a Deep Convolutional Neural Network

Chen-Ping Yu^1 ([email protected]), Justin Maxfield^2, and Gregory J. Zelinsky^1,2
^1 Dept. of Computer Science, ^2 Dept. of Psychology, Stony Brook University. 2016.

Introduction

• Question: The search for object categories, categorical search, is characterized by a subordinate-level advantage in target guidance (e.g., time to target fixation) and a basic-level advantage in target verification. What are the underlying features that make these effects possible?
• Behavioral Data: Time-to-target (guidance) and subsequent verification times were measured using a categorical search task (N = 26).
• Hierarchical categories: We collected a 68-category hierarchical dataset, with 48 subordinate, 16 basic, and 4 superordinate categories.
• Dataset: 500 training and 50 validation images per category, for a total of 26,400 images over the 48 subordinate categories (the basic and superordinate sets are derived from the subordinates).

Categorical Search & Category-Consistent Features (CCFs)

• Category-Consistent Features (CCFs): visual features, learned by an unsupervised generative model, that represent an object category; they are defined as the common features that occur both frequently and reliably across the category's exemplars.
• SIFT BoW-CCFs: previous work [16] used a Bag-of-Words model over SIFT features as the base feature pool (1064-d); features with a high signal-to-noise ratio (SNR) were identified as the CCFs. This model predicted guidance and verification across hierarchical levels, but not at the level of individual categories.

[Figure: the SIFT Bag-of-Words histogram for the taxi category and the subset of features selected as its CCFs.]

• CNN-CCFs: For each object category, extract the neurons (conv units) that are highly and reliably activated by its exemplars, using a thresholded signal-to-noise ratio, as sketched below.
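Read concretely, the rule keeps units with a high mean response (frequent) and low response variability (reliable) over a category's training exemplars. Below is a minimal sketch of that thresholded-SNR selection; defining SNR as mean over standard deviation, and the cutoff of 2.0, are illustrative assumptions of ours rather than values given on the poster.

```python
import numpy as np

def select_ccfs(activations: np.ndarray, snr_threshold: float = 2.0) -> np.ndarray:
    """Pick category-consistent features: units that respond strongly
    (high mean) and reliably (low variability) across a category's exemplars.

    activations: (n_exemplars, n_units) unit responses to one category.
    Returns indices of units whose SNR exceeds the threshold.
    """
    mean = activations.mean(axis=0)             # "highly activated"
    std = activations.std(axis=0) + 1e-8        # "reliably activated" -> low std
    snr = mean / std                            # per-unit signal-to-noise ratio
    return np.flatnonzero(snr > snr_threshold)  # thresholded SNR (cutoff is illustrative)

# Example: 500 training exemplars, 1,152 conv units (AlexNet's conv-unit count).
rng = np.random.default_rng(0)
acts = rng.gamma(shape=2.0, scale=1.0, size=(500, 1152))
ccf_idx = select_ccfs(acts)
print(f"{ccf_idx.size} CCFs selected out of {acts.shape[1]} units")
```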

Deep Convolutional Neural Network Models (DCNNs)

• Biological plausibility: Most CNNs are trained to maximize classification accuracy, but they have also been used to predict behavioral and neural data [17]. We evaluated our VsNet against AlexNet and against a deepened, biologically-inspired version of HMAX [3] (Deep-HMAX).
• The three models:
  - AlexNet [1][2]: ~61m parameters, 1,152 conv units.
  - Deep-HMAX (inspired by [3]): ~76m parameters, 1,760 conv units.
  - Ventral-Stream Network (VsNet): ~72m parameters, 835 conv units.

[Figures: architecture diagrams for the three models, with per-layer kernel specifications and receptive-field (RF) sizes. AlexNet: conv layers of 64 11x11/4, 192 5x5/1, 384 3x3/1, 256 3x3/1, and 256 3x3/1 units (each BN + ReLU), interleaved 3x3/2 max-pooling, and fully connected layers; RFs grow from 7 px = 1.4º at layer 1 to 163 px = 32.6º. VsNet: per-area parallel conv populations (V1; V2; V4s/V4c; TEOs1/TEOs2a/TEOs2b/TEOc; TE1/TE2) joined by DepthConcat, with weak and strong by-pass connections. Also shown: layer-1 feature visualizations for AlexNet, Deep-HMAX, and VsNet, and the ventral-pathway schematic from [4].]

• Training: 1. Pre-train the networks on ImageNet for classification. 2. Fine-tune the networks on our dataset using multi-task learning: the shared pre-trained network feeds three classification branches, one each for the 48 subordinate, 16 basic, and 4 superordinate labels (the poster's diagram attaches unequal weights of 0.7, 0.24, and 0.06 to the three tasks). A sketch of this setup follows.
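Here is a minimal PyTorch sketch of that multi-task fine-tuning setup, assuming a backbone that returns a flat feature vector; the mapping of the poster's weights (0.7, 0.24, 0.06) onto the subordinate, basic, and superordinate losses is our assumption, not something the poster specifies.

```python
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Shared pre-trained backbone with one classifier per hierarchy level."""
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone
        self.sub = nn.Linear(feat_dim, 48)    # 48 subordinate labels
        self.basic = nn.Linear(feat_dim, 16)  # 16 basic labels
        self.super_ = nn.Linear(feat_dim, 4)  # 4 superordinate labels

    def forward(self, x):
        f = self.backbone(x)
        return self.sub(f), self.basic(f), self.super_(f)

def multitask_loss(logits, targets, weights=(0.7, 0.24, 0.06)):
    # Weighted sum of per-level cross-entropies. The weight-to-level
    # assignment here is an assumption, not confirmed by the poster.
    ce = nn.functional.cross_entropy
    return sum(w * ce(l, t) for w, l, t in zip(weights, logits, targets))
```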

Ventral-Stream Network Design Principles

• Ventral-stream area RF sizes: We pooled the human RF-size estimates from [4,5,6,7,8,9,10,11] for each area, based on 1º = 5 pixels.
• Variable RF sizes within an area: Because neurons within each ventral-stream area range in RF size, each area (conv layer) contains parallel convolutional populations with different kernel sizes, joined by depth-concatenation. The number of kernels in each parallel population depends on how closely that kernel's RF size matches the area's estimated RF sizes: closer matches receive a larger share of units.
• By-pass connections: V4 receives weak (foveal) by-pass inputs from V1 [12], while TEO and TE receive strong by-pass inputs from V2 and V4, respectively [13]. A sketch of one such parallel-population area appears below.
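To make the "parallel populations + DepthConcat" idea concrete, here is a minimal PyTorch sketch of one such area. The kernel sizes, strides, and the 21 + 22 + 21 = 64 channel split follow the poster's V1 diagram; the padding, input size, and optional by-pass wiring are our simplifications.

```python
import torch
import torch.nn as nn

class ParallelConvArea(nn.Module):
    """One ventral-stream 'area': parallel conv populations with different
    kernel (RF) sizes, depth-concatenated, then batch-normed and ReLU'd.
    Channel counts follow the poster's V1 diagram (21 + 22 + 21 = 64)."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.paths = nn.ModuleList([
            nn.Conv2d(in_ch, 21, kernel_size=3, stride=2, padding=1),
            nn.Conv2d(in_ch, 22, kernel_size=5, stride=2, padding=2),
            nn.Conv2d(in_ch, 21, kernel_size=7, stride=2, padding=3),
        ])
        self.bn = nn.BatchNorm2d(64)

    def forward(self, x, bypass=None):
        y = torch.cat([p(x) for p in self.paths], dim=1)  # DepthConcat
        y = torch.relu(self.bn(y))
        if bypass is not None:          # e.g., a weak V1 -> V4 by-pass input
            y = torch.cat([y, bypass], dim=1)
        return y

v1 = ParallelConvArea(in_ch=3)
out = v1(torch.randn(1, 3, 227, 227))
print(out.shape)  # torch.Size([1, 64, 114, 114])
```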

• Number of neurons (conv kernels) per area: In biological visual systems, neurons with the same selectivity are tiled across the visual field, much like a convolution with a single filter, especially in the early areas. The number of kernels per conv layer is therefore estimated from cortical surface areas [14,15]:

\#\text{conv\_kernels} \propto \frac{\text{Surface\_Area}}{\text{Duplication\_Factor}}, \qquad \text{Duplication\_Factor} = \log_2\!\left(\frac{\text{Visual\_Area}}{\text{RF\_Size}}\right), \qquad \text{Visual\_Area} = 1000.

A small worked example appears after the table below.
• hV4+LO1+LO2: The literature suggests that object shape and boundary are also processed by LO1 and LO2 [14]; we therefore combine them with hV4 into a single area.

Resulting per-area RF sizes, theoretical estimates (at 5.5º eccentricity) vs. VsNet:

Area       Theory        VsNet
V1         0.25º ~ 2º    < 2º
V2         0.5º ~ 4º     2º ~ 4º
hV4+LO     1.5º ~ 8º     4º ~ 6º
TEO        ~ 7º+         > 7º
TE         2.5º ~ 70º    ?º
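As an illustration of the kernel-count rule, the sketch below evaluates it for a few areas; the surface-area values and the proportionality constant (`scale`) are placeholders of ours, not the figures taken from [14,15].

```python
import math

def n_kernels(surface_area_mm2: float, rf_size: float,
              visual_area: float = 1000.0, scale: float = 0.05) -> int:
    """#conv_kernels ∝ Surface_Area / Duplication_Factor,
    where Duplication_Factor = log2(Visual_Area / RF_Size).
    'scale' sets the proportionality constant (our choice, for illustration)."""
    duplication = math.log2(visual_area / rf_size)
    return round(scale * surface_area_mm2 / duplication)

# Illustrative surface areas (mm^2): placeholders, not the values from [14,15].
for area, (sa, rf) in {"V1": (12000, 1.0), "V2": (10000, 2.5),
                       "hV4+LO": (6000, 6.0), "TEO": (4000, 15.0)}.items():
    print(area, n_kernels(sa, rf))
```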

Experiments and Results

• Exp #1: Previous work [16] modeled categorical guidance and verification over three levels of a hierarchy (subordinate, basic, and superordinate) using category-consistent features built from SIFT and Bag-of-Words. How would fixation prediction using these biologically-uninspired features compare to features learned by a CNN modeled on the ventral visual stream?
• Exp #2: How do fixation predictions from our Ventral-Stream Network (VsNet) compare to the AlexNet and Deep-HMAX DCNN models?
• CNN-CCFs: CNNs learn hierarchical features directly from images. Can the CCFs extracted from CNNs do better at predicting guidance and verification performance? Yes, even at the level of individual object categories!

ImageNet validation set accuracies:

            VsNet     Deep-HMAX   AlexNet
Top-1 Acc   62.73%    60.01%      57.26%
Top-5 Acc   84.66%    82.60%      80.61%

Fine-tuning validation set accuracies (68-category Top-1):

            VsNet     Deep-HMAX   AlexNet
Sub 48      93.79%    92.96%      92.92%
Basic 16    96.54%    96.33%      96.25%
Super 4     98.92%    99.00%      99.17%

The results show that the more biologically-inspired architectures significantly outperform AlexNet on ImageNet. Multi-task fine-tuning was also successful: all three levels reached near-perfect, human-comparable classification accuracies. A sketch of the Top-1/Top-5 computation follows.
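For reference, the Top-1 and Top-5 numbers in these tables are standard top-k accuracies. A minimal sketch, with random logits standing in for network outputs:

```python
import numpy as np

def topk_accuracy(logits: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(-logits, axis=1)[:, :k]      # indices of the k best classes
    hits = (topk == labels[:, None]).any(axis=1)   # true label among them?
    return hits.mean()

rng = np.random.default_rng(1)
logits = rng.normal(size=(50, 1000))   # e.g., 50 validation images, 1000 classes
labels = rng.integers(0, 1000, size=50)
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```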

• More CCFs = higher specificity = faster time-to-target [16]: the number of CCFs per category should therefore correlate negatively with time-to-target.

Negative Correlation Result: #CCFs vs. time-to-target (Spearman)

            VsNet                    Deep-HMAX*               AlexNet
            ρ        p-val   Layers  ρ        p-val   Layers  ρ        p-val   Layers
All 68      -.3189   .008    2       -.2089   .0873   2       -.0364   .7679   5
Sub 48      -.3674   .0102   2       -.1862   .2052   2       -.0317   .8306   1
Basic 16    -.2336   .3838   1       -.2492   .352    1       n/a      n/a     n/a
Super 4     -.6325   .5      1       -.4472   .6667   1       -.8      .3333   2

Positive Correlation Result: #CCFs vs. verification time (Spearman)

            VsNet                    Deep-HMAX*               AlexNet
            ρ        p-val   Layers  ρ        p-val   Layers  ρ        p-val   Layers
All 68      .3856    .0012   5       .2015    .0994   5       .3407    .0045   4
Sub 48      .4595    .001    5       .2672    .0664   5       .3593    .0121   4
Basic 16    .7251    .002    3,5     .5739    .0201   5       .6937    .0029   4
Super 4     1.0      .083    1,5     .9487    .1667   1       1.0      .083    3

* Because Deep-HMAX is deeper than the other two models, only its complex-cell layers were used for CCF extraction. Correlations with p > 0.05 are not statistically significant.
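The entries in these tables can be reproduced with a standard Spearman rank test. Below is a minimal sketch on synthetic data, assuming per-category CCF counts and mean behavioral times; the effect directions are built into the fake data to mirror the tables, and scipy.stats.spearmanr is the only real API used.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
n_ccfs = rng.integers(5, 60, size=48)  # CCF count per subordinate category

# Synthetic times (ms): guidance speeds up and verification slows as #CCFs grows.
time_to_target = 600 - 3.0 * n_ccfs + rng.normal(0, 40, 48)
verification = 400 + 2.0 * n_ccfs + rng.normal(0, 40, 48)

rho_g, p_g = spearmanr(n_ccfs, time_to_target)  # expect a negative rho
rho_v, p_v = spearmanr(n_ccfs, verification)    # expect a positive rho
print(f"guidance: rho={rho_g:.3f}, p={p_g:.4f}; "
      f"verification: rho={rho_v:.3f}, p={p_v:.4f}")
```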

Conclusion

• Brain-inspired networks are better at object category classification.
• They also better predict the guidance of attention to targets than AlexNet does.
• Our VsNet is the first deep network whose architecture is constrained, area by area, by the human ventral stream. Future DCNNs should continue to exploit neural constraints so as to better predict behavior.

References & Acknowledgments

[1] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012.
[2] A. Krizhevsky. One Weird Trick for Parallelizing Convolutional Neural Networks. arXiv, 2014.
[3] T. Serre, A. Oliva, and T. Poggio. A Feedforward Architecture Accounts for Rapid Categorization. PNAS, 2007.
[4] D. Kravitz, K. Saleem, C. Baker, L. Ungerleider, and M. Mishkin. The Ventral Visual Pathway: An Expanded Neural Framework for the Processing of Object Quality. Trends in Cognitive Sciences, 2013.
[5] D. Boussaoud, R. Desimone, and L. Ungerleider. Visual Topography of Area TEO in the Macaque. Journal of Comparative Neurology, 1991.
[6] S. Kastner, P. De Weerd, M. Pinsk, M. Elizondo, R. Desimone, and L. Ungerleider. Modulation of Sensory Suppression: Implications for Receptive Field Sizes in the Human Visual Cortex. Journal of Neurophysiology, 2001.
[7] A. Smith, A. Williams, and M. Greenlee. Estimating Receptive Field Size from fMRI Data in Human Striate and Extrastriate Visual Cortex. Cerebral Cortex, 2001.
[8] G. Rousselet, S. Thorpe, and M. Fabre-Thorpe. How Parallel is Visual Processing in the Ventral Pathway? Trends in Cognitive Sciences, 2004.
[9] B. Harvey and S. Dumoulin. The Relationship between Cortical Magnification Factor and Population Receptive Field Size in Human Visual Cortex: Constancies in Cortical Architecture. Journal of Neuroscience, 2011.
[10] J. Freeman and E. Simoncelli. Metamers of the Ventral Stream. Nature Neuroscience, 2011.
[11] M. Jenkin and L. Harris. Vision and Attention. Springer Science & Business Media, 2013.
[12] R. Greger and U. Windhorst. Comprehensive Human Physiology, Vol. 1: From Cellular Mechanisms to Integration. Springer, 1996.
[13] K. Tanaka. Mechanisms of Visual Object Recognition: Monkey and Human Studies. Current Opinion in Neurobiology, 1997.
[14] J. Larsson and D. Heeger. Two Retinotopic Visual Areas in Human Lateral Occipital Cortex. Journal of Neuroscience, 2006.
[15] D. Felleman and D. Van Essen. Distributed Hierarchical Processing in the Primate Cerebral Cortex. Cerebral Cortex, 1991.
[16] C.-P. Yu, J. Maxfield, and G. Zelinsky. Searching for Category-Consistent Features: A Computational Approach to Understanding Visual Category Representation. Psychological Science, 2016.
[17] C. Cadieu, H. Hong, D. Yamins, N. Pinto, D. Ardila, E. Solomon, N. Majaj, and J. DiCarlo. Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition. PLOS Computational Biology, 2014.

This work was supported by NIMH grant R01-MH063746 and NSF grants IIS1111047 and IIS1161876 to G.J.Z.
