Enhanced Biologically Inspired Model for Object Recognition


Yongzhen Huang, Student Member, IEEE, Kaiqi Huang, Senior Member, IEEE, Dacheng Tao, Member, IEEE, Tieniu Tan, Fellow, IEEE, and Xuelong Li, Senior Member, IEEE

Abstract—The biologically inspired model (BIM) proposed by Serre et al. presents a promising solution to object categorization. It emulates the process of object recognition in primates' visual cortex by constructing a set of scale- and position-tolerant features whose properties are similar to those of the cells along the ventral stream of the visual cortex. However, the BIM can be further improved in two respects: mismatches caused by dense inputs and random feature selection due to the feedforward framework. To solve or alleviate these limitations, we develop an enhanced BIM (EBIM) in the following two ways: 1) removing uninformative inputs by imposing sparsity constraints and 2) applying a feedback loop to middle-level feature selection. Each aspect is motivated by relevant psychophysical research findings. To show the effectiveness of the EBIM, we apply it to object categorization and conduct empirical studies on four computer vision data sets. Experimental results demonstrate that the EBIM outperforms the BIM and is comparable to state-of-the-art approaches in terms of accuracy. Moreover, the new system is about 20 times faster than the BIM.

Index Terms—Biologically inspired model (BIM), feedback, object recognition, sparseness.

I. INTRODUCTION

OBJECT categorization plays an important role in many applications related to computer vision, e.g., video surveillance [4], [5], image and video retrieval [6]–[8], web content analysis [9], human-computer interaction [10], and biometrics [11], [12]. In general, object categorization is a difficult task in computer vision because of the variability in

illumination, scale, rotation, deformation, and clutter, as well as the complexity and variety of backgrounds. It is also challenging to recognize object categories under large intraclass variation, i.e., when intraclass samples have different appearances. In addition, it is difficult to model the relationship of interclass samples, e.g., jeeps, cars, and vans are of different styles but can be categorized into an identical class.

Despite the above challenges, the last three decades have witnessed great development in object categorization, and a large number of related algorithms have been proposed, some of which are discussed below. Traditional appearance-based approaches mainly use global low-level visual features such as gray value, color, shape, and texture [13]–[17]. These methods do not consider local discriminative information and are sensitive to lighting conditions, object poses, clutter, and occlusions. Local feature-based methods combine interest point detectors [18]–[21] and local descriptors [22]–[26] with spatial information. Part-based models [27]–[29] match particular patches to objects of interest through various searching schemes. In this framework, it is challenging to robustly segment and find the meaningful parts, so the spatial relationships of meaningful parts cannot be duly modeled. The original bag-of-features scheme [30] is efficient for recognition, but it ignores the spatial relationship of features, and thus it is hard to represent the geometric structure of the object class or to distinguish between foreground and background features.

Recent research findings in brain cognition and computer vision demonstrate that visual cognitive models are valuable in enhancing the performance of object recognition. For example, Serre et al. [2], [3] developed a biologically inspired model (BIM) for object recognition by emulating the mechanism of primates' visual cortex. In experiments on several public databases, the BIM performs comparably to state-of-the-art approaches. It is an initial but promising computational model for mimicking object recognition in the cortex of primates and deserves further investigation. In particular, the BIM can be enhanced in the following two aspects:

Case 1) First, input images are convolved with Gabor filters of various scales and orientations to increase selectivity. Then, each convolved image is matched with a large number of stored prototypes at every position and scale to enhance invariance. Such dense inputs contain much noise and cause mismatches in later processing.

Case 2) Second, the BIM applies a feedforward framework that blindly selects features for combination.


In this framework, a C2 value is computed by taking a "max" over all corresponding S2 units associated with a prototype (an image patch). The meaning of "max" can be understood as the best match between the S2 units and a given patch. Because these prototypes are randomly selected, the reliability of the match depends on using a large number of prototypes, so the computational cost is very high. For example, on the CalTech5 database, the BIM requires 1000 features to obtain stable performance, and thus it takes around 30 s to process an image (500 × 500) using the original MATLAB code [31] on a PC with an Intel Core 2 2.4-GHz CPU and 4-GB RAM.

To deal with the aforementioned limitations, we develop an enhanced BIM (EBIM) and improve the original model in two aspects: 1) adding sparsity constraints to the S1 units of the BIM and 2) adopting a feedback procedure to replace the feedforward framework. It is worth noting that these developments are originally motivated by relevant psychophysical findings and can also be explained from the viewpoint of computer vision. Here, we give a brief description. First, the sparsity constraints remove a large part of the input pixels (most of them noise) in the early stage of the system. This sparsification can reduce many mismatches in the later computation. Second, the feedback procedure improves the ability to choose the most effective patches for generating S2 units. It removes a large number of redundant responses and greatly enhances the speed of the system.

To justify the proposed developments, we apply the EBIM to object categorization and evaluate it in both accuracy and speed. Empirical studies on the CalTech5 database, the MIT-CBCL Street Scene database, the GRAZ database, and the PASCAL Visual Object Classes Challenge 2007 show that the EBIM outperforms the BIM in terms of accuracy and is competitive with top-level algorithms. Moreover, the EBIM is more than 20 times faster than the BIM: it processes more than 50 images (128 × 128) per second for object categorization and is ready for most real-time applications.

The rest of this paper is organized as follows. In Section II, we briefly introduce the BIM, describe two of its particular limitations in detail, and review representative extensions. In Section III, we elaborate the EBIM and show how the problems of the BIM are solved in the EBIM. Section IV provides empirical studies on four public data sets. Section V discusses the theoretical insights via a comparison with other improvements to the BIM. Finally, Section VI concludes the paper.

Fig. 1. Example of generating the C1 unit from two adjacent S1 units. Each point in the C1 image is the maximum point of two local areas from the corresponding two S1 units. θ and S correspond to the orientation and the scale of Gabor filters.

II. BIM: LIMITATIONS AND EXTENSIONS

The BIM [2], [3] consists of four layers of computational units: S1, C1, S2, and C2. We first describe the operations and limitations associated with these four layers in detail and then review representative extensions of the BIM.

A. BIM and Its Limitations

S1 Units: The units in the S1 layer correspond to simple cells in the primates' visual cortex. An initial input image is convolved with different Gabor filters to produce the S1 layer. Each Gabor filter is the product of an elliptical Gaussian envelope and a complex plane wave:

F(x, y) = exp(−(x0² + γ²y0²)/(2σ²)) × cos(2πx0/λ)    (1)

x0 = x cos θ + y sin θ,    y0 = −x sin θ + y cos θ    (2)

where the ranges of x and y are associated with the scales of the Gabor filters and θ controls the orientations. The Gabor filter is similar to the receptive field profiles of mammalian cortical simple cells [32] and has good orientation and frequency selectivity for image processing.

C1 Units: The C1 units describe complex cells in the visual cortex. To generate a C1 unit, the BIM pools over S1 units using a maximum operation that keeps the maximum response of the S1 units in a local area:

r = max_{i∈s} xi    (3)

where xi is the response of an S1 unit in a local area s, e.g., a 4 × 4 region, and r is the response of a C1 unit. This process can be considered a down-sampling operation over the S1 images (see Fig. 1).

S2 Units: The S2 units describe the similarity between C1 images and prototypes via a convolution operation. An S2 image is calculated by

S2(i, j, k) = exp(−β · conv(C1(j, k), Pi))    (4)

where conv denotes the convolution operation, C1(j, k) is the afferent C1 image with a specific scale j and a specific orientation k, Pi is a patch randomly sampled from the C1 images of all positive training samples, and β defines the sharpness of the exponential function.

C2 Units: A C2 value is the global maximum response of a group of S2 images over locations, scales, and orientations:

C2(i) = max(S2(i, :, :)).    (5)

Since each prototype corresponds to a group of S2 images, the BIM produces N C2 values for each input image, where N is the number of stored prototypes.
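To make the four layers concrete, the following is a minimal NumPy/SciPy sketch of (1)–(5). It is illustrative Python rather than the original MATLAB or C++ implementation, and the filter sizes and the parameters γ, σ, λ, and β are placeholder assumptions, not the tuned values used by Serre et al.

```python
# A minimal sketch of the four BIM layers, Eqs. (1)-(5). Parameter values
# below are illustrative assumptions, not the tuned values of the paper.
import numpy as np
from scipy.signal import convolve2d
from scipy.ndimage import maximum_filter

def gabor(size, theta, gamma=0.3, sigma=4.0, lam=5.6):
    """Gabor filter of Eqs. (1)-(2): elliptical Gaussian times a cosine wave."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x0 = x * np.cos(theta) + y * np.sin(theta)
    y0 = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(x0**2 + gamma**2 * y0**2) / (2 * sigma**2)) * np.cos(2 * np.pi * x0 / lam)

def s1_layer(image, sizes=(7, 9), thetas=(0, np.pi/4, np.pi/2, 3*np.pi/4)):
    """S1: convolve the input with Gabor filters of several scales/orientations."""
    return {(s, t): convolve2d(image, gabor(s, t), mode='same') for s in sizes for t in thetas}

def c1_layer(s1, pool=4):
    """C1: local max pooling, Eq. (3), followed by down-sampling (see Fig. 1)."""
    return {k: maximum_filter(v, size=pool)[::pool, ::pool] for k, v in s1.items()}

def s2_layer(c1, patch, beta=1.0):
    """S2: similarity between C1 images and one stored prototype patch, Eq. (4)."""
    return {k: np.exp(-beta * convolve2d(v, patch, mode='valid')) for k, v in c1.items()}

def c2_unit(s2):
    """C2: global max over positions, scales, and orientations, Eq. (5)."""
    return max(float(v.max()) for v in s2.values())

image = np.random.rand(128, 128)   # stand-in for a gray-level input image
patch = np.random.rand(4, 4)       # stand-in for a sampled C1 prototype
print(c2_unit(s2_layer(c1_layer(s1_layer(image)), patch)))
```

Repeating the S2/C2 step for each of the N stored prototypes yields the N C2 values described above.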


Fig. 2. Framework of the EBIM. The upper part illustrates the basic framework of the BIM. The bottom part shows our proposed sparsity constraints and the feedback to middle-level features. More details can be found at the beginning of Section III.

Feedforward Scheme: The BIM is a feedforward procedure, which follows the theory accounting for the processing in the first 100–200 ms in the ventral stream of the primate visual cortex [33], [34]. However, feedback does exist in later stages of recognition, and the BIM could, in principle, be used as the front end of a visual system, as part of a prediction-verification loop [35]. Wolf et al. [36] theoretically verified the feasibility of using a simple feedback framework in the BIM. Moreover, without verification from feedback loops to remove redundant patches, the BIM has to randomly sample a large number (about 1000–5000) of patches to achieve stable performance, which greatly increases the mismatches and causes heavy computation; e.g., it takes about 30 s to process a 500 × 500 image using the original BIM code [31]. For practical recognition applications, this is unacceptable.

B. Representative Extensions

Due to the great theoretical value of the BIM, many extensions have been developed recently. Mutch and Lowe [37] improved the BIM in three biologically plausible ways using sparsity, lateral inhibition, and feature selection. Bileschi and Wolf [38] proposed four new image features inspired by the Gestalt principles of continuity, symmetry, closure, and repetition. These two extensions enhance recognition accuracy, but the heavy computational load is still a big problem. For example, it takes several seconds to process and classify an image (128 × 128) on a 2-GHz Intel Pentium server in [37]. The system presented in [38] is slower, requiring approximately 80 s to process an image (128 × 128); thus, it is far from practical applications. An analysis of Mutch and Lowe's extension is detailed in the next section. Wolf et al. [36] enumerated some possible organization structures based on the BIM, but the feedback in their work is restricted to the C1 units, and the relative improvements are limited. Jimenez and de la Blanca [39] evaluated the performance of the BIM by tuning some of the model parameters and comparing different categorization procedures: Boosting, Support Vector Machines, and JointBoosting classifiers. Although this work performs well, it is an empirical study rather than a theoretical improvement to the BIM. Meyers and Wolf [40] modified the BIM in two main aspects: adding a center-surround process to each scale band to extract the S1 features and replacing the initial S2 feature with a linear combination of the initial C1 features. This is the first successful effort to extend the BIM to biometrics, and the new feature (termed S2FF [40]) achieves excellent performance in face recognition.

Although there are many extensions of the BIM, our work is inspired by different biological and neurological evidence. Moreover, sufficient experiments demonstrate that the EBIM not only increases classification accuracy but also greatly reduces the computation, which makes the BIM applicable to real-time applications. This is a clear advantage over other extensions.

III. ENHANCED BIM

According to the discussion in Section II, the original BIM can be improved in two aspects: the dense inputs and the blind feature selection of the feedforward framework. In this section, we propose an enhanced BIM (EBIM) to address these limitations, respectively, by imposing sparsity constraints and by using a feedback procedure for effective feature selection.

Fig. 2 shows the framework of the EBIM. The input image is first convolved with Gabor filters of different parameters (orientations and scales) by (1) and (2). The filtered images are then processed by the sparsity constraints to produce the S1 units and the C1 units. Afterward, the C1 units are convolved with patches by (4). These patches are randomly selected from the C1 units of all positive training samples. Each element of a C2 histogram is the maximum point of the corresponding S2 image with a particular orientation and scale. Thus, if a group of S2 units (produced by one patch) has two scales and four orientations, the corresponding C2 histogram is eight-dimensional. Each histogram is sent to a Support Vector Machine (SVM) for classification.
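The eight-dimensional C2 histogram can be made concrete with a short sketch. This is illustrative NumPy, not the paper's C++ implementation; `s2_images` is an assumed dictionary mapping each (scale, orientation) pair to the S2 response image produced by one patch.

```python
import numpy as np

def c2_histogram(s2_images):
    """One 8-D C2 histogram for a single patch: unlike the BIM's single
    global maximum, Eq. (5), the EBIM keeps one maximum per
    (scale, orientation) pair, here 2 scales x 4 orientations = 8 bins."""
    keys = sorted(s2_images)                           # fix a bin order
    return np.array([s2_images[k].max() for k in keys])
```

Each such histogram then trains one SVM, so the eight responses of a patch are combined with learned rather than fixed weights.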


Fig. 3. Illustration of the effect of the sparsity constraints on the S1 units. α is the parameter defined in (6); α = 0 means that there are no sparsity constraints. These figures show that a proper sparsity constraint can remove noise while preserving useful information.

Thus, N patches generate N SVMs, which serve as weak learners in the Boosting procedure. In the feedback process, we choose and combine SVMs with different weights according to their performance on the training samples to construct the final decision.

A. Sparsity of Input Information

The dense S1 units in the BIM bring high computational costs to the system: every pixel is processed in later stages. Besides, such dense inputs are likely to introduce mismatches because the responses to noise are retained for matching. In fact, only a small part of the image information is useful for recognition in most cases. Research from neuroscience [41] indicates that a sparse coding strategy for natural images is sufficient to account for the characteristics of the receptive fields of simple cells in the mammalian primary visual cortex: spatially localized, oriented, and bandpass (selective to structure at different spatial scales). Inspired by this work, we propose sparsity constraints for the BIM.

Unlike the BIM, which computes every pixel of an image, the sparsification in the EBIM concerns only interest points and their neighbors. In particular, we compute the horizontal and vertical gradients over the original S1 image and retain special interest points, each of which satisfies the following condition:

|Fx(i)| + |Fy(i)| ≥ (α/n) Σ_{k=1}^{n} (|Fx(k)| + |Fy(k)|)    (6)

where Fx and Fy are the horizontal and vertical gradients calculated by the gradient operators [−1 0 1] and [−1 0 1]′, respectively, n is the number of pixels in the image of the S1 layer, and α controls the strength of the sparsity. The proposed sparsity constraints remove points with low gradients, which are uninformative pixels in most cases. Besides the interest points, we also retain their neighbors, which are one pixel away from the special interest points. Fig. 3 shows the results of the sparsity constraints with different α.

We choose the gradient rather than DoG or Laplace filters as the basis for the sparsity for two reasons: 1) DoG and Laplace filters are second-order filters, so they are more time-consuming than the gradient filter; and 2) in the early stage of the EBIM, the Gabor filters have already produced responses similar to those of second-order filters, so DoG and Laplace filters are unlikely to be superior to the gradient filter.
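As a concrete illustration, a sparsified S1 response can be obtained roughly as follows. This is a NumPy/SciPy sketch of (6), not the paper's C++ code; α = 3.0 anticipates the value fixed in Section IV-A.

```python
# A minimal sketch of the sparsity constraint of Eq. (6), assuming the S1
# response is given as a 2-D floating-point array.
import numpy as np
from scipy.ndimage import binary_dilation, convolve

def sparsify_s1(s1_image, alpha=3.0):
    # Horizontal and vertical gradients with the operators [-1 0 1] and its transpose.
    kx = np.array([[-1, 0, 1]])
    fx = convolve(s1_image, kx, mode='nearest')
    fy = convolve(s1_image, kx.T, mode='nearest')
    energy = np.abs(fx) + np.abs(fy)
    # Keep interest points whose gradient energy exceeds alpha times the mean...
    mask = energy >= alpha * energy.mean()
    # ...and also keep their immediate (one-pixel) neighbors.
    mask = binary_dilation(mask, iterations=1)
    return s1_image * mask
```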

We compare the performance of these three filters in sparsification; the experimental results (shown in Fig. 6) support this analysis.

The most expensive step is actually the convolution operation from the C1 layer to the S2 layer, where thousands of patches are convolved with the C1 images. Our sparsity operation is conducted prior to this step. Thus, the computational cost is reduced because the input images (C1 images) are largely sparse in the convolution operation. The experiments in Fig. 9(a) show the speedup due to sparsity.

Mutch and Lowe [37] presented a different strategy for sparsity constraints, in which only the strongest response over the four orientations of the S2 units is kept. This strategy makes the S2 units less sensitive to local clutter and thus improves generalization. However, it can lose useful cues, especially when the strongest response differs greatly from the other responses. In contrast, our sparsity strategy removes uninformative pixels (those with low gradients); it is embedded at an earlier stage (i.e., the S1 units) and is more efficient than the method by Mutch and Lowe.

B. Feedback Framework

The BIM, a feedforward system, consists of four levels, and each level of the hierarchy is only used to produce the subsequent one. The main support for the feedforward framework is that recognition tasks are mostly finished in the first 100–200 ms of visual perception, and this interval mainly involves feedforward processing in the primate visual cortex [42]–[45]. The experiments in these papers established an upper bound on how fast categorical decisions can be made in the human visual system and suggested that there is no time for eye movements or shifts of attention. Therefore, the supporters argued that categorical decisions could only be implemented by a feedforward mechanism.

The feedforward framework, however, is not suitable for applying high-level information to the object classification task. Hochstein and Ahissar [46] put forward the reverse hierarchy theory (RHT). They argued that visual information initially travels through the feedforward visual hierarchy, then reacts at higher levels, and finally reaches lower levels via feedback connections, forming a reverse hierarchy. The inclusion of cognitive factors in object classification is also supported by Murphy and Medin's research on the similarity of objects [47]. They claimed that the similarity of objects is not an absolute quality but rather a relative quality defined by feature selection.

From the viewpoint of computer vision, the object classification task is more complex than a feedforward mechanism for information processing. Without a feedback procedure for information selection and combination, the BIM requires a large number of patches from positive samples in the C1 layer for matching. This operation leads to a huge computational cost. Besides, the random sampling procedure selects many uninformative patches. In the procedure of generating the S2 layer, all patches are used to match the C1 layer of the input image, so these uninformative patches are likely to bring many mismatches.


Fig. 4. Illustration of constructing SVM classifiers. Each patch group Pi (with four images corresponding to four orientations) is convolved with the C1 units (with eight images corresponding to two scales and four orientations). Every image in the C1 units finally generates a response in the later procedure. Thus, each C2 histogram has eight bins.

Usually, these uninformative patches are sampled from the background.

Based on the above analysis, we propose a feedback scheme to replace the random patch selection stage in the BIM. The main motivation of our feedback procedure is that different patches play different roles in classifying objects. Unlike the BIM, which blindly uses a large number of patches for classification, we consider that a very small number of patches is sufficient for effective classification. Our feedback procedure finds the most discriminative patches as well as their different contributions to the classification task.

The feedback scheme is constructed in the manner of Boosting [48], [49]. It is adaptive in the sense that subsequent classifiers are built to emphasize previously misclassified samples by increasing their weights. We choose the Support Vector Machine (SVM) as the weak classifier in Boosting. The SVM is a powerful classifier that achieves a fast convergence rate in the Boosting procedure. Apart from the speedup obtained by combining Boosting and SVMs, this ensemble can reduce the imbalance problem in the SVM and generalize better than a single SVM.

Using patches Pi (i = 1, 2, . . . , N) selected from the C1 units of all positive samples, we construct N SVM classifiers (each patch corresponds to a classifier). For example, in Fig. 4, for a patch Pi, the system processes every input image to output a feature vector, i.e., the corresponding row of the C2 units. All of these feature vectors are used to train the ith SVM classifier (SVMi). This operation improves the informativeness of the C2 units: a C2 value is computed by taking a "max" over an S2 image with a particular orientation and scale. Therefore, for an input image, the EBIM generates a feature vector of dimension N × S × O, where N is the number of patches, S is the number of scales, and O is the number of orientations.

After generating the C2 units, the algorithm begins a standard Boosting loop: first choose the best SVM classifier according to the error rate on the training data, then compute the classifier weight and update the sample weights. This procedure is repeated until convergence. At the testing stage, for an input image, we first compute its C1 units and then convolve the C1 units with all patches selected in the training stage to generate the C2 units, which are sent to all SVMs chosen in the training stage for classification. Finally, the EBIM combines the SVMs' outputs with the weights learned in the Boosting training process to make a final decision.

The feedback procedure chooses the most informative patches with different weights according to their contributions to classification.

Fig. 5. Sample images from CalTech5 database. The last image is a background image.

It significantly reduces the number of prototypes for matching (from about ten thousand to about 100 in all our experiments), and thus it greatly speeds up the classification procedure and reduces mismatches. The flowchart of the Boosting-based feedback procedure is shown in Algorithm 1.¹

Algorithm 1 Boosting-Based Feedback

FP: false-positive rate
FN: false-negative rate
FPMAX: maximum acceptable false-positive rate
FNMAX: maximum acceptable false-negative rate
N: the number of samples
n: the number of patches selected
(xi, yi): the feature vector and the label of the ith sample
wi: the weight of the ith sample

Initialize:
• Initialize the weights of all training samples: wi = 1/N, i = 1, . . . , N.
• Construct SVM classifiers hi (i = 1, 2, . . . , n) using all patches. The patches are selected from the C1 layer of all positive samples. Each patch corresponds to one SVM classifier.

LOOP
• Compute all SVM classifiers' error rates on the weighted training samples.
• Keep the classifier ht with the lowest error rate in the two-class classification.
• Compute the weight of ht.
• Evaluate the performance (FP and FN) on the initial training samples (all samples' weights are equal) using the weighted combination of all kept classifiers.
• If FP < FPMAX under the condition FN < FNMAX, end LOOP; else, re-weight the training samples.
END

Output: the selected patches, the SVM classifiers, and the classifier weights λ.

¹The multiclass problem in our classification can be decomposed into C(N, 2) = N(N − 1)/2 two-class classification problems, where N is the number of classes. Pairwise SVMs [50], [51] with the majority voting rule are then utilized for multiclass object recognition.
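The loop above can be rendered compactly in Python. The following fragment is an illustrative sketch of Algorithm 1 for the two-class case, not the authors' C++ code: it assumes the per-patch C2 features are precomputed (`c2[i]` is an N_samples × 8 array for patch i, as in Fig. 4) with labels in {−1, +1}, and it substitutes scikit-learn's LinearSVC for the linear LIBSVM used in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def boosting_feedback(c2, y, fp_max=0.05, fn_max=0.05, max_rounds=100):
    y = np.asarray(y)
    w = np.full(len(y), 1.0 / len(y))              # sample weights w_i = 1/N
    # One weak SVM per candidate patch, trained on its 8-D C2 vectors.
    svms = [LinearSVC().fit(f, y) for f in c2]
    preds = [clf.predict(f) for clf, f in zip(svms, c2)]
    chosen, lambdas = [], []
    for _ in range(max_rounds):
        # Keep the classifier with the lowest weighted error rate.
        errs = np.array([np.sum(w * (p != y)) for p in preds])
        t = int(errs.argmin())
        err = float(np.clip(errs[t], 1e-10, 1 - 1e-10))
        lam = 0.5 * np.log((1.0 - err) / err)      # AdaBoost-style classifier weight
        chosen.append(t)
        lambdas.append(lam)
        # Evaluate FP/FN of the weighted vote on the equally weighted samples.
        votes = sum(l * preds[k] for k, l in zip(chosen, lambdas))
        out = np.where(votes >= 0, 1, -1)
        fp = np.mean(out[y == -1] == 1)            # false-positive rate
        fn = np.mean(out[y == 1] == -1)            # false-negative rate
        if fp < fp_max and fn < fn_max:
            break
        # Re-weight: emphasize the samples that the kept classifier missed.
        w = w * np.exp(lam * (preds[t] != y))
        w = w / w.sum()
    return chosen, lambdas, svms
```

In the multiclass experiments, this two-class routine is applied pairwise with majority voting, as described in footnote 1.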

Fig. 6. Accuracy comparison among different sparsity strategies: gradient filters, Laplace filters, and DoG filters.

It is worth emphasizing that the proposed Boosting-based feedback is totally different from the Boosting used in [3]. In [3], Boosting is used to select and combine the features, i.e., the C2 units, for classification. In the proposed EBIM, Boosting implements the feedback procedure, which directly affects the S2 units by choosing meaningful patches. Because each patch corresponds to at least eight C2 units (see Fig. 4), our patch selection is much more powerful than the feature selection used in [3]. The EBIM requires only a small number of patches (usually 10–100) to obtain high accuracy. On the contrary, all patches affect the final decision in [3] (usually, the BIM needs 1000–5000 patches for stable performance). As a consequence, the Boosting-based feedback component makes the EBIM about 20 times faster than the BIM in classification.

In addition, we compare our strategy with the one by Mutch and Lowe [37], which simply drops final features (i.e., C2 units) with small weights by directly setting them to zero (in contrast, the original BIM keeps and combines these low-weight features by Boosting). By repeating this procedure for several rounds, the number of selected features is decreased. Our feedback procedure directly affects the S2 units by choosing meaningful patches; because each patch corresponds to at least eight C2 units, our patch selection is much more powerful than the feature selection in [3] and [37]. Moreover, there are two limitations in the feature selection scheme used in [37]. First, their feature selection algorithm simply removes low-weight features by setting their weights to zero, and this procedure must be repeated for several rounds to obtain good performance. However, it is difficult to determine the number of iteration rounds, because there is no algorithm to guarantee convergence, and the authors did not give any theoretical or experimental analysis of how many rounds are sufficient for stable performance. Second, it is slow for object classification compared with the EBIM. According to [37], their algorithm typically selects at least 1000 features from several thousand to ten thousand original features after several rounds, and it takes several seconds to process and classify a testing image on a 2-GHz Intel Pentium server. This is because their feature selection cannot select discriminative patches that greatly reduce the redundant features. Our proposed feedback procedure usually chooses 50–100 patches and takes 0.02 s to process an image (128 × 128) for classification. This reduction of the computational cost makes the BIM applicable to real-time applications.

IV. EXPERIMENTAL RESULTS

We compare the EBIM with other leading algorithms on four public image databases: the CalTech5 database [52], the MIT-CBCL Street Scene database [53], the GRAZ database [54], and the PASCAL Visual Object Classes Challenge 2007 [55].

All the experiments are conducted on a personal computer with an Intel Core 2 2.4-GHz CPU and 4-GB RAM. The proposed algorithm is implemented with C++, OpenCV [56], and IPP [57].

The parameters in our experiments include four orientations (0°, 45°, 90°, 135°) and four scales (from 7 × 7 to 13 × 13 with an interval of 2) for the Gabor filters, corresponding to four layers of S1 units and two layers of C1 units (each layer contains four images corresponding to the different orientations). The number of initial patches is detailed in each of the experiments. The SVM used is the linear LIBSVM [51].

A. CalTech5

Object categorization on the CalTech5 database [52] is a relatively simple task. We employ this database to evaluate each of the two developments and compare the performance among the EBIM, the BIM, and the SIFT-based algorithm [3]. This database contains five classes of objects: frontal faces, motorcycles, rear cars, airplanes, and leaves. Fig. 5 shows sample images of each category. In this experiment, we follow the original experimental method in [58]: the images of each category are randomly divided into two equal parts for training and testing, respectively. All experiments are repeated ten times, and the performance measure is the average EER.²

The experiments are designed as follows:

1) Since α is the only adjustable parameter in the EBIM, we initially test the influence of different α on classification accuracy. In particular, gradient, Laplace, and DoG filters are utilized as three sparsity strategies. Fig. 6 shows that gradient filters perform slightly better than Laplace filters and outperform DoG filters. The performance is robust over a wide range of α. Thus, in the following experiments, we adopt the gradient-filter-based sparsity strategy and fix α = 3.0.

2) Fig. 7 compares the EBIM against the BIM and SIFT for object classification and shows that the EBIM is consistently better than both. Especially when the number of features is small (about 10–100), the EBIM clearly outperforms the BIM and SIFT. In addition, 50–100 features are sufficient for the EBIM to obtain stable performance for fast object classification.

3) Fig. 8 evaluates each improvement by comparing the EBIM against the EBIM with one improvement removed.

²In all experiments of this paper, EER means the detection rate at the equal-error-rate point of the ROC curve.
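For reference, this measure can be computed from raw classifier scores as follows; the fragment is an illustrative NumPy sketch, with `scores` and binary `labels` (1 for positive, 0 for negative) as assumed inputs.

```python
import numpy as np

def eer_detection_rate(scores, labels):
    """Detection (true-positive) rate at the ROC point where the
    false-positive rate equals the false-negative rate."""
    order = np.argsort(-scores)                  # sweep the threshold high -> low
    lab = np.asarray(labels, dtype=float)[order]
    tpr = np.cumsum(lab) / lab.sum()             # true-positive rate
    fpr = np.cumsum(1 - lab) / (1 - lab).sum()   # false-positive rate
    i = np.argmin(np.abs(fpr - (1 - tpr)))       # FPR ~= FNR = 1 - TPR
    return tpr[i]
```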


Fig. 7. Accuracy comparison among EBIM, BIM and SIFT [3].

Fig. 8. Accuracy evaluation for each improvement. (a) The first row compares the EBIM against the EBIM without the sparsity constraints. (b) The second row compares the EBIM against the EBIM without the feedback procedure.

Fig. 9. Speed evaluation for each improvement. (a) The first row compares the EBIM against the EBIM without the sparsity constraints. (b) The second row compares the EBIM against the EBIM without the feedback procedure.

• The first row compares the EBIM against the EBIM without the sparsity constraints. It shows that the sparsity constraints consistently improve the classification accuracy as the number of features varies from 10 to 1000. The experimental results indicate that the proposed sparsity constraints can stably enhance the BIM.

• The second row compares the EBIM against the EBIM without the proposed feedback procedure and shows that the feedback procedure consistently increases the object classification accuracy. When the number of features is large, the improvement is slight, because a sufficiently large number of features can reflect most properties of the object, but the computational cost is very high. As the number of original features increases, the improvement becomes apparent, because our feedback procedure can select discriminative patches and optimally combine them for subsequent classification.

Fig. 10. Sample images from the MIT-CBCL Street Scene database. From left to right, the categories are cars, people, and bikes.


TABLE I
Object detection results obtained by several state-of-the-art methods in the experiments on the MIT-CBCL Street Scene database. "tp@fp = fn" denotes the true-positive rate when the false-positive rate equals the false-negative rate. "tp@fp = .01" denotes the true-positive rate when the false-positive rate is set to 1%. Results for HoG, C1, and C1+Gestalt are taken from [38]. The accuracy of the BIM is taken from [3], and the computation time of the BIM is based on our C++ implementation of the BIM algorithm.

4) Fig. 9 shows that both the sparsity constraints and the feedback procedure help reduce the time cost of object classification.

• The first row shows the reduction of the computational cost obtained by the sparsity constraints. It proves that sparse images take less time than dense images in the convolution operation. When the number of features is small, the computational cost is not reduced, because the added cost of sparsification exceeds the savings from convolving a sparse image rather than a dense one. As the number of features increases, the superiority of the sparsity becomes obvious.

• The second row demonstrates that the feedback procedure significantly reduces the computational cost, because it directly reduces the number of patches, which is approximately proportional to the whole computation of the BIM. When the number of patches is small, the speedup is not noticeable, because the EBIM requires nearly all original patches to achieve convergence of the Boosting-based feature selection. When the number of patches is large, e.g., 1000, our system finally retains only a small fraction of the original patches, e.g., 50. Thus, the speed can be improved by about 20 times. According to Serre's description in [2], the BIM requires about 1000–5000 features to reach the plateau; in this case, the speed is greatly improved. For example, the BIM takes approximately 0.6 s to process an image (320 × 240) when randomly sampling 1000 C2 features using the C++ code (implemented by us) and about 30 s using the MATLAB code (implemented by Serre et al. [31]). For the same task, the EBIM needs about 0.025 s using the C++ program.

B. MIT-CBCL Street Scene Database

The MIT-CBCL Street Scene database [53] consists of three classes of objects: cars, pedestrians, and bicycles. All images have the same resolution (128 × 128). Example images of each category are shown in Fig. 10. The training and testing images are already divided in the database, each set containing 1000 images. To conduct this experiment, we use 1500 initial patches. Table I shows the performance of the BIM [3], HoG [26], C1 [3], [38], C1+Gestalt [38] (a successful extension of the BIM), and the EBIM.

The EBIM achieves the best performance for the classes of cars and bicycles and is comparable to C1+Gestalt and HoG for the class of pedestrians. Moreover, the EBIM significantly reduces the computation; e.g., it takes 0.02 s to process an image, compared to 0.5 s for the BIM.

C. Caltech-101

The Caltech-101 database contains 101 object classes plus a background class. There are about 40 to 800 images per category, and most categories have more than 50 images. The size of each image is around 300 × 200. To conduct this experiment, we use 1500 patches with the best parameters learned in the CalTech5 experiments. The results reported here are the average and standard deviation, taken over all 101 classes, of the object recognition performance obtained from ten independent trials. In each trial, 15 images are sampled at random for training and 50 images are sampled at random for testing. The multiclass SVM classifier is applied for classification. Using this protocol, the performance (52.1 ± 1.02%) is slightly higher than that of the conference version (49.8 ± 1.2%). Our result outperforms the original BIM (44 ± 1.14%) and Mutch and Lowe's version (51%). It is worth noting that the Boosting-based feedback algorithm changes slightly when using the multiclass SVM: in each round of the Boosting LOOP in Algorithm 1, to choose the best patch, we replace the "lowest error rate in the two-class classification" with the "lowest error rate in the multiclass classification."

D. GRAZ Database

The GRAZ database [54], built by Opelt et al., comprises two challenging data sets. The GRAZ-01 data set contains three classes at various locations, scales, and viewpoints. Fig. 11 shows some sample images of each category. To decrease the dependence on background context for classification, they built the GRAZ-02 data set, in which the backgrounds of images in all categories are similar. In addition, they increased the complexity of object appearances and added a new category of images. Fig. 12 shows some example images.

For the GRAZ-01 data set, we follow the protocol in [54]: 100 positive samples and 100 negative samples are randomly chosen as training samples, and 50 other positive samples and 50 other negative samples are chosen as testing samples.


Fig. 11. Sample images from GRAZ-01. From left to right, the categories are bikes, people and background.

Fig. 13. Recall-precision curves of several approaches on GRAZ-01. We compare the EBIM with three other approaches reported in [54]: Moment Invariants with affine invariant interest point detection, SIFT with DoG interest point detection, and Similarity-Measure-Segmentation (SM) described by intensity distributions, on the GRAZ-01 database.

TABLE III
Performance comparison of several approaches on GRAZ-02. The measures are EER and AUC (area under the ROC curve). The first three rows of performance are reported in [54].

Fig. 12. Sample images from GRAZ-02. From left to right, the categories are bikes, persons, cars, and background.

TABLE II
Performance comparison of several approaches on GRAZ-01. The measures are EER and AUC (area under the ROC curve). The first four rows of performance are reported in [54].

In these experiments, we use 1500 initial patches. The experimental results on the GRAZ-01 database are shown in Table II and Fig. 13. The EBIM achieves the best performance in all cases.

For the GRAZ-02 data set, we also adopt the strategy used in [54]: 150 positive samples and 150 negative samples are selected at random as training samples, and 75 other positive samples and 75 other negative samples are randomly selected as testing samples. In these experiments, we use 1500 initial patches. The EBIM outperforms the other approaches in all classes, as shown in Table III and Fig. 14.

We also provide a comparison with other recent algorithms. Mutch and Lowe [59] used the ground-truth localization for training to offset the smaller amount of training data (50 images). With these caveats, their whole-image classification results (EER) are 80.5% for bikes, 81.7% for cars, and 70.1% for people on the GRAZ-02 data set.

Fig. 14. Comparison of several approaches on the GRAZ-02 database. All results except those of the EBIM are reported in [54].

Hegazy and Denzler [60] used boosted color local features for object recognition and achieved 80.0% for bikes, 78.62% for cars, and 84% for people on the GRAZ-02 data set. Zhang and Dietterich [61] learned visual dictionaries and decision lists for object recognition; their experimental results on the GRAZ-01 data set are 76.5% for bikes and 71.7% for persons.

E. VOC07 Database

Finally, we test our algorithm on the PASCAL Visual Object Classes Challenge 2007 (VOC07) [55], which is one of the most challenging databases for object classification, detection, and segmentation. Its goal is to recognize objects from 20 visual object classes in realistic scenes (i.e., without a segmentation step). For the object categorization task, there are 5011 images for training and 4952 images for testing. Each competitor can choose a part or all of the categories to predict the presence/absence of a particular class in a test image. The performance measure is the average precision [55].

We carry out object classification experiments based on the proposed EBIM for all 20 categories and evaluate its performance against the 17 other popular algorithms reported in [55]. For general applications, e.g., a simple two-class image classification task, the EBIM can achieve promising performance in both accuracy and speed with hundreds of patches (see Sections IV-A, B, and D). The VOC07 data set is a very challenging object classification database: images in the same class vary greatly in viewpoint, illumination, pose, and scale, so many patches are needed to capture the local patterns of objects for high performance. In the VOC07 experiment, we use 5000 initial patches for the Boosting-based feedback.

Fig. 15 shows the performance of the EBIM and the 17 participants' algorithms. The X-axis denotes the 20 categories: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining-table, dog, horse, motorbike, person, potted-plant, sheep, sofa, train, and tv-monitor. The top two algorithms, INRIA-genetic and INRIA-flat [62], perform consistently, i.e., they usually achieve the best performance in all categories. The other algorithms behave differently, i.e., different methods outperform the others for some specific classes. For example, the UVA-MCIP algorithm [55] ranks fifth in the chair class but tenth in the bird class. In comparison with these 17 algorithms, the EBIM (the red points in Fig. 15) ranks third overall, and its average classification precision is 0.563.

Fig. 15. Comparison between the EBIM and the competitors' performance on the VOC07 database. The X-axis is the index of the 20 object classes. The Y-axis measures the average precision [55]. The red stars show the performance of the EBIM, and the blue dots show the performance of all the competitors. We use 5000 initial patches.

V. DISCUSSION

It is necessary to emphasize that the EBIM is biologically inspired, i.e., the motivation for each improvement comes from relevant research findings in cognitive science, but the success of the EBIM is established by performance comparisons, in both accuracy and speed, on computer vision tasks. In this section, we first compare the EBIM against a well-known improvement to the BIM proposed by Mutch and Lowe [37] and then provide a computer vision perspective on the EBIM.

A. Comparison Against the Improvement by Mutch and Lowe

The proposed sparsification and feature selection are different from those introduced by Mutch and Lowe and from the standard feature selection used in the BIM, respectively. Details are given below.

1) Sparsification:
• The original BIM processes all pixels in all orientations in the S1 layer and the C1 layer and then sums the responses of all orientations in the S2 layer.
• The algorithm introduced by Mutch and Lowe adopts the same strategy in the S1 layer and the C1 layer but preserves only the maximum response over all orientations in the S2 layer. This is their version of sparsification.
• Our proposed sparsification is entirely different from the one by Mutch and Lowe: in the S1 layer, it removes uninformative pixels, i.e., pixels with small gradient responses.

To quantitatively compare these two sparsification schemes, we conduct experiments on the CalTech5 database. The experimental results in Fig. 16 show that our sparsification consistently outperforms the original BIM and Mutch and Lowe's method. Moreover, the sparsification by Mutch and Lowe weakens the discrimination of the patches as well as the effectiveness of the feature selection, as explained below.

Fig. 16. Experimental comparison among the BIM, the sparsification by Mutch and Lowe, and our sparsification.

2) Feature Selection:
• Fig. 17 shows that the BIM uses standard GentleBoost for C2 feature selection. The final output is the weighted sum of the selected C2 features.
• The algorithm by Mutch and Lowe uses multi-round SVM for C2 feature selection. In each round, an SVM is trained using the retained C2 features, and then at most 50% of the features with low scores are eliminated. This operation is repeated until the number of remaining C2 features decreases to a predefined value. Fig. 18 shows this procedure.
• Our feature selection works at the "patch" level (see Fig. 19). Each patch produces eight C2 units to train an SVM. Thus, N patches generate N SVMs, each of which constructs a weak learner in the Boosting process. In each round, our feedback chooses the best SVM, corresponding to the most discriminative patch, and calculates its weight according to the performance of the chosen SVM. The final decision is the weighted sum of the outputs of the chosen SVMs.

Fig. 17. Framework of the standard feature selection in the BIM.

Fig. 18. Framework of the feature selection proposed by Mutch and Lowe.

Fig. 19. Framework of the middle-level feature selection proposed in this paper.

Our feature selection scheme performs better than the other two in speed. The EBIM improves the power of each patch, because each patch generates eight C2 units (from four orientations and two scales) rather than one C2 unit as in their algorithms. Moreover, these eight C2 units are combined by training an SVM, which further enhances the patches' discrimination. As a result, we need far fewer patches to train our model, usually less than 1% of the BIM's and 10% of Mutch and Lowe's. Because the computation of the system is approximately proportional to the number of finally preserved patches, our system runs much faster than theirs.

In addition, Mutch and Lowe's algorithm has two limitations. First, it is not adaptive: it requires a predefined value to decide the number of rounds of feature selection, i.e., when to stop the iteration of dropping features. However, different tasks require different numbers of features; for example, 100 patches may suffice for the CalTech5 database, whereas more than 2000 patches are needed for the Caltech-101 database. It is very possible that a fixed number would result in either under-fitting or over-fitting. Second, because of their sparsification (which preserves only the maximum response of the S1 units), the discrimination of each patch in their system is not as strong as that in our system. Thus, their system is still slow (it needs more patches to obtain stable performance) after feature selection.

B. Computer Vision Perspective on the EBIM

In this section, we analyze the differences among the above three feature selection strategies and explore the relationship between the sparsification and the feature selection.

1) What Are the Theoretical Insights of the BIM, the EBIM, and Mutch and Lowe's Model?: In the processing from the S2 layer to the C2 layer, the operations adopted in the BIM and in Mutch and Lowe's model are intrinsically two special cases of the EBIM. The original BIM sums the responses of all S2 units produced by one patch, which is equivalent to assigning equal weights to all S2 units. Mutch and Lowe's sparsification preserves the maximum response of all S2 units produced by one patch, which is equivalent to assigning a weight of 1 to the maximum S2 unit and 0 to the others. Our system preserves all S2 units (from different orientations and scales) and combines them with weights learned by SVMs in the hierarchical feature selection process (see Fig. 19). Thus, the schemes adopted in the original BIM and in Mutch and Lowe's algorithm are two special cases of our hierarchical feature selection process. In effect, we propose an optimal method to make use of the information from all S2 units.

2) There Is No Free Lunch: What Do We Lose and What Do We Gain by Using the Proposed Feature Selection?: The proposed EBIM requires more time to train SVM models. Since the input to each SVM in our feature selection is only eight-dimensional, training each SVM is fast; usually, it takes several minutes to several hours to complete the whole training process. This is practically acceptable. Moreover, by this strategy, we greatly reduce the number of required patches and thus largely speed up the system at testing time.
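The special-case relationship described in item 1) can be written out in a few lines. The sketch below is illustrative NumPy, with random stand-ins for the eight S2 responses of one patch and for the learned SVM parameters.

```python
import numpy as np

s2 = np.random.rand(8)            # stand-in responses: 2 scales x 4 orientations

# Original BIM: sum of all responses == equal weights on every S2 unit.
bim_score = np.ones(8).dot(s2)

# Mutch and Lowe: keep only the maximum == weight 1 on the argmax, 0 elsewhere.
one_hot = np.eye(8)[np.argmax(s2)]
ml_score = one_hot.dot(s2)

# EBIM: weights learned by a linear SVM on the 8-D C2 histogram
# (w and b below are placeholders for the trained parameters).
w, b = np.random.rand(8), 0.1
ebim_score = w.dot(s2) + b
```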


VI. CONCLUSION

In this paper, we have presented an enhanced biologically inspired model (EBIM) with two new functional components that improve Serre's biologically inspired model (BIM): sparsity constraints and a Boosting-based feedback procedure. They address two particular limitations of the BIM, respectively: mismatches due to dense inputs and blind feature selection in the feedforward framework. Experiments on four different kinds of databases have demonstrated that the EBIM performs much better than the BIM in both accuracy and speed for object categorization. Although the EBIM does not achieve the best results on all tested data sets, its results are comparable to those of state-of-the-art algorithms and exhibit the promise of the BIM. It is worth emphasizing that the EBIM is at least 20 times faster than the BIM: it can process about 50 images (128 × 128) per second for object categorization and is promising for real-time applications.

In the future, we would like to further improve the EBIM in the following aspects:
• First, the sparsity constraints can be moved to an earlier stage, e.g., the very beginning of the system, to further speed it up, and more sparsity strategies can be considered besides the gradient and Laplace filters.
• Second, recent variants of Boosting schemes have been demonstrated to be more effective than the conventional one. Thus, tailoring the Boosting scheme to the specific case of the EBIM may yield additional benefits.
• Finally, it is also a promising direction to investigate the machine-learning explanation of the BIM and its extensions. There is still a gap between biological visual systems and computer vision systems.

ACKNOWLEDGMENT

We would like to express our sincere thanks and deep gratitude to the associate editor and the anonymous reviewers for their valuable suggestions, which have greatly improved the quality of the manuscript.

REFERENCES

[1] Y. Z. Huang, K. Q. Huang, D. C. Tao, L. S. Wang, T. N. Tan, and X. L. Li, “Enhanced biological inspired model,” in Proc. CVPR, 2008, pp. 1–8.
[2] T. Serre, L. Wolf, and T. Poggio, “Object recognition with features inspired by visual cortex,” in Proc. CVPR, 2005, pp. 994–1000.
[3] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, “Robust object recognition with cortex-like mechanisms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 3, pp. 411–426, Mar. 2007.
[4] R. Collins, A. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto, and O. Hasegawa, “A system for video surveillance and monitoring,” Robot. Inst., Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-RI-TR-00-12, 2000.
[5] W. M. Hu, T. N. Tan, L. Wang, and S. Maybank, “A survey on visual surveillance of object motion and behaviors,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 34, no. 3, pp. 334–352, Aug. 2004.


[6] A. Yoshitaka and T. Ichikawa, “A survey on content-based retrieval for multimedia databases,” IEEE Trans. Knowl. Data Eng., vol. 11, no. 1, pp. 81–93, Jan./Feb. 1999.
[7] A. Vailaya, M. A. T. Figueiredo, A. K. Jain, and H. J. Zhang, “Image classification for content-based indexing,” IEEE Trans. Image Process., vol. 10, no. 1, pp. 117–130, Jan. 2001.
[8] W. M. Hu, D. Xie, Z. Y. Fu, W. R. Zeng, and S. Maybank, “Semantic-based surveillance video retrieval,” IEEE Trans. Image Process., vol. 16, no. 4, pp. 1168–1181, Apr. 2007.
[9] R. Kosala and H. Blockeel, “Web mining research: A survey,” ACM SIGKDD Explorations Newslett., vol. 2, no. 1, pp. 1–15, Jul. 2000.
[10] V. I. Pavlovic, R. Sharma, and T. S. Huang, “Visual interpretation of hand gestures for human-computer interaction: A review,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 7, pp. 677–695, Jul. 1997.
[11] A. K. Jain, A. Ross, and S. Prabhakar, Eds., Biometrics: Personal Identification in Networked Society. Norwell, MA: Kluwer, 1999.
[12] A. K. Jain, A. Ross, and S. Prabhakar, “An introduction to biometric recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 1, pp. 4–20, Jan. 2004.
[13] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, and P. Yanker, “The QBIC project: Querying images by content using color, texture and shape,” in Proc. SPIE—Storage and Retrieval for Image and Video Databases, 1993, pp. 173–187.
[14] B. Schiele and J. Crowley, “Recognition without correspondence using multidimensional receptive field histograms,” Int. J. Comput. Vis., vol. 36, no. 1, pp. 31–50, Jan. 2000.
[15] B. Leibe and B. Schiele, “Interleaved object categorization and segmentation,” in Proc. Brit. Mach. Vis. Conf., 2003, pp. 1–10.
[16] H. Murase and S. K. Nayar, “Visual learning and recognition of 3-D objects from appearance,” Int. J. Comput. Vis., vol. 14, no. 1, pp. 5–24, Jan. 1995.
[17] M. Turk and A. P. Pentland, “Eigenfaces for recognition,” J. Cognitive Neurosci., vol. 3, no. 1, pp. 71–96, Winter 1991.
[18] C. Harris and M. Stephens, “A combined corner and edge detector,” in Proc. 4th Alvey Vis. Conf., 1988, pp. 147–151.
[19] T. Tuytelaars and L. V. Gool, “Matching widely separated views based on affine invariant regions,” Int. J. Comput. Vis., vol. 59, no. 1, pp. 61–85, Aug. 2004.
[20] K. Mikolajczyk and C. Schmid, “Scale and affine invariant interest point detectors,” Int. J. Comput. Vis., vol. 60, no. 1, pp. 63–86, Oct. 2004.
[21] J. Matas, O. Chum, M. Urban, and T. Pajdla, “Robust wide-baseline stereo from maximally stable extremal regions,” Image Vis. Comput., vol. 22, no. 10, pp. 761–767, Sep. 2004.
[22] P. Viola and M. Jones, “Robust real-time object detection,” in Proc. IEEE Workshop Statist. Comput. Theor. Vis., 2001, pp. 1–25.
[23] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.
[24] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1615–1630, Oct. 2005.
[25] S. Lazebnik, C. Schmid, and J. Ponce, “A sparse texture representation using local affine regions,” Beckman Inst., Univ. Illinois, Urbana, IL, Tech. Rep. CVR-TR-2004-01, 2004.
[26] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. CVPR, 2005, pp. 886–893.
[27] P. F. Felzenszwalb and D. Huttenlocher, “Pictorial structures for object recognition,” Int. J. Comput. Vis., vol. 61, no. 1, pp. 55–79, Jan. 2005.
[28] G. Bouchard and B. Triggs, “Hierarchical part-based visual object categorization,” in Proc. CVPR, 2005, pp. 710–715.
[29] J. Amores, N. Sebe, and P. Radeva, “Context-based object class recognition and retrieval by generalized correlograms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 10, pp. 1818–1833, Oct. 2007.
[30] G. Csurka, C. Bray, C. Dance, and L. Fan, “Visual categorization with bags of keypoints,” in Proc. ECCV, 2004, pp. 1–16.
[31] [Online]. Available: http://cbcl.mit.edu/software-datasets/index.html
[32] J. P. Jones and L. A. Palmer, “An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex,” J. Neurophysiol., vol. 58, no. 6, pp. 1233–1258, Dec. 1987.
[33] M. Riesenhuber and T. Poggio, “Hierarchical models of object recognition in cortex,” Nat. Neurosci., vol. 2, no. 11, pp. 1019–1025, Nov. 1999.
[34] T. Serre, M. Kouh, C. Cadieu, U. Knoblich, G. Kreiman, and T. Poggio, “A theory of object recognition: Computations and circuits in the feedforward path of the ventral stream in primate visual cortex,” MIT, Cambridge, MA, Tech. Rep. MIT-CSAIL-TR-2005-082, 2005. [Online]. Available: http://publications.csail.mit.edu/tmp/MIT-CSAIL-TR-2005-082.pdf

[35] T. Lee and D. Mumford, "Hierarchical Bayesian inference in the visual cortex," J. Opt. Soc. Amer. A, Opt. Image Sci., vol. 20, no. 7, pp. 1434–1448, Jul. 2003.
[36] L. Wolf, S. Bileschi, and E. Meyers, "Perception strategies in hierarchical vision systems," in Proc. CVPR, 2006, pp. 2153–2160.
[37] J. Mutch and D. G. Lowe, "Multiclass object recognition with sparse, localized features," in Proc. CVPR, 2006, pp. 11–18.
[38] S. Bileschi and L. Wolf, "Image representations beyond histograms of gradients: The role of Gestalt descriptors," in Proc. CVPR, 2007, pp. 1–8.
[39] M. M. Jimenez and N. P. de la Blanca, "Empirical study of multi-scale filter banks for object categorization," in Proc. ICPR, 2006, pp. 578–581.
[40] E. Meyers and L. Wolf, "Using biologically inspired features for face processing," Int. J. Comput. Vis., vol. 76, no. 1, pp. 93–104, Jan. 2008.
[41] B. Olshausen and D. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, pp. 607–609, Jun. 1996.
[42] S. Thorpe, D. Fize, and C. Marlot, "Speed of processing in the human visual system," Nature, vol. 381, no. 6582, pp. 520–522, Jun. 1996.
[43] S. Thorpe and M. Fabre-Thorpe, "Seeking categories in the brain," Science, vol. 291, no. 5502, pp. 260–263, Jan. 2001.
[44] R. van Rullen and C. Koch, "Visual selective behavior can be triggered by a feed-forward process," J. Cogn. Neurosci., vol. 15, no. 2, pp. 209–217, Feb. 2003.
[45] C. Keysers, D. K. Xiao, P. Földiák, and D. I. Perrett, "The speed of sight," J. Cogn. Neurosci., vol. 13, no. 1, pp. 90–101, Jan. 2001.
[46] S. Hochstein and M. Ahissar, "View from the top: Hierarchies and reverse hierarchies in the visual system," Neuron, vol. 36, no. 5, pp. 791–804, Dec. 2002.
[47] G. Murphy and D. Medin, "The role of theories in conceptual coherence," Psychol. Rev., vol. 92, no. 3, pp. 289–316, Jul. 1985.
[48] R. E. Schapire, "The strength of weak learnability," Mach. Learn., vol. 5, no. 2, pp. 197–227, Jun. 1990.
[49] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, Aug. 1997.
[50] T. Hastie and R. Tibshirani, "Classification by pairwise coupling," Ann. Statist., vol. 26, no. 2, pp. 451–471, 1998.
[51] C. C. Chang and C. J. Lin, LIBSVM: A Library for Support Vector Machines, 2001. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[52] [Online]. Available: http://www.robots.ox.ac.uk/~vgg/data3.html
[53] [Online]. Available: http://cbcl.mit.edu/software-datasets/streetscenes
[54] A. Opelt, A. Pinz, M. Fussenegger, and P. Auer, "Generic object recognition with boosting," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 3, pp. 416–431, Mar. 2006.
[55] [Online]. Available: http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2007
[56] [Online]. Available: http://opencv.willowgarage.com
[57] [Online]. Available: http://www.intel.com/cd/software/products/asmo-na/eng/302910.htm
[58] R. Fergus, P. Perona, and A. Zisserman, "Object class recognition by unsupervised scale-invariant learning," in Proc. CVPR, 2003, pp. II-264–II-271.
[59] J. Mutch and D. G. Lowe, "Object class recognition and localization using sparse features with limited receptive fields," Int. J. Comput. Vis., vol. 80, no. 1, pp. 45–57, 2008.
[60] D. Hegazy and J. Denzler, "Boosting colored local features for generic object recognition," Pattern Recog. Image Anal., vol. 18, no. 2, pp. 323–327, Jun. 2008.
[61] W. Zhang and T. G. Dietterich, "Learning visual dictionaries and decision lists for object recognition," in Proc. ICPR, 2008, pp. 1–4.
[62] M. Marszalek, C. Schmid, H. Harzallah, and J. van de Weijer, "Learning representations for visual object class recognition," in Proc. ICCV—Workshop Visual Recognition Challenge, 2007, pp. 1–20.

Yongzhen Huang (S’08) received the B.E. degree from Huazhong University of Science and Technology in 2006 and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, in 2011. His current research interests include pattern recognition, computer vision, and biologically inspired vision computing.

Kaiqi Huang (SM’10) received the B.Sc. and M.Sc. degrees from Nanjing University of Science and Technology, China, in 1998 and 2001, respectively, and the Ph.D. degree from Southeast University in 2004. He is with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, China, where he has been an Associate Professor since 2005. He was the Deputy General Secretary of the IEEE Beijing Section from 2006 to 2008.

Dacheng Tao (M’07) is a Professor of Computer Science with the Centre for Quantum Computation and Information Systems and the Faculty of Engineering and Information Technology at the University of Technology, Sydney. He mainly applies statistics and mathematics to data analysis problems in data mining, computer vision, machine learning, multimedia, and video surveillance. He has authored or coauthored more than 100 scientific articles at top venues, including IEEE T-PAMI, T-KDE, T-IP, NIPS, ICML, ICDM, AISTATS, IJCAI, AAAI, CVPR, ECCV, ACM T-KDD, and KDD, and received the Best Theory/Algorithm Paper Runner-Up Award at IEEE ICDM’07.

Tieniu Tan (F’04) received the B.Sc. degree in electronic engineering from Xi’an Jiaotong University, China, in 1984, and the M.Sc. and Ph.D. degrees in electronic engineering from Imperial College of Science, Technology and Medicine, London, in 1986 and 1989, respectively. In October 1989, he joined the Computational Vision Group, Department of Computer Science, University of Reading, United Kingdom, where he worked as a Research Fellow, a Senior Research Fellow, and a Lecturer. In January 1998, he returned to China to join the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CAS), Beijing, where he is currently a Professor and the Director of NLPR and is a former Director-General of the institute (2000–2007). He is also a Deputy Secretary-General of CAS. He has published more than 300 research papers in refereed journals and conference proceedings in the areas of image processing, computer vision, and pattern recognition. His current research interests include biometrics, image and video understanding, information hiding, and information forensics. He has given invited talks at many universities and international conferences. He is or has served as an associate editor or a member of the editorial board of many leading international journals, including the IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Automation Science and Engineering, IEEE Transactions on Information Forensics and Security, IEEE Transactions on Circuits and Systems for Video Technology, Pattern Recognition, Pattern Recognition Letters, Image and Vision Computing, etc. He is the Editor-in-Chief of the International Journal of Automation and Computing and Acta Automatica Sinica. He has served as the chair or a program committee member for many major national and international conferences. He is the Vice Chair of the International Association for Pattern Recognition (IAPR) and a founding chair of the IAPR/IEEE International Conference on Biometrics (ICB) and the IEEE International Workshop on Visual Surveillance. He is a Fellow of the IAPR and a member of the IEEE Computer Society. He currently serves as the Executive Vice President of the Chinese Society of Image and Graphics.

Xuelong Li (SM’07) is a Professor with the Center for OPTical IMagery Analysis and Learning, State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, Shaanxi, P. R. China.