Neurocomputing 74 (2011) 2929–2940
Multi-scale gist feature manifold for building recognition

Cairong Zhao a,b,*,1, Chuancai Liu a, Zhihui Lai a,1

a School of Computer Science, Nanjing University of Science and Technology, Nanjing, Jiangsu 210094, China
b Department of Physics and Electronics, Minjiang College, Fuzhou, Fujian 350108, China
Article info

Article history: Received 30 July 2010; received in revised form 4 March 2011; accepted 31 March 2011; available online 23 May 2011. Communicated by Qi Li.

Keywords: Multi-scale gist feature; Fuzzy gradual neighbor graph; Building recognition; Manifold learning

Abstract

A multi-scale gist (MS-gist) feature manifold for building recognition is presented in this paper. It is described as a two-stage model. In the first stage, we extract the multi-scale gist features that represent the structural information of the building images. Since the MS-gist features are extrinsically high dimensional and intrinsically low dimensional, in the second stage an enhanced fuzzy local maximal marginal embedding (EFLMME) algorithm is proposed to project the MS-gist feature manifold to a low-dimensional subspace. EFLMME aims to preserve the local intra-class geometry and, at the same time, maximize the local interclass margin separability of the MS-gist feature manifolds of different classes. To evaluate the performance of our proposed model, experiments were carried out on the Sheffield buildings database and compared with the existing works: (a) the visual gist based building recognition model (VGBR) and (b) the hierarchical building recognition model (HBR). Moreover, EFLMME is evaluated on the Sheffield buildings database and compared with some linear dimensionality reduction methods. The results show that the proposed model is superior to other models in the practice of building recognition and can handle the building recognition problems caused by rotations, variant lighting conditions and occlusions very well.
1. Introduction

Object recognition is an important area in computer vision research. There is a large body of literature [1–6] on general object recognition. As a relatively specific task within object recognition, building recognition has been somewhat neglected. Developing a computational model of building recognition is difficult because building images may be taken from different viewpoints (please see Figs. 3(c) and 4(a)), under different lighting conditions (please see Fig. 4(c)), or suffer from occlusions caused by trees, moving vehicles, other buildings or the buildings themselves (see Figs. 3(a) and 4(b)). More details can be seen in the experiment section. This domain has become more and more interesting to researchers since building recognition can be utilized in many practical application areas, such as architectural design, building labeling in videos, robot vision or localization [7] and mobile device navigation [8,9]. Hutchings and Mayol-Cuevas [9] proposed to use the Harris corner detector [10], based on the local auto-correlation function, to extract interest points for matching buildings in the world
* Corresponding author at: School of Computer Science, Nanjing University of Science and Technology, Nanjing, Jiangsu 210094, China. Tel.: +86 15850559315; fax: +86 025 84315751.
E-mail address: [email protected] (C. Zhao).
1 The authors contributed equally to this work.
space for mobile devices. In [11], perceptual grouping rules were applied to explore the semantic relationships between different primitive image features, and based on these rules a Bayesian framework was proposed for content-based image retrieval (CBIR) of buildings. Stassopoulou [12] proposed to use Bayesian networks for detecting and recognizing buildings from digital orthophotos. The semantic relationships among low-level visual features were explored using perceptual grouping in a hierarchical way, so as to fit human behavior as closely as possible. Li et al. [13] proposed to use SIFT [14] descriptors to extract distinctive invariant features from images, which provides reliable matching between different views of objects or scenes. Fritz et al. [15] proposed to apply the 'Informative Descriptor Approach' to SIFT features (i-SIFT descriptors) to provide both robust building detection and recognition by extracting color, orientation and spatial information for each line segment. Zhang and Kosecka [16] proposed a hierarchical approach for building recognition based on localized color histograms in the first step and refined matching of SIFT descriptors in the second step. With the quick retrieval of a small number of candidate buildings using localized color histograms, this method achieved some improvement in the efficiency of building image retrieval. Nevertheless, the building recognition systems mentioned above suffer from their dependence on the detection of low-level vision features [18], e.g., line segments, vanishing points, etc. Since low-level features cannot reveal the truly semantic concepts of images, Li et al. [17]
proposed a building recognition model integrating low-level vision features into mid-level features, which attempt to reveal the truly semantic concepts of the images. Due to this approximately semantic feature extraction, the method obtains some improvement in building recognition. By pooling together the activity of local low-level feature detectors across large regions of the visual field, Oliva and Schyns [19] and Oliva and Torralba [20] built a holistic and low-dimensional representation of the structure of a scene. High-level properties of an image, such as the degree of perspective, have been found to be correlated with the configuration of low-level image features [21–23]. Evidence from the psychophysics literature suggests that our visual system analyzes a global statistical summary of the image in a pre-selective stage of visual processing or, at least, with minimal attentive resources [24]. This suggests that a reliable gist representation of a building can also be abstracted from spatial information by performing 2D Fourier transform analyses of the image. Series of behavioral experiments [25–27] showed that visual input is processed at different spatial scales (from low to high spatial frequency), which offer different qualities of information for recognition purposes. Additional experiments [28] showed that the visual system can select certain spatial scales to process depending on task constraints. As a specific recognition task in pattern recognition, a building recognition system may capture the holistic information of buildings at different scales. Based on these observations, we propose to use the multi-scale gist (MS-gist) feature to describe building images, which is the first contribution of this paper. Since the MS-gist features of a building image taken from different viewpoints or under different lighting conditions are extrinsically high dimensional and intrinsically low dimensional, it is necessary to obtain a more compact representation of the gist features by further dimensionality reduction. Researchers have developed many useful dimensionality reduction techniques [8,29–36]. Principal component analysis (PCA) [32] and linear discriminant analysis (LDA) [30,33] are two well-known linear subspace learning methods. PCA projects the data along the directions of the largest variance of the data, while LDA maximizes the ratio of the between-class scatter to the within-class scatter. But neither of them takes the local geometric structure of the feature manifold into account. Recently, manifold learning has become an important field in pattern recognition and machine learning. Manifold learning algorithms for dimensionality reduction aim at discovering and preserving the intrinsic low-dimensional compact representation of the high-dimensional data. The most well-known linear dimensionality reduction method based on manifold learning is locality preserving projections (LPP) [34], which learns a linear subspace approximating the eigenfunctions of the Laplacian Eigenmap [35]. These geometric-structure-based methods frequently use the K-nearest neighbor graph to model the local structure of the dataset. The graphs used in the algorithms are very important since each element of the graphs reflects the relationship between two data points. However, a common drawback of existing graph-based manifold learning algorithms is that the graphs are constructed with a Gaussian kernel function or a binary pattern over the local K-nearest neighborhood data points.
Since the distances between the samples in the local K-nearest neighborhood may vary over a wide range, the graphs constructed in this way have the potential disadvantage that the weights are not in accordance with the natural relations of the data in actual applications. Therefore we present a new approach called enhanced fuzzy local maximal marginal embedding (EFLMME). In EFLMME, two novel graphs called enhanced fuzzy neighborhood graphs are constructed by the fuzzy k-nearest neighbor (FKNN) method [37,38]. The fuzzy k-nearest neighbor method is utilized to model the natural distribution information of the original samples,
and this information is utilized to redefine the affinity weights of the intra-class and interclass neighborhood graphs instead of using the binary pattern or Gaussian kernel function. With the two novel graphs, the enhanced fuzzy local maximal marginal embedding (EFLMME) algorithm is proposed to further reduce the dimensionality of the multi-scale gist feature manifold. This is the second contribution of this paper. In summary, in this paper we mainly focus on the study of effective and robust feature extraction in the building recognition model. Since the features are sampled from a low-dimensional manifold embedded in a high-dimensional space [39], it is essential to obtain a suitable mapping to implement the dimensionality reduction. The proposed model is described as a two-stage model. In the first stage, we capture the MS-gist feature representation of the structure of the building images. In the second stage, the enhanced fuzzy local maximal marginal embedding algorithm (EFLMME) is proposed to further reduce the dimensionality of the MS-gist feature manifold. To evaluate the performance of our proposed model, experiments were carried out on the Sheffield buildings database and compared with existing works: (a) the visual gist based building recognition model (VGBR), in which each image is described by a gist feature based on visual features [17]; (b) the hierarchical building recognition model (HBR), in which each image is represented by a localized color histogram indexing vector [16]. At the same time, EFLMME is evaluated on the Sheffield buildings database and compared with PCA, LDA, LPP and the recently proposed DLA. The results show that the proposed model is an effective method. The remainder of this paper is organized as follows: Section 2 describes the MS-gist feature representation of the building image. Section 3 introduces the enhanced fuzzy local maximal marginal embedding algorithm. Section 4 evaluates the recognition performance on the Sheffield buildings database. Section 5 presents the conclusions.
2. MS (multi-scale)-gist feature representation

2.1. Outline of the holistic representation of the spatial envelope

Evidence from the psychophysics literature suggests that the human visual system analyzes a global statistical summary of the image in a pre-selective stage of visual processing [24]. By pooling together the activity of local low-level feature detectors across large regions of the visual field, Oliva and Torralba [20] and Oliva [21] built a holistic and low-dimensional representation of the structure of a scene. Here, it is introduced briefly as follows. Let X be a color image with resolution of N × M, and let its three color components be R, G and B. The image is converted to one intensity image

i(x,y) = \frac{1}{3}(R + G + B)    (1)
Next, the intensity image is discrete Fourier transformed, which is defined as

I(f_x, f_y) = \sum_{x=0}^{N-1} \sum_{y=0}^{M-1} i(x,y)\, h(x,y)\, e^{-j2\pi(f_x x + f_y y)}    (2)
where f_x and f_y are the spatial frequency variables and h(x,y) is a circular Hanning window used to reduce boundary effects. A(f_x, f_y) = |I(f_x, f_y)| is the Fourier amplitude spectrum, which is an N × (M/2 + 1) element array of real numbers. Clearly, the amplitude spectrum provides a high-dimensional representation of the input image. Dimensionality reduction is achieved by principal component analysis (PCA), considering only the KL functions that account for the maximal variability of the signal described. Therefore, the
coefficients of the decomposition are obtained as the holistic and low-resolution representation of the input image.

2.2. The MS-gist feature representation

Series of behavioral experiments [25–27] showed that visual input is processed at different spatial scales. Different spatial scales offer different qualities of information for recognition purposes. As a specific recognition task in pattern recognition, the holistic information of building images at different scales may be helpful to improve building recognition. Based on this observation, we propose a MS-gist representation for building recognition. The procedure is described in detail as follows. Firstly, the color building image with resolution of N × M is converted to one intensity image. Then we decompose the intensity image at multiple spatial scales

p_0(x,y) = i(x,y),    p_s(x,y) = (\downarrow 2)\, p_{s-1}(x,y),  s = 1, ..., S    (3)

where (\downarrow 2) refers to the down-sampling operation and s represents the spatial scale. By experiment we found that the number of spatial scales S = 1 could achieve satisfying results. In order to reduce illumination effects and prevent some local image regions from dominating the energy spectrum, pre-filtering should be applied to the intensity building image at different scales. Differing from subtraction of the mean intensity of the image and division by the standard deviation of the intensity image [46], the proposed pre-filtering consists in a local normalization of the intensity variance

p'_s(x,y) = \frac{p_s(x,y) * h(x,y)}{\sqrt{\varepsilon + [p_s(x,y) * h(x,y)]^2 * g(x,y)}}    (4)

where * denotes convolution, g(x,y) is an isotropic low-pass Gaussian spatial filter with a radial cut-off frequency at 0.015 c/p (cycles/pixel) and h(x,y) = 1 − g(x,y). The numerator is a high-pass filter that cancels the mean intensity value of the image and whitens the energy spectrum at the very low spatial frequencies. The denominator acts as a local constant that avoids noise enhancement in constant image regions. We set experimentally ε = 20 for input images with intensity values in the range [0, 255]. This pre-filtering stage affects only the very low spatial frequencies (below 0.015 c/p) and does not change the mean spectral signature. Compared with the local normalization method in [46], the image change after pre-filtering is shown in Fig. 1. From Fig. 1(c), we can see that the histogram has a unimodal distribution after the proposed pre-filtering of the intensity image. Compared with the multimodal distribution of the histogram in Fig. 1(f), the unimodal distribution is more beneficial to later classification tasks. It is clear that the proposed pre-filtering stage achieves a good normalizing result and better reduces the variation of illumination. After pre-filtering, for each scaled intensity image, the corresponding spatial distribution of spectral information can be described by the discrete Fourier transform

P_s(f_x, f_y) = \sum_{x=0}^{N-1} \sum_{y=0}^{M-1} p'_s(x,y)\, h(x,y)\, e^{-j2\pi(f_x x + f_y y)}    (5)

where f_x and f_y are the spatial frequency variables and h(x,y) is a circular Hanning window. The energy spectrum, A_s(f_x,f_y)^2 = |P_s(f_x,f_y)|^2, provides structural information that represents the spatial frequencies spread everywhere in the image and thus information about the orientation, smoothness, length and width of the contours that compose the building image. Therefore, the energy spectrum provides a building representation invariant with respect to object arrangements and object identities, encoding only the dominant structural patterns present in the image. The dimensionality of the energy spectrum is N × M. Obviously, the energy spectrum provides a high-dimensional representation of the input image. A way to reduce the dimensionality is to reduce the number of samples of the energy spectrum function. The dimensionality reduction function is defined as

g_{si} = \langle A_s^2, G_i \rangle = \iint A_s^2(f_x, f_y)\, G_i(f_x, f_y)\, df_x\, df_y    (6)
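To make the multi-scale decomposition, pre-filtering and spectral steps concrete, the following Python sketch follows our reading of Eqs. (3)–(5). It is only an illustration: the NumPy/SciPy calls, the function names and the pixel-domain Gaussian width used in place of the 0.015 c/p radial cut-off are our own assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def prefilter(intensity, eps=20.0, sigma=8.0):
    """Local normalization of the intensity variance (Eq. (4)).

    `sigma` is an assumed pixel-domain width for the low-pass Gaussian g(x,y);
    the paper specifies a 0.015 cycles/pixel radial cut-off instead.
    """
    # h = 1 - g acts as a whitening high-pass filter: i - lowpass(i)
    highpassed = intensity - gaussian_filter(intensity, sigma)
    # local energy smoothed by the low-pass Gaussian g(x,y)
    local_energy = gaussian_filter(highpassed ** 2, sigma)
    return highpassed / np.sqrt(eps + local_energy)

def multiscale_pyramid(intensity, num_scales=2):
    """Eq. (3): p_0 = i, p_s = downsample-by-2 of p_{s-1} (S = 1, i.e. two scales)."""
    scales = [intensity.astype(float)]
    for _ in range(num_scales - 1):
        scales.append(scales[-1][::2, ::2])
    return scales

def energy_spectrum(p):
    """Eq. (5): windowed DFT of the pre-filtered image; returns |P_s|^2."""
    n, m = p.shape
    # separable Hanning window as a stand-in for the circular Hanning window
    window = np.hanning(n)[:, None] * np.hanning(m)[None, :]
    spectrum = np.fft.fftshift(np.fft.fft2(p * window))
    return np.abs(spectrum) ** 2
```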
Fig. 1. From left to right: (a) the intensity image of the input image; (b) the output of the proposed pre-filtering applied to the intensity image; (c) the histogram of (b); (d) the value image (HSV) of the input image; (e) the result of normalizing the value image; (f) the histogram of (e).
Fig. 2. Gist feature extraction procedure for a color building image ((1) convert the input image to the intensity image; (2) the pre-filtering stage; (3) extract the 32 spectrum maps at 4 frequency bands and 8 orientations; (4) sample each spectrum map over 4 × 4 grid sub-regions; (5) reshape the means of the sub-regions and obtain the gist feature).
G_i is a set of Gaussian functions arranged in a log-polar array, calculated by rotating and scaling the function

G(f_x, f_y) = e^{-f_y^2/\sigma_y^2} \left( e^{-(f_x - f_0)^2/\sigma_x^2} + e^{-(f_x + f_0)^2/\sigma_x^2} \right)    (7)

We use a set of Gaussians distributed in 4 frequency bands with central frequencies f_0 at 0.02, 0.04, 0.08 and 0.16 c/p and 8 orientations in each band. So, there are 32 Gabor filters of varying frequency bands and orientations. The energy spectrum A_s^2(f_x,f_y) is then represented by a series of maps g_s = {g_si}, i = 1, ..., L, with L = 32. The spectrum maps g_s can be sampled at coarse resolution, which benefits classification tasks; this was testified in [20]. So, we decompose each g_si into 4 × 4 grid sub-regions. Then the mean value of each sub-region is calculated for the final representation, i.e. 16 mean values are utilized to represent each g_si, and 32 × 16 = 512 ≪ (N × M) values of gist feature are obtained for image representation at each scale. The whole procedure is shown in Fig. 2. Following this procedure at different scales, we obtain a MS-gist feature of each image. Then, we reshape the MS-gist feature map into a column vector g'. Finally, each building image is represented by a column vector g' and the building dataset can be described as a MS-gist feature space. As illustrated in the literature [34,35,40–44], since the high-dimensional vectors g' are obtained from building images taken from different viewpoints or under different lighting conditions, the g' of a building lie on a sub-manifold, which is extrinsically high dimensional and intrinsically low dimensional. Therefore, it is necessary to obtain a more compact low-dimensional representation of the gist feature manifold using more effective dimensionality reduction techniques. The following section will focus on developing a more effective linear dimensionality reduction method to enhance the performance on building recognition.
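The pooling stage of Eqs. (6) and (7) and Fig. 2 can be sketched as follows, in the spirit of the standard gist construction: the pre-filtered image is filtered by 32 frequency-domain Gaussians (4 bands × 8 orientations) and the resulting energy maps are averaged over a 4 × 4 grid, giving 512 values per scale. The bandwidth values σ_x, σ_y and the exact filter layout are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def gabor_transfer_functions(shape, n_orient=8,
                             center_freqs=(0.02, 0.04, 0.08, 0.16)):
    """Build 32 frequency-domain Gaussians arranged in a log-polar array (Eq. (7))."""
    n, m = shape
    fy, fx = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(m), indexing="ij")
    filters = []
    for f0 in center_freqs:
        for k in range(n_orient):
            theta = np.pi * k / n_orient
            # rotate the frequency plane so each filter is tuned to one orientation
            fr = fx * np.cos(theta) + fy * np.sin(theta)
            ft = -fx * np.sin(theta) + fy * np.cos(theta)
            sx, sy = 0.4 * f0, 0.4 * f0          # assumed bandwidths
            g = np.exp(-ft**2 / sy**2) * (np.exp(-(fr - f0)**2 / sx**2)
                                          + np.exp(-(fr + f0)**2 / sx**2))
            filters.append(g)
    return filters

def gist_descriptor(prefiltered, grid=4):
    """Average the filtered energy over a grid x grid partition: 32 * 16 = 512 values."""
    spectrum = np.fft.fft2(prefiltered)
    feats = []
    for g in gabor_transfer_functions(prefiltered.shape):
        response = np.abs(np.fft.ifft2(spectrum * g))      # oriented band-pass energy map
        n, m = response.shape
        for by in np.array_split(np.arange(n), grid):
            for bx in np.array_split(np.arange(m), grid):
                feats.append(response[np.ix_(by, bx)].mean())
    return np.asarray(feats)                               # length 512 per scale
```

Concatenating the descriptors of the two scales (s = 0, 1) then gives the 512 × 2 = 1024-dimensional MS-gist vector used in the experiments.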
3. Enhanced fuzzy local maximal margin embedding (EFLMME)

In order to remove redundant and useless information and alleviate the computational complexity in building recognition while preserving sufficient discriminative information for the subsequent recognition stage, dimensionality reduction is applied to the MS-gist features of the building images. Here, the high-dimensional ambient space is the MS-gist feature space. Our objective is to find an optimal mapping that projects the high-dimensional MS-gist features to a low-dimensional subspace for classification.

3.1. Overview of recent works on dimensionality reduction

A large number of dimensionality reduction techniques have been developed in the literature. Among them, there are two main fields: conventional linear dimensionality reduction techniques and graph-based manifold learning algorithms. Two of the conventional linear dimensionality reduction methods are principal component analysis (PCA) [32] and linear discriminant analysis (LDA) [30,33], which were widely used in face recognition and called Eigenfaces and Fisherfaces, respectively. PCA, which is an unsupervised method, aims to preserve the total variance by maximizing the trace of the feature variance matrix. PCA may be unsuitable for classification tasks since the label information is not taken into account. As a supervised learning algorithm, LDA aims to maximize the between-class scatter and minimize the within-class scatter simultaneously. Thus, LDA usually performs better than PCA in classification tasks. However, for a c-class classification task, if the dimension of the projected subspace is strictly lower than c − 1, the projection to a subspace tends to merge those classes that are close together in the original feature space. Aiming to solve the class separation problem, Tao et al. [40] replaced the arithmetic mean by a general mean function and proposed a general averaged divergence analysis algorithm. By choosing different mean functions, a series of subspace selection algorithms is obtained. The geometric mean for subspace selection [41], which maximizes the geometric mean of the KL divergences between different class pairs, was proposed. This geometric mean method achieved good performance in dealing with the class separation problem. Bian and Tao [42] introduced the harmonic mean for subspace selection, which maximizes the harmonic mean of the symmetric KL divergences between all class pairs. At the same time, many novel graph-based manifold learning algorithms have arisen in recent years. One of the most representative methods in this field is locality preserving projections (LPP) [34], which introduces the local geometry and neighboring relationship into the objective function. With ideas similar to those used in LPP, Zhang et al. [43] presented a general framework, i.e. patch alignment, for dimensionality reduction and developed a new method called discriminative locality alignment (DLA) by imposing discriminative information in the part optimization stage. Zhou et al. [44] proposed the manifold elastic net (MEN), which incorporates the merits of both the manifold learning
based and the sparse learning based dimensionality reduction. Both DLA and MEN can be viewed as graph-based dimensionality reduction methods, which construct a special graph to characterize the local structure of the dataset. The proposed EFLMME is significantly different from the existing methods, including the recently proposed DLA and MEN. EFLMME focuses on designing two refined graphs to characterize the local compactness and local separability by introducing the fuzzy neighborhood membership degree of each sample. DLA can be viewed as a linear dimensionality reduction method which uses local patch alignment to construct a special graph that reflects the structure of the dataset. Similar to DLA, MEN also uses local patch alignment to characterize the manifold structure, combining the Elastic Net [47] to learn sparse, interpretable projections.
3.2. The idea of the proposed method

In manifold learning based methods for feature extraction, samples on the margin of different classes may be projected to neighboring points in the low-dimensional space. Moreover, the graphs are constructed based on the assumption that the same membership degree is assigned to each sample in the same class or in the local neighborhood during construction of the affinity graphs, as can be seen in LPP and its supervised version. Since the distances between the samples in a local K-nearest neighborhood may vary over a wide range, the graphs constructed in this way have the potential disadvantage that the weights are not in accordance with the natural relations of the data in actual applications. This may lead to misclassification. So in this stage, we present an enhanced fuzzy local maximal marginal embedding (EFLMME) algorithm. In EFLMME, the enhanced fuzzy k-nearest neighbor method is implemented in order to model the natural distribution information of the original samples. Then this information is utilized to redefine the affinity weights of the neighborhood graphs instead of using the binary pattern or Gaussian weighted pattern. Through the fuzzy gradual pattern, two novel fuzzy gradual graphs (intra-class and interclass) are constructed. During the construction of the fuzzy gradual intra-class graph, the main idea is that the nearer the neighbors, the greater the fuzzy weights. During the construction of the fuzzy gradual interclass graph, the main idea is that the farther the neighbors, the greater the fuzzy weights. In this way, the resulting projections map the near neighbor samples in the same class closer to each other and repel the far neighbor samples on the margin between different classes farther from each other when they are projected to the low-dimensional subspace. The details are described as follows.
3.3. Computation of enhanced fuzzy membership degree

EFLMME designs an enhanced fuzzy intra-class membership degree and an interclass membership degree. For the enhanced fuzzy intra-class membership degree, the k-nearest neighbors are characterized through their contributions to the intra-class compactness: the nearer the neighbors, the greater the fuzzy weights. The k-nearest neighbors in the enhanced fuzzy interclass membership degree are related to the interclass separability: the farther the neighbors, the greater the fuzzy weights. The enhanced fuzzy intra-class membership degree is defined as

u_{ic} = 0.51 + 0.49\,(n_{ic}/k)  if the ith sample \in cth class,   u_{ic} = 0.49\,(n_{ic}/k)  otherwise,
where n_{ic} = \sum_{q=1}^{k} (1+\varepsilon)^{k-q},  0 < \varepsilon < 0.1    (8)

where n_{ic} stands for the degree of the neighbors of the ith data point (pattern) belonging to the cth class and ε is a constant value that describes the compactness degree within the same class. The value of ε is set to 0.02. q represents the qth neighbor of the sample within the same class. The nearer the neighbor is, the more useful it is for classification. Therefore, the nearest neighbor has the biggest weight (1+ε)^{k−1}, and the weight for the kth neighbor is 1. Based on the concept described above, it is expected that the closer samples within the same class become more compact. As a result, a more reliable intra-class membership degree description can be achieved in the low-dimensional space. Thus, when the intra-class membership degree is used to construct the intra-class graph, the resulting graph can better reflect the intra-class data structure. The enhanced fuzzy interclass membership degree is defined in a similar form as

p_{ic} = 0.51 + 0.49\,(n_{ic}/k)  if the ith sample \notin cth class,   p_{ic} = 0.49\,(n_{ic}/k)  otherwise,
where n_{ic} = \sum_{q=1}^{k} (1+\beta)^{q-1},  0 < \beta < 0.1    (9)

where β is a constant representing the separation degree between different classes. The value of β is set to 0.02. For the interclass membership degree, a farther neighbor is more useful for classification. The farthest neighbor, i.e. the kth neighbor, has the biggest weight (1+β)^{k−1}. Thus the far samples belonging to different classes are expected to become more separate in the low-dimensional space. In this way, a more reliable interclass membership degree description is obtained and can be used to create an interclass graph that better reflects the interclass data structure.
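A minimal sketch of how the membership degrees of Eqs. (8) and (9) could be computed is given below. Note one assumption: following the text's description of n_ic as the weighted degree of the neighbors of the ith sample that belong to the cth class, the weighted sums are restricted to neighbors of class c; the Euclidean k-nearest-neighbor search and the variable names are also ours.

```python
import numpy as np

def fuzzy_membership_degrees(X, labels, k=5, eps=0.02, beta=0.02):
    """Enhanced fuzzy intra-class (u) and interclass (p) membership degrees.

    X: (n_samples, n_features) MS-gist features; labels: (n_samples,) class ids.
    Neighbors are weighted (1+eps)^(k-q) for the intra-class degree and
    (1+beta)^(q-1) for the interclass degree, with the 0.51/0.49 fuzzy kNN offsets.
    """
    X, labels = np.asarray(X, float), np.asarray(labels)
    n = X.shape[0]
    classes = np.unique(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    order = np.argsort(dists, axis=1)[:, :k]         # indices of the k nearest neighbors

    u = np.zeros((n, classes.size))                  # intra-class membership u_ic
    p = np.zeros((n, classes.size))                  # interclass membership p_ic
    q = np.arange(1, k + 1)
    for i in range(n):
        for ci, c in enumerate(classes):
            in_c = labels[order[i]] == c             # which neighbors belong to class c
            n_intra = np.sum((1 + eps) ** (k - q) * in_c)
            n_inter = np.sum((1 + beta) ** (q - 1) * in_c)
            u[i, ci] = (0.51 if labels[i] == c else 0.0) + 0.49 * n_intra / k
            p[i, ci] = (0.51 if labels[i] != c else 0.0) + 0.49 * n_inter / k
    return u, p
```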
3.4. Construction of fuzzy gradual neighborhood graphs

Suppose that the mapping from x_i to y_i is ω = [ω_1, ω_2, ..., ω_d], i.e. y_i = ω^T x_i. By following the graph embedding formulation, the interclass separability, namely the margin, is characterized by a fuzzy gradual interclass graph

S_S = \sum_{ij} \| \omega^T x_i - \omega^T x_j \|^2 W^s_{ij} = 2\left( \sum_i \omega^T x_i D^s_{ii} x_i^T \omega - \sum_{ij} \omega^T x_i W^s_{ij} x_j^T \omega \right) = 2(\omega^T X D^S X^T \omega - \omega^T X W^S X^T \omega) = 2\,\omega^T X (D^S - W^S) X^T \omega    (10)

W^s_{ij} = p_{(i,n)}\, p_{(j,m)}  if i \in N_{k_1}(j), i \in Class m or j \in N_{k_1}(i), j \in Class n;   W^s_{ij} = 0  otherwise    (11)

where W^S characterizes the affinity weights between classes, in which each element W^s_{ij} refers to the weight of the edge between x_i and x_j in different classes, D^S is a diagonal matrix with diagonal elements D^s_{ii} = \sum_j W^s_{ij}, and N_{k_1}(i) indicates the index set of the k_1 nearest neighbors of the sample x_i in different classes.
Intra-class compactness is calculated from the enhanced fuzzy intra-class graph

S_C = \sum_{ij} \| \omega^T x_i - \omega^T x_j \|^2 W^c_{ij} = 2\left( \sum_i \omega^T x_i D^c_{ii} x_i^T \omega - \sum_{ij} \omega^T x_i W^c_{ij} x_j^T \omega \right) = 2(\omega^T X D^C X^T \omega - \omega^T X W^C X^T \omega) = 2\,\omega^T X (D^C - W^C) X^T \omega    (12)

W^c_{ij} = \mu_{(i,m)}\, \mu_{(j,m)}  if (i \in N^{+}_{k_2}(j) or j \in N^{+}_{k_2}(i)) and i, j \in Class m;   W^c_{ij} = 0  otherwise    (13)

where W^C characterizes the affinity weights within the same class, in which each element W^c_{ij} refers to the weight of the edge between x_i and x_j in the same class, D^C is a diagonal matrix with diagonal elements D^c_{ii} = \sum_j W^c_{ij}, and N^{+}_{k_2}(i) indicates the index set of the k_2 nearest neighbors of the sample x_i in the same class.

3.5. Objective function

The proposed algorithm is expected to find the optimal projections that minimize the enhanced fuzzy intra-class graph and simultaneously maximize the enhanced fuzzy interclass graph. We then have the following constrained optimization problem:

Maximize  S_S = \sum_i \sum_j \| \omega^T x_i - \omega^T x_j \|^2 W^s_{ij}   subject to   S_C = \sum_i \sum_j \| \omega^T x_i - \omega^T x_j \|^2 W^c_{ij} = 1    (14)

The above criterion is formally similar to the Fisher criterion since they are both Rayleigh quotient problems. Therefore, we can obtain its optimal solutions by solving the generalized eigen-equation

X(D^s - W^s) X^T \omega = \lambda X(D^c - W^c) X^T \omega    (15)

where ω is the generalized eigenvector corresponding to the generalized eigenvalue λ. Then, we can select the eigenvectors associated with the d largest eigenvalues as the optimal projection matrix P, where P = [ω_1, ω_2, ..., ω_d].

3.6. Procedure of MS-gist feature manifold model

The proposed algorithm is described as a two-stage model. In the first stage, the MS-gist feature is extracted. In the second stage, EFLMME is applied to the multi-scale gist feature space for dimensionality reduction. The following steps summarize the proposed algorithm:

Step 1. Convert the input color image to an intensity image using Eq. (1) and decompose the intensity image at different scales using Eq. (3).
Step 2. Filter the intensity image at different scales using Eq. (4).
Step 3. Compute the MS-gist feature using Eqs. (5) and (6).
Step 4. Compute the fuzzy gradual intra-class and interclass membership degree matrices on the MS-gist feature space using Eqs. (8) and (9), respectively.
Step 5. Compute the affinity weights using Eqs. (11) and (13), respectively. An edge is added between x_i and x_j from the same class if x_j is one of x_i's k-nearest neighbors. For the interclass graph, x_i is connected with x_j from different classes if x_j is one of x_i's k-nearest neighbors.
Step 6. Create the fuzzy gradual intra-class compactness graph and the fuzzy gradual interclass separability graph using Eqs. (10) and (12), respectively.
Step 7. Solve the generalized eigen-function of Eq. (15) and obtain the optimal projection matrix P_EFLMME.

Once the projection matrix P_EFLMME is obtained using the EFLMME algorithm, the nearest neighbor classifier can be used for classification.
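Under our reading of Eqs. (10)–(15) and Steps 4–7, the graph construction and the generalized eigen-solution could be sketched as follows. Here X stores samples in rows, so X^T L X corresponds to X(D − W)X^T in the paper's column-sample notation; the small ridge added to the right-hand matrix and the SciPy solver choice are our own assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

def eflmme_projection(X, labels, u, p, k1=5, k2=5, dim=30):
    """Sketch of EFLMME: returns an (n_features, dim) projection matrix.

    X: (n_samples, n_features); u, p: membership matrices from Eqs. (8)-(9);
    k1 / k2: interclass / intra-class neighborhood sizes.
    """
    X, labels = np.asarray(X, float), np.asarray(labels)
    n = X.shape[0]
    col = {c: i for i, c in enumerate(np.unique(labels))}
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)

    Ws = np.zeros((n, n))   # interclass (margin) graph, Eq. (11)
    Wc = np.zeros((n, n))   # intra-class (compactness) graph, Eq. (13)
    for i in range(n):
        diff = np.where(labels != labels[i])[0]
        same = np.where(labels == labels[i])[0]
        for j in diff[np.argsort(dists[i, diff])][:k1]:
            w = p[i, col[labels[j]]] * p[j, col[labels[i]]]
            Ws[i, j] = Ws[j, i] = w
        for j in same[np.argsort(dists[i, same])][:k2]:
            w = u[i, col[labels[i]]] * u[j, col[labels[j]]]
            Wc[i, j] = Wc[j, i] = w

    Ls = np.diag(Ws.sum(axis=1)) - Ws               # D^S - W^S
    Lc = np.diag(Wc.sum(axis=1)) - Wc               # D^C - W^C
    A = X.T @ Ls @ X                                # left-hand matrix of Eq. (15)
    B = X.T @ Lc @ X + 1e-6 * np.eye(X.shape[1])    # small ridge for numerical stability
    evals, evecs = eigh(A, B)                       # generalized eigenproblem, Eq. (15)
    return evecs[:, np.argsort(evals)[::-1][:dim]]  # d largest eigenvalues
```

The learned projection is then applied as y = P^T g' to each MS-gist vector before nearest neighbor classification.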
4. Experiments and discussions

To evaluate the performance of our proposed model, experiments were carried out on the Sheffield buildings database [48] and compared with other models: (a) the visual gist based building recognition model (VGBR), in which each image is described by a gist feature based on visual features [17]; (b) the hierarchical building recognition model (HBR), in which each image is represented by a localized color histogram indexing vector [16]. At the same time, EFLMME is evaluated on the Sheffield buildings database and compared with PCA, LDA, LPP and DLA.

The Sheffield buildings database includes 40 buildings, i.e. 40 categories in total. For each building, the number of images varies from 35 to 334. In summary, there are 3192 images of size 120 × 160. The building database presents a variety of challenges, i.e. rotation, variant lighting conditions, viewpoint changes, occlusions and vibration. The images were taken at different times on separate days, covering early morning, noon, mid-afternoon and early evening. This leads to highly variable lighting conditions and makes the building recognition task more challenging. Buildings include churches and a variety of modern buildings, such as exhibition halls. Images for each building were taken from different viewpoints, varying from three to nine views; by moving the camera from side to side, video clips with multiple views were also obtained. Buildings may be occluded by trees, moving vehicles, other buildings or themselves. Furthermore, some videos were captured by walking from one side to another, which creates the additional challenge of movement/camera shake. For each building, the number of images differs significantly in the original database. In order to test the performance of the proposed framework with both small and large numbers of training samples, two subsets (i.e. subsets I and II) were selected from the Sheffield building database and used in our experiments. Subset I consists of 40 building categories, and the first 18 images for each building in the original database were selected and used in the experiments; thus there were 720 images in subset I in total. Some sample images from subset I are shown in Fig. 3. Subset II consists of 39 building categories and the first 60 images for each building were selected and used in the following experiments. The total number of images in subset II is 2340. Some sample images from subset II are shown in Fig. 4.

In the proposed model, we first converted the input RGB color image to an intensity image using Eq. (1) and decomposed the intensity image at different scales (s = 0, 1) using Eq. (3). For each intensity image at the different scales, we filtered the intensity image using Eq. (4) and then extracted the MS-gist feature of each building image using Eqs. (5) and (6). But the MS-gist feature space still contains redundant and useless information and the dimensions of the feature are still high. In order to remove this redundant information and alleviate the computational complexity in classification, we then use the proposed EFLMME algorithm for further dimensionality reduction. In this stage, we computed the fuzzy gradual intra-class and interclass membership degree matrices on the MS-gist feature space using Eqs. (8) and (9), and then created the fuzzy gradual intra-class compactness
Fig. 3. (a)–(c) are sample images from categories 12, 25 and 39, respectively.
Fig. 4. (a)–(c) are sample images from categories 5, 10 and 37, respectively.
graph and the fuzzy gradual interclass separability graph using Eqs. (10) and (12). Subsequently, we solved the generalized eigen-function of Eq. (15) and obtained the optimal projection matrix. Finally, the nearest neighbor classifier was utilized for classification. The number of intra-class local neighbors was chosen as k1 = l − 1 in the EFLMME, where l denotes the number of training samples per category. This choice is testified to be reasonable in the observation space in [45].

4.1. Experiments on the subset I

Subset I consists of 40 building categories. There are 18 images for each building and the total number is 720. In the experiments, each image of subset I is firstly represented by a 512 × 2 MS-gist feature, then the feature is reshaped into a 1024-dimensional vector, and the feature dimensionality is subsequently reduced by PCA, LDA, LPP, DLA and the proposed algorithm (EFLMME).

4.1.1. Validation of the proposed MS-gist feature

In this subsection, we use the first l (l = 3,5,7,9) images of each building for training and the rest for testing. In order to validate the performance of the MS-gist feature, the nearest neighbor (NN) classifier is directly applied to the MS-gist feature space and compared with the raw data space, the HBR feature space and the VGBR feature space.
Table 1
Recognition accuracy (%) on the subset I of different image preprocessing models; the corresponding dimension is shown in parentheses.

                              3 train       5 train       7 train       9 train
Raw data space + NN           45.67 (98)    61.15 (105)   67.27 (95)    71.94 (103)
HBR feature space + NN        63.26 (96)    78.24 (103)   83.26 (96)    87.56 (101)
VGBR feature space + NN       64.17 (100)   80.81 (90)    86.72 (98)    89.97 (103)
MS-gist feature space + NN    75.33 (84)    88.36 (90)    89.18 (98)    93.33 (105)
The results are shown in Table 1. The performance of the proposed MS-gist feature is superior to those of the other models. Since the pre-filtering step and the holistic structure information of building images at different scales are introduced in the MS-gist model, the extracted MS-gist features are more robust to different viewpoints, different lighting conditions and occlusions. Thus, the proposed model obtains better performance for building recognition.
4.1.2. Validation of the different feature plus EFLMME

In this subsection, we use the first l (l = 3,5,7,9) images of each building for training and the rest for testing. To demonstrate the
benefits of the proposed feature extraction model, we compared it with the raw building data and the other models [16,17]. In the second stage, the identical algorithm, i.e. EFLMME, is applied to the features of the different models for dimensionality reduction. Table 2 lists the best performances obtained by the different feature models. As can be seen in Table 2, the proposed model performs better than the other models when the same dimensionality reduction method is used. Furthermore, when comparing Table 1 with Table 2, we find that EFLMME helps all feature models improve the building recognition rates. Therefore, not only is the MS-gist feature more effective for building recognition, but EFLMME can also further improve the classification accuracies.

4.1.3. Global validation of the proposed model

Since there are two parts in the proposed model, the MS-gist feature and the EFLMME for dimensionality reduction, in order to measure the global validation of the proposed method, different dimensionality reduction algorithms such as PCA, LDA, LPP, DLA and EFLMME are, respectively, applied to the MS-gist features and the raw data in the second stage. In this subsection, we also use the first l (l = 3,5,7,9) images of each building for training and the rest for testing. Table 3 shows the best performances obtained by the different dimensionality reduction algorithms applied to the MS-gist feature space and the raw data space, respectively. From Table 3, we can see that the MS-gist feature followed by EFLMME achieves a remarkable improvement over the other dimensionality reduction methods on the MS-gist features, and EFLMME also performs much better than the compared linear dimensionality reduction techniques on the raw data. It should be noted that when the different dimensionality reduction methods, including EFLMME, are applied to the MS-gist feature space or the raw data space in the second stage, the recognition accuracy rates can be further improved, but the improvements of EFLMME are the greatest among the compared methods. This indicates that EFLMME is superior to the other dimensionality reduction methods. The contributions of the MS-gist features and the EFLMME to the improvements vary with the number of training samples. As can be seen from Table 3, both of the two steps of the proposed method contribute to building recognition. Taking 3 training samples as an example, the MS-gist feature obtains about 30 (75.33 − 45.67 = 29.66) percent improvement and EFLMME obtains about 5 (80.50 − 75.33 = 5.17) percent further improvement based on the MS-gist features. Compared to EFLMME for dimension reduction in the second stage, the extracted MS-gist features in the first stage make the greater contribution to the total recognition accuracy.

4.1.4. Random experiment

In this subsection, for each building, l (= 3,5) images are randomly selected as training samples, and the rest of the images are used for testing. All five methods, PCA, LDA, LPP, DLA and EFLMME, are utilized for dimensionality reduction on the multi-scale gist feature space. The procedure is repeated 50 times and the average recognition rate is calculated. In general, the recognition rates vary with the dimension of the multi-scale gist feature space. Fig. 5 shows the variation of the average recognition rates of the different algorithms with different dimensions. It can be seen from the figures that EFLMME outperforms the other methods on the MS-gist feature space.
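The repeated random-split protocol used in this subsection can be sketched as below; in the actual experiments the dimensionality reduction is learned on each training split before classification, which is omitted here for brevity, and the 1-NN classifier and the function name are illustrative assumptions.

```python
import numpy as np

def random_split_accuracy(features, labels, n_train, n_repeats=50, seed=0):
    """Average 1-NN accuracy over repeated random per-class train/test splits."""
    features, labels = np.asarray(features, float), np.asarray(labels)
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_repeats):
        train_idx, test_idx = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.where(labels == c)[0])
            train_idx.extend(idx[:n_train])
            test_idx.extend(idx[n_train:])
        Xtr, ytr = features[train_idx], labels[train_idx]
        Xte, yte = features[test_idx], labels[test_idx]
        # 1-NN classification in the (projected) feature space
        d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=-1)
        pred = ytr[np.argmin(d, axis=1)]
        accs.append(np.mean(pred == yte))
    return float(np.mean(accs))
```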
4.2. Experiments on the subset II

To further evaluate the performance of the proposed model, experiments were carried out on the bigger dataset, i.e. subset II. Subset II consists of 39 building categories. There are 60 images for each building and the total number is 2340. Similarly, each image is firstly represented by a 512 × 2 MS-gist feature, and the dataset is represented by the MS-gist features. Subsequently, dimensionality reduction techniques are applied to the multi-scale gist feature manifold.

4.2.1. Validation of the proposed MS-gist feature

In this subsection, for each building, we use the first l (l = 10,20,30,40) images for training and the rest for testing. In order to validate the performance of the MS-gist features, the nearest neighbor classifier is directly applied to the MS-gist feature space and compared with the raw data space, the HBR feature space and the VGBR feature space. The results are shown in Table 4. The performance of the proposed MS-gist feature is superior to those of the other features.

4.2.2. Validation of the different feature plus EFLMME

In this subsection, for each building, we use the first l (l = 10,20,30,40) images for training and the rest for testing. To demonstrate the benefits of the proposed feature extraction model, we compared it with the raw building data and the other models in [16,17]. The identical algorithm, EFLMME, is used for dimensionality reduction. In general, the recognition rates vary with the dimensions of the features. The best performances obtained by the different models are presented in Table 5. Fig. 6 shows the recognition accuracy of the raw data model, the HBR model, the VGBR model and the proposed model over the variations of the dimensionality corresponding to the first l (l = 10,20,30,40) images for training. From Table 5 and Fig. 6, we can see that the proposed MS-gist features also outperform the other features significantly. This testifies to the effectiveness of the proposed model again.
Table 2
Recognition accuracy (%) on the subset I and the corresponding dimension (shown in parentheses).

                              3 train       5 train       7 train       9 train
Raw data model + EFLMME       53.74 (31)    68.67 (33)    75.32 (49)    80.48 (35)
HBR model + EFLMME            65.15 (81)    79.37 (77)    84.49 (97)    89.89 (95)
VGBR model + EFLMME           66.50 (87)    82.88 (101)   87.50 (101)   91.11 (99)
MS-gist model + EFLMME        80.50 (81)    91.73 (103)   93.64 (99)    97.61 (79)
Table 3
Recognition accuracy (%) on the subset I and the corresponding dimension (shown in parentheses).

          Multi-scale gist feature space                              Raw data space
          3 train      5 train      7 train      9 train              3 train      5 train      7 train      9 train
NN        75.33 (84)   88.36 (90)   89.18 (98)   93.33 (105)          45.67 (98)   61.15 (105)  67.27 (95)   71.94 (103)
PCA       76.46 (93)   89.15 (105)  91.18 (105)  94.61 (95)           49.17 (25)   65.77 (19)   69.32 (19)   76.39 (23)
LDA       74.83 (29)   89.04 (39)   90.23 (39)   92.22 (37)           30.50 (31)   50.96 (31)   60.68 (24)   65.83 (39)
LPP       70.50 (53)   82.88 (51)   87.72 (53)   88.89 (53)           29.83 (21)   44.42 (25)   49.32 (29)   61.67 (15)
DLA       78.67 (52)   91.27 (56)   92.23 (86)   95.78 (74)           42.67 (98)   55.19 (50)   59.09 (88)   62.78 (98)
EFLMME    80.50 (81)   91.73 (103)  93.64 (99)   97.61 (79)           53.74 (31)   68.67 (33)   75.32 (49)   80.48 (35)
Fig. 5. The average recognition rates (%) of the different algorithms versus the dimensions when l (= 3,5) images were randomly selected for training on the MS-gist feature space (50 independent runs). (a) 3 images randomly selected for training; (b) 5 images randomly selected for training.
Table 4
Recognition accuracy (%) on the subset II of different image preprocessing models; the corresponding dimension is shown in parentheses.

                              10 train      20 train      30 train      40 train
Raw data space + NN           54.72 (92)    57.62 (95)    60.43 (99)    63.59 (105)
HBR feature space + NN        71.43 (94)    73.33 (88)    76.23 (93)    79.19 (103)
VGBR feature space + NN       73.38 (90)    77.81 (88)    81.82 (103)   82.77 (100)
MS-gist feature space + NN    79.05 (85)    87.14 (92)    91.62 (105)   93.85 (95)
Table 5
Recognition accuracy (%) on the subset II and the corresponding dimension (shown in parentheses).

                              10 train      20 train      30 train      40 train
Raw data model + EFLMME       65.21 (41)    67.18 (19)    70.48 (45)    75.54 (47)
HBR model + EFLMME            73.54 (85)    77.33 (98)    81.65 (98)    82.13 (91)
VGBR model + EFLMME           75.69 (89)    79.68 (103)   82.90 (93)    84.51 (95)
MS-gist model + EFLMME        84.85 (43)    92.10 (47)    93.59 (53)    96.90 (78)
4.2.3. Global validation of the proposed model

In this subsection, EFLMME is applied to the MS-gist feature space and the raw data space in the second stage and compared with PCA, LDA, LPP and DLA. For each building, we also use the first l (l = 10,20,30,40) images for training and the rest for testing. The results are shown in Table 6. As can be seen in Table 6, the MS-gist feature achieves better results compared with the raw data feature. After the different dimensionality reduction methods are applied to the feature space in the second stage, the recognition accuracy rates are further increased, again.
4.2.4. Random experiment based on MS-gist features

To measure the validation of the proposed model with different random training samples, in this subsection, for each building, l (= 10,20,30,40) images are randomly selected as training samples, and the rest of the images are used for testing. For each l, we first use our MS-gist model for image feature extraction, and then PCA, LDA, LPP, DLA and EFLMME are utilized for further dimensionality reduction on the multi-scale gist features. The procedure is repeated 10 times and the average recognition rate is calculated. Fig. 7 shows the variation of the average recognition rates of the different algorithms with different dimensions. As shown in Fig. 7, the performance of EFLMME is also superior to those of PCA, LDA, LPP and DLA.
4.3. Discussions

Based on the experimental results in Sections 4.1–4.2, we can draw the following conclusions:

(1) As far as building recognition tasks are concerned, the proposed model achieves remarkable improvements when compared with the other models. This shows that the proposed model is an effective method in the practice of building recognition. The experimental results show that the proposed model can handle the building recognition problems caused by rotation, variant lighting conditions, and occlusions very well.

(2) In this paper, the proposed model is described in two stages: extraction of the MS-gist feature and dimensionality reduction using EFLMME. From Tables 1 and 4 and Fig. 6, we can see that the performances on the MS-gist feature space are remarkably better than those on the other feature spaces. From Tables 2 and 5, we can see that applying EFLMME to the different feature spaces can greatly enhance the recognition rates. This indicates that EFLMME is an effective dimensionality reduction method. As shown in Tables 3 and 6, the MS-gist feature achieves better results compared with the raw data feature under the different dimensionality reduction methods.

(3) From Tables 3 and 6 and Figs. 5 and 7, we also find that the performance of EFLMME is superior to PCA, LDA, LPP and DLA both on the MS-gist feature space and on the raw data space. The reason is that the enhanced fuzzy membership degree can efficiently handle the vagueness and ambiguity of samples degraded by poor illumination, shape and viewpoint variations. In other words, the enhanced fuzzy membership degree helps pull the near neighbor samples in the same class closer together and push the far neighbor samples on the margin between different classes farther apart. So, the fuzzy gradual graphs based on the fuzzy gradual membership degree can better characterize the compactness and separability. Thus, EFLMME performs better than the other dimensionality reduction techniques.

(4) Our experimental results show that the proposed model has great superiority over the other models. The reason may be that the proposed two-stage method can provide more separable intrinsic features of the different buildings. In the first stage, the MS-gist feature provides structural information that represents the spatial frequencies spread everywhere in the image and thus information about the orientation, smoothness, length and width of the contours that compose the building images. Therefore, the MS-gist feature effectively captures the building representation,
which is invariant to rotations, variant lighting conditions, and occlusions. In the second stage, the performance of EFLMME outperforms those of PCA, LDA, LPP and DLA. This indicates that EFLMME is an effective dimensionality reduction method, which performs dimensionality reduction on the proposed MS-gist feature and preserves more discriminative information for building recognition. Therefore, after EFLMME is applied to the MS-gist feature space, the recognition rate of the proposed model can be further increased.

Fig. 6. (a)–(d) show the recognition accuracy versus dimension of the raw data model, HBR model, VGBR model and the proposed model on subset II corresponding to the first l (l = 10,20,30,40) images for training. (a) The first 10 images for training; (b) the first 20 images for training; (c) the first 30 images for training; (d) the first 40 images for training.

Table 6
Recognition accuracy (%) on the subset II and the corresponding dimension (shown in parentheses) based on MS-gist model features.

          Multi-scale gist feature space                               Raw data space
          10 train     20 train     30 train      40 train             10 train     20 train     30 train     40 train
NN        79.05 (85)   87.14 (92)   91.62 (105)   93.85 (95)           54.72 (92)   57.62 (95)   60.43 (99)   63.59 (105)
PCA       79.15 (71)   87.95 (83)   89.37 (103)   91.85 (89)           60.00 (39)   63.07 (23)   66.49 (27)   70.38 (39)
LDA       79.72 (35)   86.27 (35)   88.09 (31)    90.12 (21)           55.84 (33)   67.05 (23)   68.54 (29)   72.31 (29)
LPP       73.72 (19)   81.28 (18)   86.96 (19)    88.28 (16)           54.15 (37)   61.79 (41)   69.48 (27)   73.33 (29)
DLA       82.00 (50)   88.97 (50)   92.56 (30)    93.08 (52)           58.18 (48)   63.87 (50)   69.94 (50)   73.38 (52)
EFLMME    84.85 (43)   92.10 (47)   93.59 (53)    96.90 (78)           65.21 (41)   67.18 (19)   70.48 (45)   75.54 (47)

5. Conclusions

In this paper, we have described a MS-gist feature for building recognition and proposed a new approach called
EFLMME to further improve the building recognition rate. The MS-gist features can stably capture the representative features of building images with rotation, variant lighting conditions and occlusions. In EFLMME, the enhanced fuzzy membership degree helps pull the near neighbor samples in the same class closer together and push the far neighbor samples on the margin between different classes farther apart. Therefore, the fuzzy gradual graphs, which are based on the fuzzy gradual membership degree, can better characterize the compactness and separability. Experiments were carried out on the Sheffield buildings database and compared with the VGBR model and the HBR model. At the same time, EFLMME is evaluated on the Sheffield buildings database and compared with PCA, LDA, LPP and DLA. The results show that the proposed model is an effective method for building recognition and can handle the
Fig. 7. The average recognition rates (%) of the different algorithms versus the dimensions when l (= 10,20,30,40) images were randomly selected for training on the multi-scale gist feature space (10 independent runs). (a) 10 images randomly selected for training; (b) 20 images randomly selected for training; (c) 30 images randomly selected for training; (d) 40 images randomly selected for training.
building recognition problems caused by rotations, variant lighting conditions and occlusions very well. However, in this paper we mainly focus on the building recognition accuracy rate rather than on truly semantic concept extraction. So the MS-gist feature may neglect to discover the truly semantic concepts of the image, which are probably more important in the long run. Moreover, EFLMME is a supervised learning method for building recognition, so label information must be provided, which may be laborious. Therefore, in the future, we will focus both on truly semantic feature extraction and on unsupervised learning algorithms with good performance on building recognition.

Acknowledgments

This work is partially supported by the Fujian Provincial Department of Science and Technology of China under Grant nos. JK2010046 and JB10135. It is also partially supported by the National Science Foundation of China under Grant nos. 60472061, 60632050 and 90820004 and the Hi-Tech Research and Development Program of China under Grant no. 2006AA04Z238.

References

[1] G. Dorko, C. Schmid, Selection of scale-invariant parts for object class recognition, in: Proceedings of IEEE International Conference on Computer Vision, vol. 1, 2003, pp. 634–639.
[2] R. Fergus, P. Perona, A. Zisserman, Object class recognition by unsupervised scale-invariant learning, in: Proceedings of IEEE Conference on Computer Vision Pattern Recognition, vol. 2, 2003, pp. 264–271. [3] V. Ferraril, T. Tuytelaars, L.V. Gool, Simultaneous object recognition and segmentation by image exploration, Lect. Notes Comput. Sci. 4170 (2006) 147–151. [4] G. Friz, C. Seifert, L. Paletta, Urban object recognition from informative local feature, in: Proceedings of IEEE International Conference on Robotics and Automation, 2005, pp. 131–137. [5] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision 60 (2) (2004) 91–110. [6] M. Varma, A. Zisserman, A statistical approach to texture classification from single images, Int. J. Comput. Vision 62 (1–2) (2005) 61–81. [7] M.M. Ullah, A. Pronobis, B. Caputo, J. Luo, R. Jensfelt, H.I. Christensen, Towards robust place recognition for robot localization, in: Proceedings of IEEE International Conference on Robotics and Automation, 2008, pp. 530–537. [8] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, Advances in Neural Information Processing Systems 14, Vancouver, British Columbia, Canada, 2002. [9] R. Hutchings, W. Mayol-Cuevas, Building Recognition for mobile Devices: Incorporating Positional Information with Visual Features, CSTR-06-017, Computer Science, University of Bristol, 2005. [10] C. Harris, M. Stephens, A combined corner and edge detector, in: Proceedings of Alvey Vision Conference, 1988, pp. 147–151. [11] Q. Iqbal, J.K. Aggarwall, Applying perceptual grouping to content-based image retrieval: building images, in: Proceedings of IEEE Conference Computer Vision Pattern Recognition, vol. 1, 1999, pp. 42–48. [12] A. Stassopoulou, Building detection using Bayesian networks, Int. J. Pattern Recognition Artif. Intell. 83 (5) (2000) 705–740. [13] Y. Li, L.G. Shapiro, Consistent line clusters for building recognition in CBIR, in: Proceedings of IEEE International Conference on Pattern Recognition, vol. 3, 2002, pp. 952–956. [14] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision 60 (2) (2004) 169–191.
[15] G. Fritz, C. Seifert, M. Kumar, L. Paletta, Building detection from mobile imagery using informative SIFT descriptors, in: Proceedings of SCIA, 2005, pp. 629–638. [16] W. Zhang, J. Kosecka, Hierarchical building recognition, Int. J. Image Vision Comput. 25 (2007) 704–716. [17] J. Li, Nigel M. Allinson, Subspace learning-based dimensionality reduction in building recognition, Int. J. Neurocomput. 73 (1–3) (2009) 324–330. [18] A. Treisman, G. Gelade, A feature-integration theory of attention, Cognitive Psychol. 12 (1980) 97–137. [19] A. Oliva, P.G. Schyns, Coarse blobs or fine edges? Evidence that information diagnosticity changes the perception of complex visual stimuli, Cognitive Psychol. 34 (1997) 72–107. [20] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vision 42 (3) (2001) 145–175. [21] A. Oliva, Gist of scene, in: L. Itti, G. Rees, J.K. Tsotsos (Eds.), Neurobiol. Attention, Elsevier, San Diego, CA, 2005, pp. 251–256. [22] A. Torralba, A. Oliva, Depth estimation from image structure, IEEE Trans. Pattern Anal. Mach. Intell. 24 (9) (2002) 1226–1238. [23] A. Torralba, Contextual priming for object detection, Int. J. Comput. Vision 53 (2) (2003) 169–191. [24] S.C. Chong, A. Treisman, Representation of statistical properties, Elsevier Int. J. Vision Res. 43 (2003) 393–404. [25] D. Marr, E.C. Hildreth, Theory of edge detection, Proc. R. Soc. London Ser. B, Biol. Sci. 207 (1167) (1980) 187–217. [26] A. Shashua, S. Ullman, Structural saliency: the detection of globally salient structures using a locally connected network, in: Proceedings of IEEE International Conference on Computer Vision, 1988, pp. 321–327. [27] Y. Sugase, S. Ueno, K. Kawano, Global and fine information coded by single neurons in the temporal visual cortex, Nature 400 (1999) 869–873. [28] A. Oliva, P.G. Schyns, Coarse blobs or fine edges? Evidence that information diagnosticity changes the perception of complex visual stimuli, Cognitive Psychol. 34 (1997) 72–107. [29] A.K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. 22 (1) (2000) 4–37. [30] P. Belhumeur, J. Hespanha, D. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720. [31] G.J. Mclachlan, Discriminant Analysis and Statistical Pattern Recognition, Wiley-Interscience, New York, 1992. [32] I.T. Jolliffe, Principal Component Analysis, second ed., Springer, 2002. [33] C.J. Mclachlan, Discriminant Analysis and Statistical Pattern Recognition, Wiley-Interscience, New York, 1992. [34] X. He, P. Niyogi, Locality Preserving Projections, in: Proceedings of Conference on Advances in Neural Information Processing Systems, 2003. [35] M. Belkin, Laplacian Eigenmaps for dimensionality reduction and data representation, MII Press J. Neural Comput. 15 (6) (2003) 1373–1396. [36] M. Wan, Z. Lai, J. Shao, Z. Jin, Two-dimensional local graph embedding discriminant analysis (2DLGEDA) with its application to face and palm biometrics, Neurocomputing 73 (2009) 193–203. [37] J.M. Keller, M.R. Gray, J.R. Givens, A fuzzy k-nearest neighbor algorithm, IEEE Trans. Syst. Man Cybernet. 15 (4) (1985) 580–585. [38] K.C. Kwak, W. Pedrycz, Face recognition using a fuzzy fisherface classifier, Pattern Recognition 38 (2005) 1717–1732. [39] D.J. Song, D.C. Tao, Biologically inspired feature manifold for scene classification, IEEE Trans. Image Process. 
19 (1) (2010) 174–184. [40] D.C. Tao, X.L. Li, X.D. Wu, S.J. Maybank, General Averaged Divergences Analysis, in: Proceedings of IEEE International Conference on Data Mining, 2007. [41] D.C. Tao, X.L. Li, X.D. Wu, S.J. Maybank, Geometric mean for subspace selection, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 260–274. [42] W. Bian, D.C. Tao, Harmonic Mean for Subspace Selection, in: Proceedings of 19th IEEE International Conference on ICPR, 2008.
[43] T.H. Zhang, D.C. Tao, X.L. Li, J. Yang, Patch alignment for dimensionality reduction, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1299–1313. [44] T.Y. Zhou, D.C. Tao, X.D. Wu, Manifold elastic net: a unified framework for sparse dimension reduction, Data Mining and Knowledge Discovery (DMKD; Springer) 22 (3) (2010) 340–371. [45] Y. Jian, D. Zhang, J.Y. Yang, Globally maximizing, locally minimizing: unsupervised discriminant projection with applications to face and palm biometrics, IEEE Trans. Pattern Anal. Mach. Intell. 29 (4) (2007) 650–664. [46] C. Ackerman, L. Itti, Robot steering with spectral image information, IEEE Trans. Robot. 21 (2) (2005) 247–251. [47] H. Zhou, T. Hastie, Regularization and variable selection via elastic net, J. R. Statist. Soc. B 67 (Part 2) (2005) 301–320. [48] ⟨http://eeepro.shef.ac.uk/building/dataset.rar⟩.
Cairong Zhao received his B.S. degree in Electronics and Information Scientific Technology from Jilin University, China, in 2003 and M.S. degree in Electronic Circuit and System from Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, in 2006, respectively. He is currently pursuing his Ph.D. degree in Computer Science from Nanjing University of Science and Technology. His research interests include face recognition, building recognition and vision attention.
Chuancai Liu is a full professor in the school of computer science and technology of Nanjing University of Science and Technology, China. He obtained his Ph.D. degree from the China Ship Research and Development Academy in 1997. His research interests include AI, pattern recognition and computer vision. He has published about 50 papers in international/ national journals.
Zhihui Lai received the B.S. degree in Mathematics from South China Normal University and M.S. degree from Ji’nan University, China, in 2002 and 2007, respectively. He is currently pursuing the Ph.D. degree in the School of Computer Science from Nanjing University of Science and Technology (NUST). His research interests include face recognition, image processing and content-based image retrieval, pattern recognition, compressive sense, human vision modelization and applications in the fields of intelligent robot research.