Multi-Semantic Scene Classification Based on Region of Interest

Junming Shao, Dongjian He
College of Information Engineering, Northwest A&F University, P.R. China, 712100
{xinyuanwo,hdj87091197}@yahoo.com.cn

Qinli Yang
College of Resources and Environment, Northwest A&F University, P.R. China, 712100
[email protected]
Abstract

Automatic semantic scene classification is a challenging research topic in computer vision and a promising solution to scene understanding and image semantic retrieval. In this paper, novel techniques are proposed to implement multi-semantic scene classification. We first extract regions of interest (ROIs) from each image based on an image-driven, bottom-up visual attention model, and then propose two multi-instance multi-label learning algorithms, EMDD-SVM and EMDD-KNN, in which images are viewed as bags, each of which contains a number of instances corresponding to regions of interest and belongs to multiple categories simultaneously. Experimental results show that our ROI extraction algorithm obtains different kinds of objects of interest effectively under various complex clutter and is highly tolerant to noise, and that EMDD-SVM and EMDD-KNN achieve good performance on multi-semantic scene classification by integrating multi-instance learning and multi-label learning.

1. Introduction

Designing computer programs to automatically categorize images into semantic classes using low-level features is a challenging research topic in computer vision [3]. Conventional approaches mostly address the problem within the supervised learning framework, where each image is represented as a single instance and associated with one semantic class label. In real-world problems, however, an image usually contains many regions, each corresponding to an instance, and belongs to multiple semantic classes simultaneously. A similar situation exists in text categorization, where a document usually includes multiple sections, each of which can be represented as an instance, and may belong to multiple genres, such as government and health, or rock and blues. Further examples include web page categorization, gene functional analysis, and medical diagnosis. In all these cases, an object is associated with multiple instances and can bear multiple class labels simultaneously, so the supervised learning framework cannot fit these real-world problems well. In order to handle such problems, in this paper we propose new techniques and apply them to multi-semantic scene classification. The main contributions are as follows:

1. We propose a novel approach to extract ROIs from each image, motivated by the biological visual mechanism and based on a bottom-up visual attention model.
2. We present two multi-instance multi-label learning algorithms, EMDD-SVM and EMDD-KNN, which integrate multi-instance learning and multi-label learning to solve the multi-semantic scene classification problem.

The remainder of the paper is organized as follows. Section 2 discusses related work on scene classification. Section 3 describes a new approach to ROI extraction and image representation. Section 4 presents two novel multi-instance multi-label algorithms, EMDD-SVM and EMDD-KNN, which implement multi-semantic scene classification based on ROIs. Section 5 reports the experiments we have performed and the relevant results. Finally, we conclude in Section 6.
2. Related work

Much research on representing and classifying images has been carried out over the last several decades; a detailed review of these approaches can be found in [1]. These conventional approaches mostly focus on single-label (two-class and multi-class) problems, where an image is associated with only one semantic label. To date, however, only a few studies have addressed multi-semantic scene classification. Boutell et al. [2] first extended SVM to multi-label learning to handle the multi-semantic scene classification problem by cross training, where a natural scene is described by multiple class labels simultaneously. Zhang and Zhou [10] proposed a novel multi-label learning algorithm, MLKNN, to deal with multi-label problems and applied it to yeast gene functional analysis, natural scene classification, and automatic web page categorization. The basic idea of both approaches is to employ the multi-label learning framework to deal with such problems. In the real world, however, an image may contain many objects and belong to multiple semantic classes simultaneously. To fit real-world problems better, Zhou and Zhang [12] proposed a new learning framework, multi-instance multi-label (MIML) learning, which deals with such problems by using multi-instance learning or multi-label learning as a bridge. In this paper, we propose two novel MIML algorithms from another angle, by integrating multi-instance learning and multi-label learning based on regions of interest, which are extracted by imitating the biological visual attention mechanism.
3. ROI extraction and representation

Inspired by biological visual systems, in this section we describe a method for extracting regions of interest from complex natural images based on an image-driven, bottom-up visual attention model, and then extract color and texture features to represent each region of interest.

3.1. Regions of interest extraction

We first adopt the classic Itti algorithm [6] to extract candidate foci of attention (FOA) using a bottom-up attention model. Owing to the disturbance of noise and complex clutter, the FOA obtained by the algorithm may be isolated points. We therefore propose two simple principles (whole effect and center preference) to adjust the FOA, and finally obtain the regions of interest based on the FOA and the conspicuity maps.

3.1.1. Focus of attention. We obtain the focus of attention using Itti's algorithm [6]: the FOA is defined as the maximum salient point of the saliency map, located by a winner-take-all (WTA) network. First, the input image is decomposed into Gaussian pyramids and early visual features such as intensity, color, and orientation are extracted at each scale. Second, the feature maps, obtained by center-surround difference operations and normalization, are combined into three conspicuity maps (FM_I, FM_C, FM_O) by across-scale addition. Finally, the conspicuity maps are normalized and summed into the saliency map SM. A sketch of this pipeline is given below.
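To make the pipeline concrete, the following is a minimal sketch of the intensity channel of this model in Python using only NumPy. The full model also builds color and orientation channels and combines several center-surround scale pairs; all function names and parameters here are illustrative, not the authors' implementation.

```python
# Minimal sketch of the saliency-map pipeline (intensity channel only).
import numpy as np

def gaussian_pyramid(img, levels=6):
    """Build a Gaussian pyramid by separable blurring and downsampling by 2."""
    pyr = [img]
    kernel = np.array([1., 4., 6., 4., 1.]) / 16.0
    for _ in range(levels - 1):
        blurred = pyr[-1]
        for axis in (0, 1):  # 1-D blur along rows, then along columns
            blurred = np.apply_along_axis(
                lambda r: np.convolve(r, kernel, mode="same"), axis, blurred)
        pyr.append(blurred[::2, ::2])
    return pyr

def center_surround(pyr, center=2, surround=4):
    """Center-surround difference: fine scale minus upsampled coarse scale."""
    c, s = pyr[center], pyr[surround]
    # nearest-neighbor upsampling of the surround map to the center's size
    ys = (np.arange(c.shape[0]) * s.shape[0] // c.shape[0]).clip(0, s.shape[0] - 1)
    xs = (np.arange(c.shape[1]) * s.shape[1] // c.shape[1]).clip(0, s.shape[1] - 1)
    return np.abs(c - s[np.ix_(ys, xs)])

def saliency_map(gray):
    """One intensity feature map, normalized; stands in for the full SM."""
    fm = center_surround(gaussian_pyramid(gray))
    return fm / (fm.max() + 1e-9)

def foa(sm):
    """Winner-take-all: the FOA is the maximum point of the saliency map."""
    return np.unravel_index(np.argmax(sm), sm.shape)
```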
3.1.2. Adjusting the FOA. As a result of the various complex backgrounds and noise in natural images, there is a high probability that an FOA obtained by Itti's algorithm is an isolated point. In order to extract potential primary objects effectively in complex scenes, we propose two simple principles to restrict the focus of attention: whole effect and center preference. The whole effect means that if an object pops out of the background, the positions of the whole object should be salient on the saliency map. The center preference reflects the fact that people tend to pay more attention to the central region of images [7]. These two principles are used to filter the FOA.

Whole effect: let avg be the mean salience value in the n × n neighborhood of the FOA in the saliency map, Value_FOA be the salience value of the FOA, and ξ be a constant with 0 < ξ < 1/5. If avg < ξ × Value_FOA, discard the FOA.

Center preference: let Width and Height be the image width and height respectively, d be the shortest distance between the FOA and the image border, and α be a constant with 0 < α < 1/20. If d < α × (Width + Height), discard the FOA. A sketch of these two filters follows.
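The two filters translate directly into code. This sketch assumes the saliency map `sm` from above; the neighborhood size n and the exact values of ξ and α are free parameters within the stated bounds, chosen here for illustration.

```python
# Hedged sketch of the two FOA filters (whole effect and center preference).
import numpy as np

def passes_whole_effect(sm, foa, n=9, xi=0.15):   # 0 < xi < 1/5
    """Keep the FOA only if its n x n neighborhood is salient on average."""
    y, x = foa
    h = n // 2
    patch = sm[max(0, y - h):y + h + 1, max(0, x - h):x + h + 1]
    return patch.mean() >= xi * sm[y, x]

def passes_center_preference(shape, foa, alpha=0.04):  # 0 < alpha < 1/20
    """Keep the FOA only if it is not too close to the image border."""
    height, width = shape
    y, x = foa
    d = min(y, x, height - 1 - y, width - 1 - x)  # shortest border distance
    return d >= alpha * (width + height)
```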
3.1.3. Shifting the attention. Attention shifting is inspired by the inhibition-of-return (IOR) mechanism, which has been widely observed in human psychophysical experiments. The traditional implementation of IOR simply inhibits the single neuron in the saliency map at the currently attended location. In fact, however, biological IOR has been shown to be object-bound [9]: when attention shifts, the inhibition tracks and follows objects. It is therefore feasible to obtain a new target of interest by shifting to a different, distant salient location. Exploiting two intrinsic characteristics of objects (spatial proximity and similarity preference), we extract each region of interest from the focus of attention and the conspicuity maps by region growing, and then implement the attention shift by inhibiting the neurons within the object-bound region, obtaining the regions of interest serially. A sketch of this region-growing step is given below.
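The following is an illustrative sketch only, under our own assumptions: the object-bound region is grown from the FOA on a conspicuity map using a similarity threshold τ (our parameter, not specified in the paper), and IOR then suppresses the whole region rather than a single neuron, so that the next WTA winner belongs to a different object.

```python
# Region growing from the FOA (spatial proximity + similarity preference),
# followed by object-bound inhibition of return.
import numpy as np
from collections import deque

def grow_region(cmap, foa, tau=0.5):
    """4-connected region growing: accept neighbors whose conspicuity is
    at least tau times the FOA's value (similarity preference)."""
    h, w = cmap.shape
    seed_val = cmap[foa]
    mask = np.zeros((h, w), dtype=bool)
    queue = deque([foa])
    mask[foa] = True
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx] \
                    and cmap[ny, nx] >= tau * seed_val:
                mask[ny, nx] = True
                queue.append((ny, nx))
    return mask

def inhibit_of_return(sm, mask):
    """Object-bound IOR: suppress the whole attended region, not one neuron."""
    sm = sm.copy()
    sm[mask] = 0.0
    return sm
```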
3.2. ROI representation

Once regions of interest are identified, each region is characterized by a feature set. Considering the irregular shapes of these regions, color and texture features are extracted from each region. To obtain the color feature, we first convert the image into the well-known HSI color space, and then compute the first color moment (μ) and second color moment (σ) on each channel to represent the color feature F_C of each region of interest.

As the Gabor filter function simulates simple and complex cells in the visual cortex as well as non-classical receptive field inhibition (surround suppression), it performs well on contour detection, image segmentation, and texture feature extraction [4][5]. We therefore base the texture representation on Gabor filters. Each region of interest is first converted into a gray image f(x, y), from which we compute the Gabor energy maps, which model complex cells in the visual cortex. We compute the mean (T^μ_θ) and standard deviation (T^σ_θ) for each orientation and scale of the Gabor energy maps, which yields a 128-dimensional (16 orientations × 4 scales × 2 statistics) texture feature set. In addition, since the "MAX-like" operation explains aspects of higher-level object recognition in the visual cortex and represents a sensible way of pooling responses to achieve feature invariance [8], we divide the Gabor energy map at each scale and orientation into 16 sub-blocks and, for each sub-block position, take the max over all orientations and scales, obtaining another 16-dimensional texture feature T^max_i, i = 1, 2, ..., 16. The texture feature is thus F_T = (T^μ_θ, T^σ_θ, T^max_i). Finally, Gaussian normalization is applied to all features. A sketch of this feature set follows.
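A hedged sketch of the ROI feature set under stated assumptions: the Gabor filter bank itself is abstracted behind a hypothetical `gabor_energy` helper (not shown), the 16-orientation × 4-scale layout is inferred from the stated 16 × 4 × 2 = 128 dimensions, and we read the 16 MAX-pooled values as the 16 sub-block positions.

```python
# Sketch of F_C (HSI color moments) and F_T (Gabor energy statistics plus
# MAX-like pooling). gabor_energy(gray, scale, orient) is an assumed helper
# returning one energy map; it is not part of the paper's text.
import numpy as np

def color_moments(hsi_roi, mask):
    """F_C: mean (mu) and std (sigma) of H, S, I inside the ROI mask."""
    feats = []
    for ch in range(3):
        vals = hsi_roi[..., ch][mask]
        feats.extend([vals.mean(), vals.std()])
    return np.array(feats)                      # 6-dimensional color feature

def gabor_texture(gray_roi, gabor_energy, scales=4, orients=16):
    """F_T: per (scale, orientation) mean/std of the Gabor energy map
    (16 x 4 x 2 = 128 dims), plus a 16-dim MAX-pooled feature: for each of
    the 4x4 = 16 sub-block positions, the max over orientations and scales."""
    stats = []
    pooled = np.zeros(16)
    for o in range(orients):
        for s in range(scales):
            e = gabor_energy(gray_roi, s, o)    # assumed helper: energy map
            stats.extend([e.mean(), e.std()])
            blocks = [b for row in np.array_split(e, 4, axis=0)
                        for b in np.array_split(row, 4, axis=1)]
            pooled = np.maximum(pooled, np.array([b.max() for b in blocks]))
    return np.concatenate([np.array(stats), pooled])   # 128 + 16 dimensions
```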
4. Multi-instance multi-label learning

Having extracted and represented the ROIs, in this section we propose two new multi-instance multi-label (MIML) learning algorithms to realize scene classification. The basic idea of our MIML algorithms is to integrate multi-instance learning and multi-label learning. First, the training data set is rearranged to transform the problem into a multi-instance problem, and the EM-DD algorithm [11] is adopted to obtain instance prototypes, each of which represents one class of instances. Second, we compute the minimum weighted Hausdorff distance between bags and the instance prototypes of each class to map every bag to a point in a new feature space, thereby transforming the problem into a multi-label problem. Finally, we use the multi-label algorithms MLSVM [2] and MLKNN [10] as learners to achieve the multi-semantic classification.

4.1. EMDD-SVM and EMDD-KNN

First, we rearrange the training data set by cross training. For each class label, we traverse all bags in the training set: if a bag is associated with the current class label, it is regarded as a positive bag for that label; otherwise, it is regarded as a negative bag. In this way, every bag is associated with exactly one class label, which means the multi-instance multi-label problem is temporarily transformed into the classical multi-instance learning problem.

After building the new training set, we use EM-DD to obtain the instance prototypes that represent each class of instances. In EM-DD, all instances of positive bags are regarded as starting points in feature space, and from each point we search for a maximum of the diverse density (DD) function, since points of maximum DD represent the semantic notion of a class of instances well. Once the points with large DD values are obtained, the instance prototypes are selected among them by a threshold λ, set to the mean of these DD values. In this way, we obtain the set of instance prototypes H_l for each class. After the instance prototypes are determined, we compute the distance between each bag and the instance prototypes of each class, defined as the minimum weighted Hausdorff distance, to map every bag to a point in a new feature space.

Algorithm 1: The EMDD-SVM algorithm
1: Rearrange the data set D^MIML = {(X_i, L_i)}, i = 1, 2, ..., n, by cross training to obtain the new data set D^MIL = {(x_u, l_u)}, u = 1, 2, ..., Σ_{i=1}^{n} |L_i|, temporarily transforming the MIML problem into an MIL problem.
2: for each class label l ∈ L do
3:   Let B be the set of instances from all positive bags in D^MIL and initialize H_l = ∅.
4:   for every instance in B as starting point p_i do
5:     Initialize the weight s_i = 1.
6:     Find the maximum DD value point (p, s) using the diverse density function.
7:     H_l = H_l ∪ {(p, s)}.
8:   end for
9:   Set λ = mean_{(p,s) ∈ H_l} DD(p, s).
10:  Traverse all instance prototypes; if DD(p_i, s_i) < λ, set H_l = H_l − {(p_i, s_i)}.
11:  Compute the minimum weighted Hausdorff distance d between bags and the instance prototypes H_l of each class to map every bag to a point in a new feature space, transforming D_l^MIL into the multi-label data set D_l^MLL = {(d_i, L_i)}, i = 1, 2, ..., n.
12:  Learn the classifier f_l(d) = TrainMLSVM(D_l^MLL).
13: end for
14: return L* = {l | f_l(d*) ≥ 0} ∪ {l | arg max_l f_l(d*)}
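The bag-to-point mapping of Step 11 can be sketched as follows. We read the minimum weighted Hausdorff distance as the smallest weighted Euclidean distance between any instance of a bag and any prototype (p, s); this reading of the definition above is our assumption, not a quotation of the authors' code.

```python
# Sketch of mapping bags to points via the minimum weighted Hausdorff
# distance to each class's instance prototypes.
import numpy as np

def min_weighted_hausdorff(bag, prototypes):
    """bag: (m, d) array of instances; prototypes: list of (p, s) pairs,
    where p is a prototype point and s its feature weights from EM-DD."""
    return min(
        np.sqrt((((inst - p) * s) ** 2).sum())
        for inst in bag
        for (p, s) in prototypes)

def map_bags(bags, prototypes_per_class):
    """Map every bag to one distance coordinate per class label, producing
    the multi-label data set D^MLL consumed by MLSVM / MLKNN."""
    return np.array([[min_weighted_hausdorff(bag, H_l)
                      for H_l in prototypes_per_class]
                     for bag in bags])
```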
Therefore, the original multi-instance multi-label data set D^MIML is transformed into the multi-label data set D^MLL = {(d_i, L_i)}, i = 1, 2, ..., n. Finally, we use MLSVM [2] to learn the multi-label problem. In making predictions, the T-criterion [2] is used, which corresponds to Step 14 of the EMDD-SVM algorithm: a test instance is labeled with all class labels that receive positive SVM scores, and with the top-scoring class label when all SVM scores are negative. The complete procedure is described in Algorithm 1.
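A compact sketch of this decision rule; `scores` maps each label to its SVM decision value.

```python
# T-criterion [2]: keep all labels with non-negative SVM scores, or fall
# back to the single top-scoring label when every score is negative.
def t_criterion(scores):
    positive = [l for l, f in scores.items() if f >= 0]
    if positive:
        return positive
    return [max(scores, key=scores.get)]  # top score when all are negative
```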
We also propose EMDD-KNN to deal with the multi-instance multi-label problem. As in EMDD-SVM, we first transform the MIML problem into a multi-instance problem and obtain the instance prototypes, so that the problem is finally transformed into a multi-label problem; this corresponds to Steps 1 to 11 of Algorithm 1. The difference between EMDD-SVM and EMDD-KNN is that the latter uses MLKNN [10] to learn the resulting multi-label problem. The procedure is presented in Algorithm 2.

Algorithm 2: The EMDD-KNN algorithm
1: Perform Steps 1 to 11 of Algorithm 1.
2: for each class label l ∈ L do
3:   Compute the prior probabilities P(T_b^l), b ∈ {−1, 1}, where T_1^l and T_−1^l denote the events that an instance is or is not associated with label l, respectively.
4:   For each training instance x, count the number N_x(l) of neighbors of x belonging to the l-th class.
5:   Estimate the posterior probabilities P(E_{N_x(l)}^l | T_b^l).
6:   For each test instance y, return the output y_t with the maximum estimated probability: y_t = arg max_{b ∈ {−1,1}} P(T_b^l) P(E_{N_y(l)}^l | T_b^l).
7: end for
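For a single label l, the MLKNN step used here can be sketched as below, following the MAP rule of Steps 3 to 6 with Laplace smoothing as in [10]; the variable names and the smoothing constant are ours.

```python
# Hedged sketch of MLKNN [10] for one label l on the mapped feature points.
import numpy as np

def mlknn_predict_label(train_x, train_y_l, test_x, k=10, smooth=1.0):
    """train_x: (n, d) mapped points; train_y_l: 0/1 membership in label l.
    Returns the MAP decision for one test point."""
    n = len(train_x)
    # Priors P(T_1^l) and P(T_-1^l), with Laplace smoothing
    prior1 = (smooth + train_y_l.sum()) / (2 * smooth + n)
    prior0 = 1.0 - prior1
    # Frequency tables over j = 0..k: how often an instance with/without l
    # has exactly j of its k nearest neighbors carrying l.
    c1 = np.zeros(k + 1)
    c0 = np.zeros(k + 1)
    for i in range(n):
        d = np.linalg.norm(train_x - train_x[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]          # skip the point itself
        j = int(train_y_l[nbrs].sum())
        if train_y_l[i]:
            c1[j] += 1
        else:
            c0[j] += 1
    # Posteriors P(E_j^l | T_b^l), again Laplace-smoothed
    post1 = (smooth + c1) / (smooth * (k + 1) + c1.sum())
    post0 = (smooth + c0) / (smooth * (k + 1) + c0.sum())
    # Neighbor count of the test point and the MAP decision
    d = np.linalg.norm(train_x - test_x, axis=1)
    j = int(train_y_l[np.argsort(d)[:k]].sum())
    return prior1 * post1[j] >= prior0 * post0[j]
```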
5. Experimental results and analysis

The data set consists of 1163 natural scene images in five distinct categories: building, trees, mountains, sunset, and beach. Over 26% of the images belong to multiple semantic classes. Some images come from the COREL image collection, while others were collected directly from the Internet. To facilitate viewing and processing, some images were resized. The experiments were performed on an Intel Pentium IV 2.4 GHz PC with 512 MB of memory under Matlab 7.0.

5.1. Evaluation of ROI extraction

In this experiment, scene images from all five categories were tested; part of the ROI extraction results are shown in Figure 1. The number of regions of interest is determined automatically according to the complexity of the image, and is restricted by the program to at most seven regions per image to fit real-world problems. Furthermore, to evaluate the influence of noise on our ROI extraction approach, we added Gaussian noise and salt & pepper noise to the images; part of these results are shown in Figure 2. The results in Figures 1 and 2 indicate that our approach extracts regions of interest well: it extracts potential generic objects effectively under different clutter and shows strong robustness to different kinds of noise.
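The noise corruption used in this robustness test can be sketched as follows; the noise levels (sigma, amount) are our assumptions, as the paper does not state its exact settings.

```python
# Sketch of corrupting an image (scaled to [0, 1]) with Gaussian or
# salt & pepper noise before re-running ROI extraction.
import numpy as np

def add_gaussian_noise(img, sigma=0.05, rng=np.random.default_rng(0)):
    noisy = img + rng.normal(0.0, sigma, img.shape)
    return noisy.clip(0.0, 1.0)

def add_salt_pepper(img, amount=0.05, rng=np.random.default_rng(0)):
    noisy = img.copy()
    m = rng.random(img.shape[:2])     # per-pixel mask, broadcast over channels
    noisy[m < amount / 2] = 0.0       # pepper
    noisy[m > 1 - amount / 2] = 1.0   # salt
    return noisy
```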
5.2. Categorization results

After obtaining the feature sets of the ROIs, half of the sample data were selected randomly as the training set, with the rest serving as the test set. Table 1 gives the results of the EMDD-SVM algorithm; it is evident that EMDD-SVM has excellent classification performance on this multi-instance multi-label task, with the linear kernel performing best. Table 2 gives the results of the EMDD-KNN algorithm, where the number of nearest neighbors N is varied from 5 to 14. Tables 1 and 2 show that both MIML algorithms achieve good performance on multi-semantic scene classification; EMDD-SVM with a linear kernel achieves the best categorization results (hamming loss 0.173 ± .002, ranking loss 0.163 ± .003, one-error 0.309 ± .005, coverage 0.896 ± .010, average precision 0.802 ± .002). EMDD-KNN, in turn, performs best on most evaluation metrics when the number of nearest neighbors is largest (N = 14).

We also compare our algorithms with the MIMLBoost and MIMLSVM algorithms recently proposed by Zhou and Zhang [12]. The basic idea of their algorithms is to handle the multi-instance multi-label problem using multi-instance learning or multi-label learning as a bridge. In MIMLBoost, the MIML problem is first transformed into a multi-instance problem and then further into a traditional supervised learning task; similarly, in MIMLSVM, the MIML problem is first transformed into a multi-label problem by a clustering algorithm and then into a traditional supervised learning task. The best experimental results of their algorithms are shown in Tables 3 and 4. The main difference from our approach is that we integrate multi-instance learning and multi-label learning simultaneously. Comparing Tables 1 through 4, EMDD-SVM clearly achieves better scene classification performance than MIMLBoost and MIMLSVM on all evaluation metrics (from hamming loss to average precision): the worst performance of EMDD-SVM (RBF kernel) exceeds the best performance of MIMLBoost (boosting rounds = 20), and EMDD-SVM outperforms MIMLSVM by 3% in terms of average precision.
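For reference, the evaluation metrics reported in Tables 1 through 4 follow their standard multi-label definitions, sketched below (our implementation, not the authors' code; average precision is omitted for brevity). Y is the 0/1 ground-truth label matrix, F the real-valued score matrix, and P the 0/1 prediction matrix, each of shape (n samples, q labels).

```python
# Standard multi-label evaluation metrics, sketched with NumPy.
import numpy as np

def hamming_loss(Y, P):
    return (Y != P).mean()

def one_error(Y, F):
    """Fraction of samples whose top-ranked label is not relevant."""
    top = F.argmax(axis=1)
    return 1.0 - Y[np.arange(len(Y)), top].mean()

def coverage(Y, F):
    """Mean depth (0-based rank) needed to cover all relevant labels."""
    ranks = (-F).argsort(axis=1).argsort(axis=1)   # rank of each label
    return np.mean([ranks[i][Y[i] == 1].max() for i in range(len(Y))])

def ranking_loss(Y, F):
    """Mean fraction of (relevant, irrelevant) pairs ranked incorrectly;
    ties are counted against the classifier here."""
    losses = []
    for y, f in zip(Y, F):
        pos, neg = f[y == 1], f[y == 0]
        pairs = len(pos) * len(neg)
        if pairs:
            bad = sum(p <= q for p in pos for q in neg)
            losses.append(bad / pairs)
    return float(np.mean(losses))
```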
[Figure 1. Experimental results of ROI extraction.]

[Figure 2. Experimental results of ROI extraction under different noises. Panels: original image, Gaussian noise, salt & pepper noise.]
Table 1. Performance of EMDD-SVM with different kernels

Kernel     | hamming-loss | ranking-loss | one-error    | coverage     | average-prec.
Linear     | 0.173 ± .002 | 0.163 ± .003 | 0.309 ± .005 | 0.896 ± .010 | 0.802 ± .002
RBF        | 0.218 ± .007 | 0.232 ± .013 | 0.438 ± .017 | 1.17 ± .055  | 0.722 ± .013
Polynomial | 0.200 ± .001 | 0.192 ± .001 | 0.364 ± .003 | 1.02 ± .003  | 0.767 ± .001

Table 2. Performance of EMDD-KNN with different numbers of nearest neighbors

Neighbors | hamming-loss | ranking-loss | one-error    | coverage    | average-prec.
N = 5     | 0.212 ± .006 | 0.236 ± .012 | 0.433 ± .027 | 1.17 ± .049 | 0.726 ± .017
N = 6     | 0.210 ± .007 | 0.235 ± .013 | 0.436 ± .032 | 1.17 ± .053 | 0.724 ± .019
N = 7     | 0.212 ± .011 | 0.231 ± .010 | 0.428 ± .020 | 1.15 ± .047 | 0.729 ± .014
N = 8     | 0.206 ± .007 | 0.228 ± .011 | 0.422 ± .020 | 1.14 ± .052 | 0.734 ± .014
N = 9     | 0.213 ± .005 | 0.226 ± .010 | 0.418 ± .022 | 1.13 ± .043 | 0.737 ± .014
N = 10    | 0.211 ± .005 | 0.225 ± .012 | 0.419 ± .026 | 1.13 ± .053 | 0.735 ± .017
N = 11    | 0.208 ± .003 | 0.221 ± .010 | 0.413 ± .019 | 1.11 ± .044 | 0.739 ± .013
N = 12    | 0.209 ± .002 | 0.222 ± .010 | 0.413 ± .020 | 1.11 ± .044 | 0.739 ± .014
N = 13    | 0.207 ± .005 | 0.225 ± .011 | 0.425 ± .026 | 1.13 ± .049 | 0.732 ± .016
N = 14    | 0.205 ± .004 | 0.221 ± .012 | 0.410 ± .023 | 1.11 ± .058 | 0.740 ± .017
Table 3. Performance of MIMLBoost

Boosting rounds | hamming-loss | ranking-loss | one-error | coverage | average-prec.
5               | 0.228        | 0.347        | 0.448     | 1.39     | 0.640
10              | 0.234        | 0.276        | 0.515     | 1.28     | 0.670
15              | 0.218        | 0.276        | 0.522     | 1.32     | 0.664
20              | 0.225        | 0.276        | 0.511     | 1.30     | 0.674

Table 4. Performance of MIMLSVM

Kernel     | hamming-loss | ranking-loss | one-error    | coverage     | average-prec.
Linear     | 0.191 ± .002 | 0.188 ± .002 | 0.363 ± .008 | 0.987 ± .009 | 0.772 ± .003
RBF        | 0.232 ± .027 | 0.259 ± .036 | 0.476 ± .069 | 1.28 ± .152  | 0.691 ± .046
Polynomial | 0.225 ± .001 | 0.204 ± .001 | 0.382 ± .002 | 1.05 ± .003  | 0.755 ± .001

6. Conclusion and future work
In this paper, we propose novel techniques for multi-semantic scene classification based on regions of interest. First, we present a new approach to extract ROIs from each image, motivated by the biological visual mechanism: the FOA is extracted using the bottom-up, saliency-driven attention model proposed by Itti et al., and adjusted by the principles of whole effect and center preference; the regions of interest are then obtained from the FOA and the conspicuity maps, exploiting the spatial proximity and similarity characteristics of objects. Furthermore, we propose two MIML algorithms, EMDD-SVM and EMDD-KNN, which integrate multi-instance learning and multi-label learning for the multi-semantic scene classification task. Experimental results show that our ROI extraction algorithm obtains different kinds of objects of interest effectively under various complex clutter and is highly tolerant to noise, and that EMDD-SVM and EMDD-KNN achieve good performance on multi-semantic scene classification, with EMDD-SVM outperforming MIMLBoost and MIMLSVM. Ongoing work will concentrate on experimental evaluation on larger-scale problems. The proposed techniques will also be extended to image semantic retrieval and automatic machine navigation.
References

[1] A. Bosch, X. Muñoz, and R. Martí. A review: Which is the best way to organize/classify images by content? Image and Vision Computing, 25(6):778–791, 2007.
[2] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.
[3] Y. Chen and J. Z. Wang. Image categorization by learning and reasoning with regions. Journal of Machine Learning Research, 5:913–939, 2004.
[4] C. Grigorescu, N. Petkov, and M. A. Westenberg. Contour detection based on nonclassical receptive field inhibition. IEEE Transactions on Image Processing, 12(7):729–739, 2003.
[5] S. E. Grigorescu, N. Petkov, and P. Kruizinga. Comparison of texture features based on Gabor filters. IEEE Transactions on Image Processing, 11(10):1160–1167, 2002.
[6] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
[7] C. M. Privitera and L. W. Stark. Algorithms for defining visual regions-of-interest: comparison with eye fixations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):970–982, 2000.
[8] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999.
[9] T. Ro and R. D. Rafal. Components of reflexive visual orienting to moving objects. Perception and Psychophysics, 61:826–836, 1999.
[10] M. L. Zhang and Z. H. Zhou. ML-KNN: a lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.
[11] Q. Zhang and S. A. Goldman. EM-DD: an improved multiple-instance learning technique. In: T. G. Dietterich, S. Becker, and Z. Ghahramani, eds., Advances in Neural Information Processing Systems 14, pages 1073–1080, 2002.
[12] Z. H. Zhou and M. L. Zhang. Multi-instance multi-label learning with application to scene classification. In: Advances in Neural Information Processing Systems (NIPS'06), Vancouver, Canada, pages 1609–1616, 2006.