Bag-of-Visual-Words Based on Clonal Selection Algorithm for SAR Image Classification

Jie Feng, L. C. Jiao, Senior Member, IEEE, Xiangrong Zhang, Member, IEEE, and Dongdong Yang
Abstract—Synthetic aperture radar (SAR) image classification involves two crucial issues: a suitable feature representation technique and an effective pattern classification methodology. Here, we concentrate on the first issue. By exploiting Bag-of-Visual-Words (BOV), a well-known image feature processing strategy from image semantic analysis, together with the learning and adaptability of artificial immune systems (AIS) for solving complicated problems, we present a novel and effective image representation method for SAR image classification. In BOV, an effective fused feature set for local representation is first formulated and serves as the low-level features. After that, the clonal selection algorithm (CSA) from AIS is introduced to optimize the prediction error of k-fold cross-validation in order to obtain more suitable visual words from the low-level features. Finally, the BOV features are represented by the learned visual words for subsequent pattern classification. Compared with four other algorithms, the proposed algorithm obtains more satisfactory and convincing classification results.

Index Terms—Bag-of-Visual-Words (BOV), clonal selection algorithm (CSA), feature fusion, synthetic aperture radar (SAR) image classification.
I. INTRODUCTION

SYNTHETIC aperture radar (SAR) can obtain high-resolution radar images with rich amplitude, phase, and polarization information under all-weather, day-and-night, and long-distance conditions. It has therefore made great contributions to military defense and civil applications. Unfortunately, speckle noise is inherent in the SAR imaging process and complicates the identification of objects of interest. Consequently, efficient SAR image classification methodology [1], [2] remains a challenging and worthwhile subject of study.

Feature extraction and representation is the first crucial issue in SAR image classification.
Many feature extraction methods have been proposed [1]–[3]. In recent years, the Bag-of-Visual-Words (BOV) algorithm [4]–[10] from image semantic analysis has attracted much interest among optical image processing researchers. It provides a "midlevel" feature representation that helps reduce the semantic gap between the low-level features and the high-level concepts of land-cover types [5]. This feature representation technique has been successfully applied to generic visual categorization [4], [6], texture categorization [7], and object classification of aerial images [5].

In this letter, the BOV algorithm is applied to SAR image classification. First, features fused from a Gabor filter [3] and the gray-level co-occurrence matrix (GLCM) [1], which provide complementary information in different frequency bands, are taken as the low-level features in BOV. Then, the low-level features of the training data are quantized to form the visual words, and the data are represented as BOV features using the learned words.

The construction of visual words is a crucial step in the whole process, and many quantization methods have been proposed. The most popular is k-means [5]–[7]; alternatives include mean shift [8] and extremely randomized trees [9]. Note, however, that category-label information is not used efficiently in these methods. In [10], class-specific words are trained by Gaussian mixture models (GMMs); although label information is considered, the number of visual words grows sharply when the number of categories is very large. In [4], the authors addressed this issue by combining universal words with class-specific words. The results are relatively good, but that approach emphasizes differences between universal words and class words rather than differences between classes. Additionally, a large number of visual words brings trouble to both k-means and the expectation-maximization (EM) algorithm used for parameter estimation in GMMs, because of their sensitivity to initialization and weak local search ability.

Considering these issues, we present a novel BOV based on the clonal selection algorithm (CSA-BOV). A global and efficient search method, the clonal selection algorithm (CSA) [11], is devised to find proper visual words. Moreover, since the final purpose of our task is accurate classification, k-fold cross-validation [12] on the BOV features of the training data is employed here: its prediction error serves as CSA's objective function, which directly estimates the final classification performance [13].

II. CSA-BOV
Fig. 1. Bag-of-Visual-Words based on clonal selection algorithm for SAR image classification.
The goal of image classification is to group similar features into the same categories. However, the low-level features extracted from an image, such as color, texture, and shape, usually differ significantly from the semantic categories expressed in the form of text. We therefore need a feature representation method that describes image semantics efficiently; the BOV technique draws enlightenment from text categorization in terms of both form and semantics.

The basic framework of CSA-BOV for classifying a SAR image is shown in Fig. 1. The algorithm consists of three stages: preprocessing and feature extraction, BOV construction, and machine learning. The watershed algorithm at the first stage groups pixels into locally homogeneous patches, and the corresponding local Gabor and GLCM features are extracted as the low-level features. In the BOV stage, CSA determines the visual words, and the BOV features are represented by histograms of the occurrences of each visual word within the local patches. Finally, a support vector machine (SVM) classifies the BOV features, chosen for its good generalization performance.

A. Preprocessing and Feature Extraction

In this letter, the well-known watershed transformation [14] is employed to group pixels into local, coherent, and homogeneous image patches. This simplifies the classification process, since the number of pixels in SAR imagery is large even for small images at moderate resolution; here, about 1000 patches are obtained for an image of size 256 × 256.

Suitable feature extraction improves the overall quality of the final classification. Texture is an important characteristic for identifying land covers, and many texture features have been proposed; the study of feature fusion [1] has attracted much attention recently. Here, both the GLCM and the Gabor filter are used for local image representation, because the Gabor filter accurately captures low-frequency texture information while the GLCM is relevant to the higher frequency band response [3].

The GLCM is a stable statistical texture extraction method even when the imagery is contaminated by speckle noise. Following [1], we set the adjacent distance to 1 and the statistical directions to 0°, 45°, 90°, and 135°; the image quantization level is 16, and the window size is 9 × 9. Texture features are extracted from the GLCM using three well-known statistics (contrast, entropy, and correlation).
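For concreteness, the following is a minimal sketch of the GLCM statistics just described (distance 1; directions 0°, 45°, 90°, 135°; 16 gray levels; one 9 × 9 window at a time), assuming scikit-image is available; the dense window scan over the image is left out.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(window, levels=16):
    """Contrast, entropy, and correlation of one quantized window,
    averaged over the four directions used in this letter."""
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]  # 0, 45, 90, 135 deg
    glcm = graycomatrix(window, distances=[1], angles=angles,
                        levels=levels, symmetric=True, normed=True)
    contrast = graycoprops(glcm, "contrast").mean()
    correlation = graycoprops(glcm, "correlation").mean()
    p = glcm.astype(float)                  # shape (levels, levels, 1, 4)
    entropy = -np.sum(p * np.log2(p + 1e-12)) / p.shape[-1]  # mean over angles
    return np.array([contrast, entropy, correlation])

# Usage: quantize a 9x9 window to 16 levels first, e.g.
# win = np.floor(patch / 256.0 * 16).clip(0, 15).astype(np.uint8)
# feats = glcm_features(win)
```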
The Gabor function mimics the receptive fields of the human visual cortex. Varying the frequency and direction of its sine wave yields the multi-orientation and multi-scale features of the Gabor filter, which have been investigated in [3]. Following those conclusions, three orientations (θ = 0°, 60°, 120°) and six center frequencies (F = 7.8769, 4.5310, 4.0960, 3.9084, 3.5804, 2.6806) are chosen in this study. In addition, to prevent any single feature from dominating distance calculations, all features are normalized into the range [0, 1].

B. Construction of Visual Words

The visual words are obtained from the low-level features of the training data. This is a complicated search and optimization process in a high-dimensional feature space, with two key concerns: an efficient solver and a suitable objective function.

Generally, k-means, as an unsupervised algorithm, clusters the low-level features by minimizing within-cluster variance and defines each cluster center as a visual word. The supervised GMM models each visual word with a Gaussian component, whose parameters are estimated by EM to maximize the log-likelihood. However, both k-means and EM have obvious drawbacks: sensitivity to initialization and weak local search ability. Besides, the construction of visual words is only an intermediate step in the whole process; we are not concerned with the clustering quality in feature space but with the final classification accuracy [6]. It is therefore necessary to adopt a better optimization method and a fitter optimization criterion; these are the focal issues of this study.

Evolutionary algorithms (EAs) are adaptive artificial intelligence techniques that use computational models of evolutionary processes as key elements in the design of computer-based problem-solving systems. Different computing paradigms in EAs simulate different biological principles; for example, the genetic algorithm models the evolution of species in nature, and particle swarm optimization simulates bird flocking. Artificial immune systems (AIS) are adaptive systems inspired by the human immune system and have received significant interest. CSA inherits the AIS abilities of learning and adaptability for solving complicated problems. Owing to its easy application and efficiency, it has been widely applied to combinatorial optimization [11], image segmentation [15], and other engineering problems [16]. The population-based search mechanism and global evolutionary methodology of CSA can make up for the deficiencies of k-means and EM. These advantages inspired us to apply CSA to search for the visual words in the low-level features of the training data.
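For reference, a minimal sketch of the baseline quantization that CSA is meant to replace: a k-means codebook plus the BOV histogram representation. scikit-learn is assumed, and the names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(low_level_feats, K=300, seed=0):
    """k-means baseline: each cluster center becomes a visual word."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed)
    km.fit(low_level_feats)        # rows: one local feature vector per patch
    return km.cluster_centers_     # (K, d) array of visual words

def bov_histogram(patch_feats, words):
    """Assign each local feature to its nearest visual word and return
    the normalized histogram of word occurrences (the BOV feature)."""
    d2 = ((patch_feats[:, None, :] - words[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    hist = np.bincount(idx, minlength=len(words)).astype(float)
    return hist / hist.sum()
```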
TABLE I: Clonal Selection Algorithm (CSA)

TABLE II: Specific Parameter Settings in CSA-BOV
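Since the body of Table I is not reproduced here, the following is only a generic sketch of a clonal selection loop in the spirit of [11], parameterized by the encoding, objective, and maturation operators described below; it is not the authors' exact procedure.

```python
import numpy as np

def clonal_selection(init_pop, fitness, clone_fn, mature_fn,
                     generations=30, n_select=None):
    """Generic clonal selection loop: evaluate affinity, select the best
    antibodies, clone them proportionally to rank, mature the clones
    (crossover + mutation), and re-select the fittest individuals."""
    pop = list(init_pop)
    n_select = len(pop) if n_select is None else n_select
    for _ in range(generations):
        scored = sorted(pop, key=fitness)        # lower error = higher affinity
        parents = scored[:n_select]
        offspring = []
        for rank, ab in enumerate(parents):
            for clone in clone_fn(ab, rank):     # more clones for better ranks
                offspring.append(mature_fn(clone))
        pop = sorted(parents + offspring, key=fitness)[:len(pop)]
    return min(pop, key=fitness)
```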
The procedure of CSA is described in Table I. Here, each individual of length K × d genes represents a set of visual words by integer encoding, where K is the number of visual words and d is the feature dimension; each visual word occupies d consecutive genes in the individual.

Objective Function: Effective visual words are inseparable from the classification task, so the performance of a classifier is estimated here by its prediction error. A popular estimator of the prediction error is k-fold cross-validation. Specifically, a data set D of size n is randomly partitioned into k approximately equal-sized subsets D = {D_1, ..., D_k}, and k classifiers F_i are built; each uses a different subset D_i as the test set while the remaining k − 1 subsets D \ D_i are used for training. Here, the BOV features of the training data are determined by the visual words, and the prediction error of k-fold cross-validation is taken as the objective function, computed as [12]

$$P = \frac{1}{n}\sum_{i=1}^{k}\sum_{(v,c)\in D_i}\delta\bigl(F_i(D \setminus D_i,\, v),\, c\bigr) \qquad (1)$$

Fig. 2. Classification results of image 1. (a) Original image 1. (b)–(f) Classification results by CSA-BOV, GMM-BOV, k-BOV, LLFC, and HMTSeg, respectively.
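A minimal sketch of evaluating (1) for one CSA individual, assuming scikit-learn; the SVM settings follow those reported in Section III, and `bov_histogram` is the illustrative helper from Section II-B. Note that sklearn's `SVC` defaults to one-vs-one multiclass handling, whereas the letter uses one-against-all.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def objective(individual, patch_feats, labels, K, d, k=5):
    """Fitness of one individual: the k-fold CV prediction error of the
    BOV features induced by its encoded visual words, as in (1)."""
    words = np.asarray(individual, dtype=float).reshape(K, d)
    X = np.stack([bov_histogram(f, words) for f in patch_feats])
    clf = SVC(kernel="rbf", C=1e8, gamma=1e-4)   # Section III settings
    acc = cross_val_score(clf, X, np.asarray(labels), cv=k).mean()
    return 1.0 - acc                             # error rate: lower is better
```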
Note that δ(i, u) = 1 if i ≠ u, and 0 otherwise, so that P in (1) is the misclassification rate; v and c are an instance and its corresponding label in the test subset D_i.

Affinity Maturity: Affinity maturation perturbs an individual within its local region. Here, one-point crossover [15] is adopted, and a novel two-stage mutation operator is devised (a sketch follows). At the first stage, visual words are selected stochastically with probability 0.1. Then, within each selected visual word, gene locations are chosen from the d positions with probability 0.3 for local perturbation. If the value at a chosen gene location is x, it becomes rand[0, x] when x > 0.5 and rand[x, 1] otherwise, where rand[a, b] denotes a random number drawn uniformly from the range [a, b].
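A sketch of the two-stage mutation just described, treating the individual as a real-valued vector in [0, 1]^(K×d); the probabilities and perturbation rule follow the text, everything else is illustrative.

```python
import numpy as np

def two_stage_mutation(individual, K, d, p_word=0.1, p_gene=0.3, rng=None):
    """Two-stage mutation: select words with prob. 0.1, then gene
    locations within them with prob. 0.3; perturb each chosen gene
    toward the opposite half of [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    words = np.asarray(individual, dtype=float).reshape(K, d).copy()
    for w in range(K):
        if rng.random() >= p_word:
            continue                      # word not selected for mutation
        for g in range(d):
            if rng.random() >= p_gene:
                continue                  # gene location not selected
            x = words[w, g]
            # rand[0, x] when x > 0.5, otherwise rand[x, 1]
            words[w, g] = rng.uniform(0.0, x) if x > 0.5 else rng.uniform(x, 1.0)
    return words.reshape(-1)
```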
III. EXPERIMENTAL RESULTS AND ANALYSIS

Specific parameter settings for CSA-BOV are presented in Table II. Suitable values of k in k-fold cross-validation are 5 or 10 [13]; we choose the smaller one to reduce the computational cost of measuring the prediction error. At the last stage in Fig. 1, an SVM with the one-against-all approach handles the multiclass classification, using a radial basis function kernel with C = 10^8 and γ = 10^−4.

A. Classification Result

We investigate the performance of CSA-BOV in feature representation against direct low-level feature classification (LLFC), and in the construction of visual words against the representative BOV based on k-means (k-BOV) and BOV based on GMM (GMM-BOV). In GMM-BOV, the visual words consist of class-specific words learned by GMM. In addition, the well-known wavelet-domain hidden Markov tree segmentation (HMTSeg) [17] is included for comparison. The same low-level features are adopted in k-BOV, GMM-BOV, LLFC, and CSA-BOV, except that HMTSeg uses a Haar wavelet with three levels of decomposition to keep the original performance of that algorithm; 64 × 64 image blocks of each category are used for its model training.

In k-BOV, GMM-BOV, CSA-BOV, and LLFC, the training set is obtained as follows: there are 50 and 60 labeled points per class for experimental images 1 and 2, respectively; image patches containing at least one labeled point are selected; and the training set consists of all pixels located in the selected patches, about 15% of the total data in the two images.

The first image (256 × 256 pixels), shown in Fig. 2(a), is part of a Ku-band SAR image with 3-m spatial resolution acquired over California, USA. There are three types of land cover: runway, pavement, and building. Ground-truth labels for the whole SAR image are difficult to obtain, so we choose representative points; the test set is derived from these points in the same way as the training set. The numbers of test points for runway, pavement, and building are 258, 552, and 389, respectively.

Fig. 2(b)–(f) shows the results of the five algorithms. The pavement (green) in the middle of the image is misclassified as building (white) in Fig. 2(f), and some pixels in the runway (black) and pavement are incorrectly labeled in Fig. 2(e).
TABLE III: Accuracy and Kappa Coefficient of the Classification Results on SAR Image 1

TABLE IV: Accuracy and Kappa Coefficient of the Classification Results on SAR Image 2
Fig. 3. Classification results of image 2. (a) Original image 2. (b)–(f) Classification results by CSA-BOV, GMM-BOV, k-BOV, LLFC, and HMTSeg, respectively.

Fig. 4. Mean classification accuracy of CSA-BOV, GMM-BOV, k-BOV, and LLFC with simulated speckle added from 1-look to 16-look. Point A on the x-axis represents the original synthesized texture image.
Besides, the boundaries between the main runway and the pavement are not well defined in Fig. 2(c) and (d). Nevertheless, CSA-BOV presents the best classification result, as shown in Fig. 2(b). It is worth noting that the runway in some narrow regions in the lower left, upper middle, and lower right of image 1 is difficult to discriminate; among the five algorithms, only CSA-BOV clearly recognizes those thin linear objects near the main runway. Table III presents the average classification accuracy and kappa coefficient on image 1 over 30 independent runs; CSA-BOV obtains the best values for both indexes.

Another experiment is carried out on a 1-m spatial resolution X-band TerraSAR-X subimage (512 × 512 pixels) of the Swabian Jura, Germany, shown in Fig. 3(a). The image contains six typical land covers: vegetation, urban area, and four types of crops, with 1021, 1172, 459, 600, 729, and 917 representative points generating their test data, respectively.

Fig. 3(b)–(f) shows the results on SAR image 2. Vegetation and urban areas are mixed together and very hard to distinguish, as shown in Fig. 3(a). Certain vegetation regions (green) are misclassified as urban areas (black) in Fig. 3(d) and (e). In Fig. 3(f), the uniformity of the magenta crops improves, but the boundaries between classes are not well defined. CSA-BOV and GMM-BOV, shown in Fig. 3(b) and (c), achieve relatively better results among the five algorithms. Furthermore, CSA-BOV performs best in terms of the uniformity of the magenta and cyan crops and the precise localization of the boundary between the blue and magenta crops. Table IV presents the accuracy and kappa coefficient of the five algorithms; CSA-BOV again obtains the best statistical results, in agreement with the visual results in Fig. 3.
B. Sensitivity to the Number of Looks of Speckle

A core difficulty of SAR image classification is the intrinsic, unpredictable speckle noise. To evaluate the robustness of the proposed algorithm under speckle, a synthesized texture image with four categories is employed, and simulated speckle with different numbers of looks is added to it (a sketch of this simulation is given after Section III-C below). The results are shown in Fig. 4; the larger the number of looks, the weaker the speckle. The curves in Fig. 4 show that the mean classification accuracies of the four algorithms are nearly the same at point "A" on the horizontal axis. Regarding their dynamic behavior under different levels of speckle, the accuracy of CSA-BOV declines more slowly than that of the other three algorithms; a likely cause of the others' instability is that k-means and EM tend to get trapped in local optima. It is also worth noting that the curves of the three BOV algorithms decline more slowly than that of LLFC: BOV features are clearly more stable and effective in the presence of speckle, especially in CSA-BOV.

C. Influence of the Number of Training Data

The training set size influences the classification results. We analyze this influence by reducing the number of known points producing the training set from 100% to 20% per class on image 2. The results are shown in Fig. 5. When the number of known points falls below 60%, the average classification accuracy of all three algorithms declines sharply, probably because the per-class training set becomes too small to be discriminative. Additionally, the curve of CSA-BOV declines more slowly than those of GMM-BOV and k-BOV from 100% to 60% of known points, and CSA-BOV achieves higher accuracy over the whole interval. Thus, CSA-BOV is more reliable with smaller training sets.
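As referenced in Section III-B, a minimal sketch of the L-look speckle simulation under the standard fully developed multiplicative model, in which L-look intensity speckle is unit-mean Gamma(L, 1/L) noise. NumPy is assumed; this is our reading of the experiment, not necessarily the authors' exact generator.

```python
import numpy as np

def add_speckle(intensity_image, looks, rng=None):
    """Multiply a clean intensity image by unit-mean L-look speckle.
    A larger `looks` means weaker speckle (variance 1/L)."""
    rng = np.random.default_rng() if rng is None else rng
    speckle = rng.gamma(shape=looks, scale=1.0 / looks,
                        size=intensity_image.shape)
    return intensity_image * speckle

# e.g., the 1-look to 16-look settings swept in Fig. 4:
# noisy = add_speckle(clean, looks=4)
```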
Fig. 5. Average classification accuracy of CSA-BOV, GMM-BOV, and k-BOV as the number of known points per class varies from 100% to 20% on image 2.

D. Analysis of Parameters and Time Complexity in CSA-BOV

There are two types of free parameters in this study: the number of visual words and the parameters of CSA. Fig. 6(a) shows the mean classification accuracy of CSA-BOV for different numbers of visual words on image 2: the accuracy rises gradually, peaks at 600, and then drops slowly, so 600 is a suitable choice for image 2. The corresponding curve for image 1 has a similar tendency, and 300 words are selected for it.

Several CSA parameters are involved, such as the crossover and mutation probabilities and the population size, but the number of generations deserves the most attention. Fig. 6(b) shows the convergence curve of CSA-BOV on image 2: the fitness value decreases slightly and has fully converged by generation 30. The same conclusion holds for image 1. Finally, the time complexity of an algorithm matters for practical engineering. By systematic analysis, the time complexity of both the clonal and affinity maturation operations is O(CS × dK), where CS is the clonal scale, so the total time complexity per generation is O(CS × dK).

Fig. 6. (a) Mean classification accuracy of CSA-BOV on image 2 as the number of visual words varies from 200 to 700. (b) Convergence curve of CSA-BOV from 0 to 30 generations on image 2.

IV. CONCLUSION

We have developed a novel and efficient CSA-BOV algorithm for SAR image classification. Compared with four other algorithms, the method obtains promising classification results on two SAR images. CSA-BOV avoids the drawbacks of k-BOV and GMM-BOV through an evolutionary search methodology, and more suitable visual words are constructed from an optimization viewpoint. Future work will focus on features that are more stable under speckle noise, obtained by integrating additional information such as spatial relations.
REFERENCES

[1] D. A. Clausi, "Comparison and fusion of co-occurrence, Gabor and MRF texture features for classification of SAR sea ice imagery," Atmos. Ocean, vol. 39, no. 4, pp. 183–194, 2001.
[2] V. V. Chamundeeswari, D. Singh, and K. Singh, "An analysis of texture measures in PCA-based unsupervised classification of SAR images," IEEE Geosci. Remote Sens. Lett., vol. 6, no. 2, pp. 214–218, Apr. 2009.
[3] D. A. Clausi and H. Deng, "Design-based texture feature fusion using Gabor filters and co-occurrence probabilities," IEEE Trans. Image Process., vol. 14, no. 7, pp. 925–936, Jul. 2005.
[4] F. Perronnin, "Universal and adapted vocabularies for generic visual categorization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 7, pp. 1243–1256, Jul. 2008.
[5] S. Xu, T. Fang, D. Li, and S. W. Wang, "Object classification of aerial images with bag-of-visual words," IEEE Geosci. Remote Sens. Lett., vol. 7, no. 2, pp. 366–370, Apr. 2010.
[6] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in Proc. ECCV Int. Workshop Stat. Learn. Comput. Vis., Prague, Czech Republic, 2004, pp. 1–22.
[7] L. Qin, W. Q. Wang, Q. M. Huang, and W. Gao, "Unsupervised texture classification: Automatically discover and classify texture patterns," Image Vis. Comput., vol. 26, no. 5, pp. 647–656, May 2008.
[8] F. Jurie and B. Triggs, "Creating efficient codebooks for visual recognition," in Proc. 10th IEEE Int. Conf. Comput. Vis., Beijing, China, 2005, pp. 604–610.
[9] R. Maree, P. Geurts, J. Piater, and L. Wehenkel, "Random subwindows for robust image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., San Diego, CA, 2005, pp. 34–40.
[10] J. Farquhar, S. Szedmak, H. Meng, and J. Shawe-Taylor, "Improving 'bag-of-keypoints' image categorisation," Univ. Southampton, Southampton, U.K., Tech. Rep., 2005.
[11] L. N. de Castro and F. J. Von Zuben, "Learning and optimization using the clonal selection principle," IEEE Trans. Evol. Comput., vol. 6, no. 3, pp. 239–251, Jun. 2002.
[12] M. Stone, "Cross-validatory choice and assessment of statistical predictions," J. R. Stat. Soc. Ser. B, vol. 36, no. 1, pp. 111–147, 1974.
[13] J. D. Rodriguez, A. Perez, and J. A. Lozano, "Sensitivity analysis of k-fold cross validation in prediction error estimation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 3, pp. 569–575, Mar. 2010.
[14] L. Vincent and P. Soille, "Watersheds in digital spaces: An efficient algorithm based on immersion simulations," IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 6, pp. 583–598, Jun. 1991.
[15] E. R. Hruschka, R. J. G. B. Campello, A. A. Freitas, and A. C. P. L. F. de Carvalho, "A survey of evolutionary algorithms for clustering," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 39, no. 2, pp. 133–155, Mar. 2009.
[16] F. Campelo, F. G. Guimaraes, and H. Igarashi, "A clonal selection algorithm for optimization in electromagnetics," IEEE Trans. Magn., vol. 41, no. 5, pp. 1736–1739, May 2005.
[17] H. Choi and R. G. Baraniuk, "Multiscale image segmentation using wavelet-domain hidden Markov models," IEEE Trans. Image Process., vol. 10, no. 9, pp. 1309–1321, Sep. 2001.