
Scene Capture and Selected Codebook-Based Refined Fuzzy Classification of Large High-Resolution Images

Li Yan, Member, IEEE, Ruixi Zhu, Yi Liu, and Nan Mo

Abstract— Scene classification has been successfully applied to the semantic interpretation of large high-resolution images (HRIs). The bag-of-words (BOW) model has been proven to be effective but inadequate for HRIs because of the complex arrangement of the ground objects and the multiple types of land cover. How to define the scenes in HRIs is still a problem for scene classification. The previous methods involve selecting the scenes manually or with a fixed spatial distribution, leading to scenes with a mixture of objects from different categories. In this paper, to address these issues, a scene capture method using adjacent segmented images and a support vector machine classifier is proposed to generate scenes dominated by one category. The codebook in BOW is obtained from clustering features extracted from all the categories, which may lose the discrimination in some vocabularies. Thus, more discriminative visual vocabularies are selected by the introduced mutual information and the proposed intraclass variability balance in each category, to decrease the redundancy of the codebook. In addition, a refined fuzzy classification strategy is presented to avoid misclassification in similar categories. The experimental results obtained with three different types of HRI data sets confirm that the proposed method obtains classification results better than those obtained by most of the previous methods in all the large HRIs, demonstrating that the selection of representative vocabularies, the refined fuzzy classification, and the scene capture strategy are all effective in improving the performance of scene classification. Index Terms— Bag of words (BOW), intraclass variability balance (IVB), large high-resolution images (HRIs), mutual information, refined fuzzy classification strategy, scene capture strategy, selection of representative vocabularies.

I. INTRODUCTION

RECENTLY, the number of different sensors has greatly expanded, and they now offer us massive amounts of so-called high-resolution images (HRIs) with a spatial resolution up to 0.31 m. However, due to the complex arrangements of the ground objects and the multiple types of land cover in HRIs, scene-level land-use classification is still a challenging task, which has attracted broad attention [1].

Manuscript received October 3, 2017; revised March 9, 2018; accepted April 14, 2018. This work was supported in part by the National Key Research and Development Program of China under Grant 2016YFC0802500 and in part by the Key Research and Development Program of Jiangxi Province under Grant 20171BBE50062. (Corresponding authors: Ruixi Zhu; Yi Liu.) The authors are with the School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TGRS.2018.2828314

Different types of scene classification methods have been proposed over the past decades. As mentioned in [2], scene classification methods can be classified into three kinds: methods based on low-level features, such as the scale-invariant feature transform (SIFT) [3], local binary patterns [4], the color histogram [5], and the Gist descriptor [6]; methods relying on mid-level visual representations, such as bag of words (BOW) [7], probabilistic latent semantic analysis (PLSA) [8], and latent Dirichlet allocation (LDA) [9]; and methods based on high-level vision information, such as OverFeat [10], CaffeNet [11], and GoogLeNet [12].

The traditional methods based on low-level and mid-level representations are unable to capture the complex semantic concepts of HRIs. This may lead to a divergence between the low-level data and the high-level semantic information, namely, the so-called semantic gap [13]. It is difficult for low-level methods to depict the high diversity and the nonhomogeneous spatial distributions in HRIs without describing detailed information. Methods based on mid-level representations are statistics of the low-level representations, but they are not as effective at capturing high-level features as high-level convolutional neural networks (CNNs). High-level CNNs work well on data sets with segmented images, such as the UC Merced data set, but they may perform poorly on large-scale complex HRIs because of inadequate training samples or inappropriate segmentation of the HRI [14].

Among the different scene classification methods, the BOW model has been successfully applied to the scene classification of HRIs [15]–[17]. In order to improve the ability to capture scale invariance, various methods have been developed. Lazebnik et al. [18] proposed a spatial pyramid matching (SPM) framework by partitioning the image into increasingly fine subregions and computing histograms of the local features found inside each subregion. Ojala et al. [4] proposed a multiscale gray-scale and rotation invariant texture classification method to decrease the influence of the gray-scale value on the classification. Zhao et al. [19] proposed the concentric circle-structured multiscale bag-of-visual-words model to handle the problem of rotation and scale transformation. However, the above-mentioned methods all cluster features from all the categories into visual vocabularies. This strategy will mix features which may provide important discriminative cues for multiclass categorization. As can be seen in Fig. 1, the features extracted from the patches in the “parking lot” scene and those extracted from the patches in the “harbor” scene may



Fig. 1. Two similar patches from different scenes of harbor and parking lot.

be grouped in the same cluster due to the similarity of these two types of patches, leading to similar clusters and BOW representations. Therefore, this visual word-based approach loses its discriminative ability in separating these two scenes, and more discriminative codebooks need to be generated. In order to solve this problem, fusion of the visual vocabularies in each category derived from the clustered features has been proposed [20]. The visual words created in this way have been proven to be more discriminative in multilabel classification [21]. However, images from different categories may have similar regions, resulting in the same visual word belonging to more than one category. Therefore, a selection of the visual words in each category is performed, using the introduced mutual information (MI) combined with the proposed intraclass variability balance (IVB), to decrease the redundancy in the visual words. MI [22] and the proposed IVB measure the representativeness and generalization of the visual vocabularies in each category, respectively. In the proposed method, visual vocabularies at a scale as large as the whole image are combined with those at smaller scales. The selected visual words from each scale in each category are then fused to form the final codebook.

Moreover, the previous BOW-based classification methods usually input the designed features into certain classifiers (e.g., k-nearest neighbor [23], support vector machine (SVM) [24], and random forest [25], [26]) for the final classification, which is called “one-step” classification in this paper. The SVM classifier, originally designed for binary classification problems, is usually based on a one-versus-the-rest or a one-versus-one strategy for multiclass classification. However, bias will exist in the training data set when using a one-versus-the-rest strategy, since there are many times more negative samples than positive samples. There may also be the same number of votes for multiple classes in the decision-making stage of one-versus-one strategies, so a testing image may belong to more than one category, affecting the classification accuracy.
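The voting problem can be made concrete with a small illustration (the class names and pairwise outcomes below are hypothetical, not taken from the paper): when three similar classes beat each other cyclically in the one-versus-one duels, every class collects the same number of votes and majority voting alone cannot decide.

```python
# Hypothetical one-versus-one outcomes among three similar scene classes; each
# entry records the winner of one binary SVM. A cyclic outcome like this one
# is perfectly possible when the classes resemble each other.
pairwise_winner = {
    ("dense_residential", "medium_residential"): "dense_residential",
    ("medium_residential", "sparse_residential"): "medium_residential",
    ("dense_residential", "sparse_residential"): "sparse_residential",
}

votes = {c: 0 for pair in pairwise_winner for c in pair}
for winner in pairwise_winner.values():
    votes[winner] += 1

print(votes)
# {'dense_residential': 1, 'medium_residential': 1, 'sparse_residential': 1}
# Each class collects one vote, so majority voting cannot pick a single label.
```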


Therefore, as we can see from the above statements, these two strategies cannot offer the optimum result under some conditions. In the case of scenes similar to each other, one-step methods may deliver a low classification accuracy because of the similar feature representations. In order to better approach the optimum, a refined fuzzy classification strategy (which can also be called “two-step” classification) is proposed, which generates candidate labels in the first step of classification. Sophisticated categorization is then used to obtain the final predicted label from the several candidate labels. This strategy is effective in classifying scenes that are similar to each other.

Last but not least, the previous CNN architectures are usually aimed at classifying segmented images, and they do not consider how to segment large HRIs [27]–[29]. However, large HRIs are usually more complex and detailed, with different land-cover objects and spatial distributions. Inappropriate segmentation of an HRI will have a negative impact on the final classification accuracy. When faced with scenes with a mixture of objects from different categories, the scenes are usually assigned to the label covering the largest area in the image. When the areas of the different categories are close, it is hard to judge which category the scene belongs to by visual interpretation. Therefore, it is to some degree meaningless to distinguish between these mixed scenes. As can be seen in Fig. 2, the images are scenes segmented from a large HRI. The images in Fig. 2(a)–(c) mainly consist of only one kind of object, such as forest, freeway, and residential; these scenes are easier to interpret. However, Fig. 2(d) is a scene with a mixture of forest, freeway, and residential, and which category this scene belongs to cannot be ascertained by visual interpretation. Therefore, a method to automatically capture scenes such as Fig. 2(a)–(c), rather than Fig. 2(d), is needed to decrease the error in interpreting scenes with a mixture of objects from different categories. Previous methods often split the HRI into a set of small overlapping images [29]–[31], which may generate scenes such as Fig. 2(d). Some researchers have manually selected scenes [33] to capture scenes where objects from one category cover a dominant area, but this takes a lot of time and cannot be applied to large HRIs. Therefore, a scene capture method is proposed to automatically capture segments where objects from one category occupy the most area, to decrease the error in interpreting the images.

The major contributions of this paper are as follows.
1) The most representative visual vocabularies are selected from each category using MI and IVB with the scale information, decreasing the error resulting from the redundancy of the fused visual codebook, to improve the discriminative ability of the codebook.
2) A refined fuzzy classification strategy is proposed, using several candidate labels from the first-step classification as the constraint of the second-step classification. This method can avoid misclassification in similar scenes, to some extent, which is a problem when classifying with the previous one-step classifiers.
3) A scene capture method is proposed to automatically capture segments without a mixture of different categories. This method can capture segments where one


Fig. 2. Scenes that are easier and harder to classify. (a) Scene mainly dominated by forest. (b) Scene mainly dominated by freeways. (c) Scene mainly dominated by residential. (d) Scene with a mixture of forest, freeway, and residential.

category occupies a dominant area, to decrease the error resulting from uncertainty in the interpretation of mixed scenes.

The rest of this paper is organized as follows. In Section II, the proposed method based on scene capture and a refined fuzzy classification strategy with a selected fused codebook is introduced. The details of our experiments and the results are presented in Section III. A sensitivity analysis is given in Section IV. Finally, Section V concludes the paper with a discussion of the experimental results and our ideas for future work.

II. SCENE CLASSIFICATION BASED ON SCENE CAPTURE AND REFINED FUZZY CLASSIFICATION WITH A SELECTED FUSED CODEBOOK

Problems exist in the BOW-based methods of classification of HRIs, such as scale variance in the images [34], classification of similar categories [35], inappropriate segmentation of the HRI [36], and so on. Therefore, scene capture and a refined fuzzy classification strategy with a selected fused codebook are proposed for HRI scene classification. The proposed method consists of three main steps: 1) scene capture for the large HRI, to capture images where one category occupies a dominant area; 2) multiscale patch sampling and selection of the representative visual vocabularies in each category, and fusion of the codebooks from the different categories; and 3) a refined fuzzy classification strategy via a probabilistic SVM classifier. The overall flowchart is shown in Fig. 3, where the text in red reflects the section that the displayed part corresponds to.

A. Scene Capture Method for HRIs

As can be seen in Fig. 2, how to segment scenes from the HRI will have an impact on the final classification accuracy.

Fig. 3. Overall architecture of the proposed method.

CNNs and the previous BOW-based frameworks are not aimed at segmenting appropriate scenes from the HRI, which may lead to errors in semantically interpreting scenes with a mixture of different categories if the HRI is segmented improperly. As a result, how to automatically capture the segments where one class occupies the most area in the HRI is still a challenging task. To handle this problem, an automatic scene capture method using adjacent segmented images is proposed.


Fig. 4. Sketch of the proposed scene capture method.

Before illustrating the method, an assumption has to be made: image segments such as Fig. 2(a)–(c) are more likely to be correctly classified than those that demonstrate a mixture of different categories, such as Fig. 2(d). When probabilistic outputs from the SVM classifier are used, this assumption can be regarded as a dominant maximum probability for one category in one image. As shown in [37], the probabilistic output for each label of the SVM classifier can be calculated as shown in (1)–(3):

$$P_{A,B}(f) \equiv \frac{1}{1+\exp(Af+B)} \tag{1}$$

where $f(x) = h(x) + b$ and $h(x) = \sum_i y_i \alpha_i k(x_i, x)$. The best parameter setting $(A, B)$ can be determined by solving the following regularized maximum likelihood problem:

$$\min_{(A,B)} F(z) = -\sum_{i=1}^{l} \bigl(t_i \log(p_i) + (1-t_i)\log(1-p_i)\bigr) \tag{2}$$

where $p_i = P_{A,B}(f_i)$ and

$$t_i = \begin{cases} \dfrac{N_+ + 1}{N_+ + 2} & \text{if } y_i = +1 \\[4pt] \dfrac{1}{N_- + 2} & \text{if } y_i = -1, \end{cases} \qquad i = 1, \ldots, l \tag{3}$$

with $N_+$ of the $y_i$ positive and $N_-$ negative.

As can be seen in Fig. 4, the criteria used for capturing the optimum scene can be illustrated as follows.
1) First of all, the HRI is split into a set of overlapping images. For each split image, the probabilistic outputs for each label are calculated for the images $p_i$ in eight directions, along with the original image. The maximum probability from the probabilistic outputs of each image is then calculated, giving the maximum probabilities of all the adjacent images.
2) The highest of these maximum probabilities is then found, and the image corresponding to the highest maximum probability is chosen as the candidate scene.
3) Finally, if the maximum probability of the candidate scene is larger than a defined threshold, this image is chosen as the segmented scene for this split image; otherwise, no image is selected as the scene for this split image.
4) This method captures the optimal segments where one category covers a dominant area through a moving window adjacent to the original segmented patches, to decrease the error resulting from inappropriate interpretation of the segmented patches. These optimal segments can better represent the category they belong to than the original patches.
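As a rough illustration of criteria 1)–3), the sketch below scans one split image and its eight neighbors with a probabilistic classifier and keeps the window with the highest maximum class probability, provided it exceeds the threshold. This is not the authors' implementation: the half-window offset, the `extract_bow_features` helper, and the classifier interface (`predict_proba`, e.g., an sklearn `SVC(probability=True)` fitted on BOW histograms, which realizes the Platt scaling of (1)–(3)) are assumptions for illustration.

```python
def capture_scene(hri, top, left, win, clf, extract_bow_features, threshold=0.6):
    """Scan one split image and its eight neighbors and keep the window whose
    maximum class probability is highest, provided it exceeds `threshold`.

    hri                  : H x W x 3 array holding the large HRI.
    top, left, win       : position and size (pixels) of the original split image.
    clf                  : probabilistic classifier exposing predict_proba,
                           e.g. an SVC(probability=True) fitted on BOW histograms.
    extract_bow_features : callable mapping a win x win x 3 patch to a feature vector.
    """
    h, w = hri.shape[:2]
    step = win // 2  # assumed offset to the eight adjacent windows
    candidates = []
    for dy in (-step, 0, step):          # criterion 1: original window plus eight neighbors
        for dx in (-step, 0, step):
            y, x = top + dy, left + dx
            if 0 <= y and 0 <= x and y + win <= h and x + win <= w:
                patch = hri[y:y + win, x:x + win]
                proba = clf.predict_proba([extract_bow_features(patch)])[0]
                candidates.append((proba.max(), int(proba.argmax()), (y, x)))
    best_prob, best_label, best_pos = max(candidates)   # criterion 2: highest maximum probability
    if best_prob >= threshold:                           # criterion 3: threshold test
        return best_pos, best_label, best_prob
    return None                                          # no scene is captured for this split image
```

The returned window, rather than the original split image, is then treated as the scene for the subsequent classification, which is the essence of criterion 4).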

B. Selection of Visual Vocabularies in the Fused Codebook

Spectral features alone are inadequate for scene classification, since scenes from different categories may demonstrate similar spectral characteristics and scenes from the same category may demonstrate different ones. Dense SIFT features [18] contain not only the spectral values of the key points and their neighborhood pixels but also the local spectral contrast between the key points and their neighborhood pixels. Therefore, dense SIFT features are clustered by k-means to generate discriminative visual vocabularies in each category for scene classification. However, visual vocabularies from a single scale or a randomly selected scale with a limited range, as used in previous approaches, may fail to describe the image regions at other scales [38]. Therefore, visual words in a multiscale visual codebook are proposed. The multiscale visual codebook, combining the global and local features into a uniform framework, gives a richer representation of the scene image.

We assume that the image is $I \in \mathbb{R}^{m \times n}$, and that this image can be represented by a visual codebook made up of multiscale vocabularies $V = \{V_s,\ s = 1, 2, \ldots, S\}$, where $V_s = \{V_i^{(s)},\ i = 1, 2, \ldots, n_s\}$ denotes the set of visual words at scale $s$. In order to generate the multiscale visual vocabularies, the training images are divided into overlapping square patches at different scales, as shown in Fig. 5. For scale $s$, the width and height of the patches are $W/2^{s-1}$ and $H/2^{s-1}$, respectively. Fig. 5 displays the multiscale patches at scales 1–3. The number of square patches at scale $s$ is $(2^s-1)^2$, and the weight for scale $s$ is $1/2^{S-s+1}$.
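The patch layout can be made concrete with the short sketch below (not taken from the paper). The half-patch stride is an assumption inferred from the stated patch count of (2^s − 1)^2, since the exact overlap is not spelled out.

```python
def multiscale_patches(width, height, num_scales):
    """Yield (scale, weight, (left, top, patch_w, patch_h)) for every patch.

    At scale s the patch size is (width / 2**(s-1), height / 2**(s-1)); a
    half-patch stride is assumed, which reproduces the (2**s - 1)**2 patches
    per scale stated in the text, and scale s carries weight 1 / 2**(S-s+1).
    """
    for s in range(1, num_scales + 1):
        pw, ph = width // 2 ** (s - 1), height // 2 ** (s - 1)
        weight = 1.0 / 2 ** (num_scales - s + 1)
        for iy in range(2 ** s - 1):
            for ix in range(2 ** s - 1):
                yield s, weight, (ix * pw // 2, iy * ph // 2, pw, ph)

# For a 256 x 256 image and S = 3 this gives 1 + 9 + 25 = 35 patches;
# the single scale-1 patch is the whole image.
assert len(list(multiscale_patches(256, 256, 3))) == 35
```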


Fig. 5. Multiscale overlapped patches.

In the previous visual codebook, the features of the whole training data set are clustered using k-means or fuzzy c-means, which may lose discriminability when classifying similar categories. The fused codebook is instead clustered from the features in each category. However, the vocabularies in different categories may be the same, thus leading to redundancy in the fused codebook. Therefore, the dimension of the visual vocabularies needs reducing [39]. First of all, the initial visual vocabularies are derived by clustering all the dense SIFT features at each scale of one specific category. The classical MI and the proposed IVB, which measure the distinctiveness and generalization of the vocabularies, are then used to select the most discriminative visual vocabularies from the vocabularies at all scales in each category, to improve the discriminative ability of the final fused codebook. The selected vocabularies are multiplied by the weights corresponding to their scales, and the vocabularies in all the categories form the final fused codebook. The selection procedure is shown in Fig. 6, and the details are as follows.

As can be seen in Fig. 6, dense SIFT features are extracted from each image at each scale. For one specific category, the dense SIFT features at each scale are then clustered by k-means to obtain the visual codebook at each scale. However, redundancy may exist among the visual vocabularies from all the scales, so MI and IVB are used to select the most representative visual vocabularies in each category.

Assuming that $V = \{V_{is},\ s = 1, 2, \ldots, S,\ i = 1, 2, \ldots, C\}$ denotes the visual vocabularies of scene category $i$ at scale $s$, the MI between the $j$th vocabulary of $V_{is}$ and category $i$, namely $\mathrm{MI}(V_{is}^{j}, i)$, can be calculated as shown in the following equation:

$$\mathrm{MI}(V_{is}^{j}, i) = \sum_{k=1}^{u} P(V_{is}^{j}, k) \log \frac{P(V_{is}^{j}, k)}{P(V_{is}^{j})\,P(k)} \tag{4}$$

where $u$ is the number of categories, $P(V_{is}^{j}, k)$ is the joint probability of the visual vocabulary $V_{is}^{j}$ and the images belonging to category $k$, $P(V_{is}^{j})$ is the probability of the visual vocabulary $V_{is}^{j}$, and $P(k)$ is the probability of the images belonging to category $k$.

A visual vocabulary with a higher MI value for one specific category means that there is a strong correlation between this visual vocabulary and the category. A desired property of a discriminative visual vocabulary is that it should have a high value of $\mathrm{MI}(V_{is}^{j}, i)$ while having low values of $\mathrm{MI}(V_{is}^{j}, j)$, $j \neq i$.

It is not enough to measure the discrimination of a vocabulary with MI alone. A visual word can be made up of features from different objects in the same image, with many of them probably belonging to the same category. Even when different, objects from the same class should share several visual vocabularies. Taking this into consideration, a vocabulary best describes a category when it is made up of features from objects similar to that class. Therefore, IVB is proposed to evaluate how much a given class deviates from its ideal value. A visual vocabulary with a higher IVB value for one specific category means that this visual vocabulary demonstrates a strong generalization over the category. A desired property of a discriminative vocabulary is that it should have a high value of $\mathrm{IVB}(V_{is}^{j}, i)$ while having low values of $\mathrm{IVB}(V_{is}^{j}, j)$, $j \neq i$. With the visual vocabulary $V_{is}^{j}$ for a given object category $i$, $\mathrm{IVB}(V_{is}^{j}, i)$ can be calculated as shown in the following equation:

$$\mathrm{IVB}(V_{is}^{j}, i) = 1 - \frac{O_i}{2(O_i - 1)} \sum_{m=1}^{O_i} \left| \frac{O_{m,V_{is}^{j},i}}{f_{V_{is}^{j},i}} - \frac{1}{O_i} \right| \tag{5}$$

where $O_i$ is the number of images of category $i$ in the training set, $O_{m,V_{is}^{j},i}$ is the number of features extracted from the $m$th image in category $i$ that belong to visual vocabulary $V_{is}^{j}$, and $f_{V_{is}^{j},i}$ is the number of features of category $i$ that belong to visual vocabulary $V_{is}^{j}$.

MI and IVB produce two different evaluations of the discrimination. In order to find a consensus, the Borda count algorithm [40] is used to combine these two evaluations. Assuming that the number of visual vocabularies in each category is $|K|$, a visual vocabulary receives $|K|$ votes if it ranks first in terms of MI or IVB, $|K|-1$ votes for a second preference, $|K|-2$ for a third, and so on for each ranking. The individual votes for each vocabulary are then added, and the final ranking of the vocabularies is obtained for each category. Vocabularies whose final rankings are below a threshold are removed from the codebook of this category, and the same threshold is applied to all the categories. The eliminated vocabularies may exist in different categories, or they cannot reflect representative information in one specific category. The remaining vocabularies are then multiplied by the weights corresponding to their scales and fused by concatenation.

The proposed selection method selects discriminative visual vocabularies for each category. It selects the vocabularies from all the scales that demonstrate high representativeness and generalization over the category, and fuses all the selected vocabularies in each category with their corresponding weights to form the final codebook. This method can remove some of the vocabularies existing in different categories, to improve the distinctiveness of the codebook.
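The following is a minimal sketch (not the authors' code) of how the MI and IVB scores and the Borda-count fusion could be computed for one category and one scale. It assumes a hypothetical `word_counts` matrix storing how often each candidate visual word occurs in each training image, together with the image labels; the relative-frequency probability estimates in `mi_scores` are an assumption, since the paper does not state how the probabilities in (4) are estimated.

```python
import numpy as np

def mi_scores(word_counts, labels):
    """MI between each visual word and the category variable, cf. (4).
    Probabilities are estimated from relative frequencies of word
    occurrences (an assumption)."""
    total = word_counts.sum()
    p_word = word_counts.sum(axis=0) / total                  # P(V_j)
    mi = np.zeros(word_counts.shape[1])
    for k in np.unique(labels):
        joint = word_counts[labels == k].sum(axis=0) / total  # P(V_j, k)
        p_k = joint.sum()                                     # P(k)
        nz = joint > 0
        mi[nz] += joint[nz] * np.log(joint[nz] / (p_word[nz] * p_k))
    return mi

def ivb_scores(word_counts, labels, category):
    """IVB of each visual word for one category, cf. (5).
    Assumes at least two training images in the category."""
    per_image = word_counts[labels == category]               # O_{m,V,i} for every word
    o_i = per_image.shape[0]
    f = per_image.sum(axis=0)                                 # f_{V,i}
    with np.errstate(divide="ignore", invalid="ignore"):
        dev = np.abs(per_image / f - 1.0 / o_i).sum(axis=0)
    ivb = 1.0 - o_i / (2.0 * (o_i - 1)) * dev
    return np.where(f > 0, ivb, 0.0)                          # words absent from the category get no credit

def borda_select(mi, ivb, keep_fraction):
    """Combine the MI and IVB rankings with a Borda count and keep
    the best `keep_fraction` of the words."""
    n = len(mi)
    # rank 0 = best; a word ranked r-th receives n - r votes from each criterion
    votes = (n - np.argsort(np.argsort(-mi))) + (n - np.argsort(np.argsort(-ivb)))
    keep = max(1, int(round(keep_fraction * n)))
    return np.argsort(-votes)[:keep]                          # indices of the retained words
```

For instance, with p = 65% one could call `borda_select(mi_scores(counts, labels), ivb_scores(counts, labels, category=c), 0.65)` for each category c and each scale, and then concatenate the retained, weight-scaled words into the fused codebook.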

Fig. 6. Procedure of the selection of the most representative vocabularies in the fused visual codebook.

C. Refined Fuzzy Classification Strategy

Some images that are similar in feature representation may output close probabilities for several labels when performing multilabel probabilistic classification. These labels may include both true labels and incorrect labels. The previous methods distinguish the true labels from all the other labels, but labels that are very different from the true labels in feature representation are unlikely to be the true labels. Therefore, a refined fuzzy classification strategy is proposed, as shown in Fig. 7. This strategy uses the results of the first-step classification to generate labels more likely to be true labels, and regards these labels as candidate labels in the second-step classification. The details are as follows.

The three labels with the highest probabilistic outputs in the SVM classifier are selected as candidate labels, to avoid misclassification with similar categories, since the true label may be very similar to one or two other labels. Assuming that these three labels are i, j, and k, a two-class probabilistic SVM classifier is generated for every pair of labels, forming three two-class probabilistic SVM classifiers to determine which label is the most similar to the true label. Three probabilistic results are obtained from the three two-class probabilistic SVM classifiers. The label with the higher probability is considered the more competitive label in each two-class probabilistic classifier, and the label that is most competitive in both of its two probabilistic classifiers is regarded as the predicted label. However, in the case of all three labels being more competitive in only one classifier, the label with the highest sum of probabilities is considered the predicted label.

The refined fuzzy classification method removes the influence of labels unlikely to be the true labels and selects the most competitive label by comparing every two of the three candidate labels. This strategy can improve the classification performance when there are categories whose probabilities in the classifiers are close to each other.

Fig. 7. Illustration of the refined fuzzy classification strategy.
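A minimal sketch of this two-step decision rule is given below, under the assumption of a trained multi-class probabilistic classifier `clf` (first step) exposing `predict_proba` and `classes_`, and a hypothetical dictionary `pairwise` of two-class probabilistic SVMs keyed by sorted label pairs (second step); all names are illustrative rather than the authors' implementation.

```python
from itertools import combinations
import numpy as np

def refined_fuzzy_predict(x, clf, pairwise):
    """Two-step prediction for one feature vector x.

    clf      : first-step multi-class probabilistic classifier exposing
               predict_proba and classes_ (e.g. sklearn SVC(probability=True)).
    pairwise : dict mapping each sorted label pair (a, b) to a two-class
               probabilistic SVM trained only on categories a and b.
    """
    proba = clf.predict_proba([x])[0]
    top3 = list(clf.classes_[np.argsort(proba)[-3:]])          # three candidate labels
    wins = {label: 0 for label in top3}
    prob_sum = {label: 0.0 for label in top3}
    for a, b in combinations(sorted(top3), 2):
        two_class = pairwise[(a, b)]
        p = two_class.predict_proba([x])[0]
        pa = p[list(two_class.classes_).index(a)]
        pb = p[list(two_class.classes_).index(b)]
        wins[a if pa > pb else b] += 1                          # more competitive label in this duel
        prob_sum[a] += pa
        prob_sum[b] += pb
    best = max(wins, key=wins.get)
    if wins[best] == 2:                                         # most competitive in both of its duels
        return best
    return max(prob_sum, key=prob_sum.get)                      # cyclic tie: highest summed probability
```

The fall-back to the summed probabilities is exactly the rule used when each candidate label wins only one of the three pairwise comparisons.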

III. EXPERIMENTAL SETUP AND RESULTS

In this section, the data sets used for the experiments and the parameter settings of the proposed method are described. The results obtained for the scene classification of a benchmark high-resolution aerial orthoimagery data set and two large high-resolution aerial orthoimages are then annotated and analyzed. All three data sets were acquired by ordinary digital cameras with only three channels: red, green, and blue.

A. Description of the Data Sets

Three different types of data sets were used in the experiments. The first image data set was the UC Merced data set [41]. This data set was manually extracted from aerial orthoimagery downloaded from the United States Geological Survey (USGS) National Map program. It consists of 21 challenging categories, with 100 images per class. Sample images from each category of this data set are shown in Fig. 8. The images were rotated by 90°, 180°, and 270°, respectively, with each class finally containing


TABLE I OPTIMAL PARAMETER SETTINGS FOR THE THREE DATA SETS

The third image data set was acquired from aerial images covering Jinmen, Hubei province, China. The spatial resolution of this image is 0.14 m, with only the three RGB channels. The large image to be annotated was of 11 791 × 12 119 pixels, as shown in Fig. 11(a). There were seven main categories of training images: airport, industry, residential, bare land, vegetation, freeway, and river. The original image was converted into 400 × 400 pixel subimages.

B. Experimental Setup for HRI Scene Classification

Fig. 9. Original images and rotated images. (a) Original image. (b) Images rotated by 90°. (c) Images rotated by 180°. (d) Images rotated by 270°.

400 images. Samples of the rotated images are displayed in Fig. 9. Experiments using both the original and rotated data sets were undertaken. The images have a spatial resolution of 30 cm in the RGB color space, with a size of 256 × 256 pixels. The data set contains highly overlapping classes, such as dense residential, medium residential, and sparse residential, which mainly differ in the density of the buildings.

The second image data set was again acquired from the USGS, covering Montgomery County, OH, USA [29]. The spatial resolution of this image is 0.6 m, with only the RGB channels. The large image to be annotated was of 10 000 × 9000 pixels, as shown in Fig. 10(a). There were five main classes of training images: residential, farm, forest, freeway, and parking lot. The original image was converted into 150 × 150 pixel subimages.

In the experiments, all the images were uniformly sampled with a patch size and spacing of eight and four pixels, respectively, to extract the dense SIFT features. To test the stability of the proposed scene classification method, the different methods were executed five times, each time with a random selection of 50% of all the segmented images as the training data set, to obtain convincing results for the three data sets; the other 50% of the images were used as test images. The training data set was divided into two parts: 50% for training and 50% for finding the optimal parameter settings. The calculated optimal parameter settings were then used for classifying the test images with the liblinear implementation [42]. The parameter settings included the scale s, the percentage of visual words p, the codebook size k, the threshold of the scene capture strategy t, and the number of training samples No. The parameter settings of the SVM, c and g, were calculated by cross-validation. The ranges of these parameters were as follows:

$$\begin{aligned}
p &= \{20\%, 35\%, 50\%, 65\%, 80\%\} \\
k &= \{20i, 35i, 50i, 65i, 80i\} \\
t &= \{0.2, 0.3, 0.4, 0.5, 0.6\} \\
s &= \{1, 2, 3, 4\} \\
No &= \{20\%, 35\%, 50\%, 65\%, 80\%\} \\
c &= \{2^{-2}, 2^{-1}, \ldots, 2^{1}, 2^{2}\} \\
g &= \{10^{-2}, 10^{-1}, \ldots, 10^{1}, 10^{2}\}
\end{aligned} \tag{6}$$
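As an illustration of how the SVM parameters c and g could be tuned by cross-validation over the ranges in (6), the sketch below uses scikit-learn's GridSearchCV on hypothetical BOW features; reading c and g as the penalty and RBF kernel parameters is an assumption, and the random data merely stand in for the real histograms.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical training matrix of BOW histograms and their scene labels.
rng = np.random.default_rng(0)
X_train = rng.random((200, 50))
y_train = rng.integers(0, 5, size=200)

# Search grid following the c and g ranges in (6); an RBF kernel is assumed.
param_grid = {
    "C": [2.0 ** e for e in range(-2, 3)],
    "gamma": [10.0 ** e for e in range(-2, 3)],
}
search = GridSearchCV(SVC(kernel="rbf", probability=True), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```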

The optimal parameter settings for the three data sets are displayed in Table I, where i is the number of categories in the data. The sensitivity analysis for the data sets was performed by fixing the other six optimal parameters and changing only one parameter, as shown in Section IV. In order to further evaluate the performance of the proposed method, the obtained experimental results were compared with the three mid-level methods of BOW, PLSA, and LDA and three well-known CNN architectures: OverFeat, CaffeNet, and GoogLeNet. Both results with and without the refined fuzzy


Fig. 10. (a) Whole image for scene classification. (b) Example images associated with the five land-use categories from the image.

Fig. 11. (a) Whole image for scene classification. (b) Example images with the seven land-use categories from the image: (1) airport, (2) bare land, (3) freeway, (4) industry, (5) residential, (6) river, and (7) vegetation.

classification strategy and with and without the selected fused vocabularies were obtained.

TABLE II COMPARISON WITH STATE-OF-THE-ART METHODS IN TERMS OF CLASSIFICATION ACCURACY WITH THE ORIGINAL AND ROTATED UC MERCED DATA SETS

C. Experiment 1: Classification of the UC Merced Data Set

First of all, in order to prove the superiority of the proposed refined fuzzy classification strategy and the selection of the fused codebook, the classification performance of the proposed approach was compared with several of the state-of-the-art methods mentioned in Section III-B. The results obtained with both the original data set and the rotated data set are displayed in Table II. As can be seen in Table II, the proposed method delivers the third-best classification accuracy in the original and rotated UC Merced data sets, outperforming all the other approaches except for CaffeNet and OverFeat. This is because the UC Merced data set consists of segmented images, and the scene capture strategy cannot be used for more appropriate segmentation. The classification accuracies obtained with the rotated data set are lower than those in the original data set, but the decrease in accuracy of the proposed method and the CNN architectures is lower than that of the other methods. This is due to the fact that more rotation variance exists in the rotated data set than in the original data set. CNNs extract high-level features from the images, decreasing the effect of the rotation

transform. The proposed method finds the most representative visual vocabularies from each scale in each category, and the patches in each scale are different from those in SPM, increasing the possibility of features belonging to one specific category. Therefore, the proposed method is more robust to the rotation transform than the other mid-level representations without a multiscale selected fused codebook.


Fig. 12. Producer’s accuracies with the UC Merced data set for the proposed method with the rotated data set. The class labels are assigned as follows: 1 = agricultural, 2 = airplane, 3 = baseball diamond, 4 = beach, 5 = buildings, 6 = chaparral, 7 = dense residential, 8 = forest, 9 = freeway, 10 = golf course, 11 = harbor, 12 = intersection, 13 = medium residential, 14 = mobile home park, 15 = overpass, 16 = parking lot, 17 = river, 18 = runway, 19 = sparse residential, 20 = storage tanks, and 21 = tennis court.

The classification accuracy was also compared with the methods without fused vocabularies and without selected fused vocabularies. The experimental results demonstrate that both the refined fuzzy classification strategy and the selected fused vocabularies improve the classification accuracy, but the refined fuzzy classification strategy plays the more important role in the classification, since it removes the impact of some irrelevant labels on the classification. An overview of the performance of the proposed method and the methods without the two strategies is shown in Figs. 12 and 13, respectively. As can be seen in the confusion matrix, most of the scene classes achieve excellent classification performances by the use of the proposed method, and the airplane, chaparral, forest, golf course, and sparse residential scenes can be almost fully recognized by the proposed method, with an accuracy close to one. However, some scenes deliver relatively poor performances, including dense residential, freeway, intersection, and river, since these scenes may be confused with more than two categories. As can be seen in Fig. 13(a)–(c), the labels may be confused with more than two categories, but the number of categories that the labels in Fig. 13(c) are confused with is less than in Fig. 13(a) and (b). This is because the refined classification strategy removes some labels unlikely to be true labels, and thus these removed labels have no effect on the subsequent classification.

Fig. 13. Confusion matrices showing the classification performance with the UC Merced data set (a) without using the refined fuzzy classification strategy, (b) without the selected fused vocabularies, and (c) proposed method.

TABLE III COMPARISON WITH STATE-OF-THE-ART METHODS IN ACCURACY IN THE SIRI-WHU AND JINMEN DATA SETS

D. Experiment 2: Semantic Annotation of the SIRI-WHU Data Set

The overall classification accuracies were compared with the methods mentioned in Section III-B. Table III shows the average overall accuracies for all the methods. It can be seen that the proposed method yields the best accuracy and outperforms the three CNN architecture approaches, as a result of the scene capture strategy. Note that the CNN architectures and the proposed method all classify images segmented by a scene capture strategy. The scene capture strategy selects scenes where one category occupies a dominant area, which

can decrease the error resulting from classifying scenes with a mixture of objects from different categories. The proposed method performs better than the CNN architectures in terms of classification accuracy because the number of training samples is limited and the classified categories are dissimilar, which is


Fig. 14. Semantic annotation of the SIRI-WHU data set. (a) Original image. (b) Without scene capture strategy. (c) With scene capture strategy.

Fig. 15. Confusion matrices showing the classification performances with the SIRI-WHU data set for the three different methods. (a) With both strategies. (b) Without the scene capture strategy. (c) Without both strategies.

beneficial to the refined classification strategy. The results also confirm that both the refined fuzzy classification strategy and the scene capture strategy are effective ways to increase the classification accuracy. The annotated map for the SIRI-WHU image is shown in Fig. 14, and the confusion matrices for each category are reported in Fig. 15. As can be seen in Fig. 15, the classification accuracy is higher in Fig. 14(c) than in Fig. 14(b), since the scenes in Fig. 14(c) are mainly dominated by objects from one category. In Fig. 14(b), the major confusion occurs on the borders of different categories, since these images may cover different categories. It can also be seen that a lot of scenes of the farmland and freeway categories are misclassified into forest, because the representative objects of the true label may be covered by trees. Compared with those in Fig. 14(b), the borders of the different categories in Fig. 14(c) are more accurate, since the location of the segmented images is more flexible than in the existing overlapping patches. As can be seen in Fig. 15(a), the majority of the confusion in the proposed approach occurs between farmland and forest, and between parking lot and residential, as these scenes are dominated by similar backgrounds. There is also confusion between freeway and forest, since the freeway scenes may contain many trees. As can be seen in Fig. 15(a) and (b), the scene capture strategy may reduce the effect of scenes with a mixture of different categories on the classification. The method without the two strategies

displayed in Fig. 15(c) demonstrates the lowest classification accuracy in all categories, because of its poor performance in classifying similar categories or scenes with a mixture of categories. The scenes shown in Fig. 16 all consist of objects from farmland and forest, but the farmland and forest classes occupy different proportions in the scenes, leading to misclassification between these two categories. The images in Fig. 16(b) and (e) are classified correctly into their true labels, but Fig. 16(c) and (d) may be misclassified since, by visual interpretation, forest and farmland occupy similar areas. This is mainly because moving the scene by 50 pixels is not enough to avoid scenes with a mixture of different objects. Similar situations may exist in the parking lot and residential classes or the forest and residential classes. This point is reflected in the accuracy results in Fig. 15.

E. Experiment 3: Semantic Annotation of the Jinmen Aerial Data Set

The experimental results obtained with BOW, PLSA, LDA, and the three CNN architectures in the Jinmen data set are shown for comparison in Table III. Similar conclusions to Section III-D can be drawn, in that the proposed method outperforms the other methods in terms of classification


Fig. 16. Examples of scenes correctly and incorrectly classified. Red indicates the incorrectly classified examples. (a) Farmland scene. (b) Scene with 1/8 forest and 7/8 farmland. (c) Scene with 1/4 forest and 3/4 farmland. (d) Scene with 1/2 forest and 1/2 farmland. (e) Scene with 3/4 forest and 1/4 farmland. (f) Forest scene. Scenes (c) and (d) are difficult to classify by visual interpretation.

Fig. 17. Annotated image obtained by the proposed method. (a) Image to be annotated. (b) Annotated image.

accuracy. These methods all classify images segmented by a scene capture strategy. The effectiveness of the scene capture strategy and the refined classification strategy can be seen in the improved classification accuracy. The annotation results obtained with the proposed method for the large Jinmen aerial image are shown in Fig. 17(b). As can be seen from Fig. 17(b), the major confusion again occurs on the borders. Some confusion exists between the airport and freeway, as the parking apron in the airport is similar to the freeway. Due to the fact that the industry and residential scenes are similar in texture, the two scene classes also show confusion. However, the overall annotation performance is still satisfactory. One confusion matrix for the Jinmen aerial data set was selected from the results obtained by the proposed method, and is shown in Fig. 18. From Fig. 18, it can be seen that all of the scenes can be recognized well by the proposed method, with an accuracy close to one, except for the industry and residential scenes. This is, however, reasonable, as the industry and residential scenes are all composed of buildings. There is also some confusion between bare land and vegetation, since their backgrounds are similar.

IV. SENSITIVITY ANALYSIS

Fig. 18. Confusion matrix of the proposed method with the Jinmen aerial data set.

A. Sensitivity Analysis in Relation to the Scale s

To investigate the sensitivity of the fused vocabularies, with or without selecting vocabularies, in relation to the scale s, the other six parameters were kept at the optimal parameter settings. The scale s was then varied over the range of [1–4] for both the UC Merced and SIRI-WHU data sets. As shown in Table IV, with the increase in the scale s, the overall accuracies of the two methods both become higher and higher. Therefore, the scale of 4 is chosen as the optimal scale for generating the codebook. It is notable that the selected fused codebook is superior to the unselected fused codebook over the entire range, for both data sets, by an average of over 4% in classification accuracy, demonstrating the effectiveness of the selection of representative vocabularies.

B. Sensitivity Analysis in Relation to the Percentage of Selected Visual Words p

To investigate the sensitivity of the selected fused codebook in relation to the percentage of remaining vocabularies p,


TABLE IV CLASSIFICATION ACCURACY (%) FOR VISUAL VOCABULARIES COMBINING SCALES 1–4, RESPECTIVELY, USING UNSELECTED AND SELECTED CODEBOOKS

Fig. 19. Effect of the codebook size on classification accuracy in both data sets. (a) SIRI-WHU. (b) UC Merced.

TABLE V CLASSIFICATION ACCURACY (%) FOR DIFFERENT PERCENTAGES OF VISUAL VOCABULARIES

the other six parameters were kept at the optimal parameter settings. The percentage was then varied over the range of [20%, 35%, 50%, 65%, 80%, 95%], for both data sets. As can be seen in Table V, when the percentage is below 65%, the classification accuracy improves with the increase in the percentage of selected visual words p in the UC Merced data set. However, the classification accuracy decreases when the percentage is above 65%. A similar trend can be seen in the SIRI-WHU data set, where the classification accuracy reaches a peak at 80%.

C. Sensitivity Analysis in Relation to the Codebook Size k

Since the codebook size k has an impact on the final classification accuracy, the codebook size was varied over the range of [420, 735, 1050, . . . , 1995] for the UC Merced data set and [100, 175, 250, . . . , 475] for the SIRI-WHU data set. As can be seen from Fig. 19(a), when the codebook size is below 400, the classification accuracy improves gradually with the increase of the codebook size. When the codebook size is above 400, the classification accuracy improves only slightly.

The variation trend in the UC Merced data set is similar to that in the SIRI-WHU data set. The blue circle represents the optimal parameter setting corresponding to the data set in Section III-B.

D. Sensitivity Analysis in Relation to the Threshold of the Scene Capture Strategy

The threshold of the maximum probability used in the scene capture strategy was varied over the range of [0.2, 0.3, . . . , 0.7] for the SIRI-WHU data set. A balance between the classification accuracy and the number of samples needs to be kept, to retain enough samples for training. As a result, experiments on the effect of the number of samples were also performed. As can be seen in Fig. 20(a), the classification accuracy improves with the increase in the threshold. When the threshold changes from 0.3 to 0.4, a sharp increase occurs, since a large number of scenes with a mixture of different objects have been removed. When the threshold is above 0.6, the classification accuracy decreases slightly. The trend in Fig. 20(b) corresponds to that in Fig. 20(a), to some degree.


Fig. 20. Effect of the maximum probability threshold on classification accuracy and the number of samples. (a) Effect of the threshold on classification accuracy. (b) Effect of the threshold on the number of remaining images.

When the threshold changes from 0.3 to 0.5, a sharp decrease occurs. When the threshold is 0.6, about 400 samples still remain, which is enough for training and testing. The blue circle represents the optimal parameter setting in Section III-B.

E. Sensitivity Analysis in Relation to the Number of Training Samples Per Class

The classification accuracy is limited by the number of training samples per class. The percentage of samples in the training data set was varied over the range of [0.2, 0.35, . . . , 0.95], for both data sets. The overall accuracies are reported in Fig. 21. As can be seen from Fig. 21, when the percentage of samples is under 0.8, the classification accuracy decreases sharply with the decrease in training samples. However, when the percentage of samples is above 0.8, the proposed method improves slightly in classification accuracy. The SIRI-WHU data set displays a similar variation trend to the UC Merced data set. The blue circle represents the optimal parameter setting corresponding to the data set in Section III-B.

Fig. 21. Effect of the number of training samples on classification accuracy with the SIRI-WHU and UC Merced data sets.

V. CONCLUSION

In this paper, a scene capture strategy has been proposed in order to automatically capture the scenes dominated by one category in the HRIs. This strategy avoids scenes with a mixture of different categories and improves the classification accuracy. A refined fuzzy classification strategy for classifying similar categories is also presented, and the most representative fused visual words are selected by MI and IVB so as to reduce the redundancy in the fused visual vocabularies. The experiments allowed the following conclusions to be made.
1) By using the scene capture strategy, scenes where one category covers a dominant area can be captured automatically, and methods with this strategy outperform those without the strategy, and even some CNN architectures, in terms of the classification accuracy.
2) The selected fused codebook demonstrates a higher discriminative ability and classification accuracy than that without selection, showing the effectiveness of MI and IVB in vocabulary selection.
3) The refined fuzzy classification strategy solves the problem of classifying similar categories, to some degree, and the overall classification accuracy is higher than most of the one-step classification methods.

In our future work, other features more appropriate for HRIs, such as texture, shape, or structural features, will be explored for scene classification. Modeling the high-level spatial information between visual words will be considered to solve the problem of classifying scenes with the same composition but different arrangements, and semantic segmentation of scenes will be explored in order to solve the problem of different sizes of objects in the scene.


REFERENCES
[1] K. Qi, H. Wu, C. Shen, and J. Gong, "Land-use scene classification in high-resolution remote sensing images using improved correlatons," IEEE Geosci. Remote Sens. Lett., vol. 12, no. 12, pp. 2403–2407, Dec. 2015.
[2] G.-S. Xia et al., "AID: A benchmark data set for performance evaluation of aerial scene classification," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 7, pp. 3965–3981, Jul. 2017.
[3] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[4] T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987, Jul. 2002.
[5] M. J. Swain and D. H. Ballard, "Color indexing," Int. J. Comput. Vis., vol. 7, no. 1, pp. 11–32, 1991.
[6] A. Oliva and A. Torralba, "Building the gist of a scene: The role of global image features in recognition," Progr. Brain Res., vol. 155, pp. 23–63, Jan. 2006.
[7] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in Proc. Workshop Statist. Learn. Comput. Vis. (ECCV), vol. 44, 2004, pp. 1–22.
[8] A. Bosch, A. Zisserman, and X. Muñoz, "Scene classification via pLSA," in Proc. Eur. Conf. Comput. Vis., 2006, pp. 517–530.
[9] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
[10] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. Lecun. (2013). "OverFeat: Integrated recognition, localization and detection using convolutional networks." [Online]. Available: https://arxiv.org/abs/1312.6229
[11] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 675–678.
[12] C. Szegedy et al., "Going deeper with convolutions," in Proc. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1–9.
[13] Q. Zhu, Y. Zhong, L. Zhang, and D. Li, "Scene classification based on the fully sparse semantic topic model," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 10, pp. 5525–5538, Oct. 2017.
[14] O. A. B. Penatti, K. Nogueira, and J. A. D. Santos, "Do deep features generalize from everyday objects to remote sensing and aerial scenes domains?" in Proc. Comput. Vis. Pattern Recognit. Workshops, Jun. 2015, pp. 44–51.
[15] F. Zhang, B. Du, and L. Zhang, "Saliency-guided unsupervised feature learning for scene classification," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 4, pp. 2175–2184, Apr. 2015.
[16] J. Yang, Y. G. Jiang, A. G. Hauptmann, and C. W. Ngo, "Evaluating bag-of-visual-words representations in scene classification," in Proc. ACM SIGMM Int. Workshop Multimedia Inf. Retr., Augsburg, Germany, Sep. 2007, pp. 197–206.
[17] J. Winn, A. Criminisi, and T. Minka, "Object categorization by learned universal visual dictionary," in Proc. 10th IEEE Int. Conf. Comput. Vis., vol. 2, Oct. 2005, pp. 1800–1807.
[18] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2006, pp. 2169–2178.
[19] L.-J. Zhao, P. Tang, and L.-Z. Huo, "Land-use scene classification using a concentric circle-structured multiscale bag-of-visual-words model," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 12, pp. 4620–4631, Dec. 2014.
[20] J. Qin, F. Deng, and N. H. C. Yung, "Scene categorization based on local–global feature fusion and multi-scale multi-spatial resolution encoding," Signal, Image Video Process., vol. 8, no. 1, pp. 145–154, 2014.
[21] J. Qin and N. H. C. Yung, "Feature fusion within local region using localized maximum-margin learning for scene categorization," Pattern Recognit., vol. 45, pp. 1671–1683, Apr. 2012.
[22] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Trans. Neural Netw., vol. 5, no. 4, pp. 537–550, Jul. 1994.
[23] T. Cover, "Estimation by the nearest neighbor rule," IEEE Trans. Inf. Theory, vol. IT-14, no. 1, pp. 50–55, Jan. 1968.
[24] M. M. Adankon and M. Cheriet, "Support vector machine," in Proc. Int. Conf. Intell. Netw. Intell. Syst., 2009, pp. 418–421.
[25] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.

[26] P. O. Gislason, J. A. Benediktsson, and J. R. Sveinsson, "Random forests for land cover classification," Pattern Recognit. Lett., vol. 27, no. 4, pp. 294–300, 2006.
[27] G. Cheng, C. Ma, P. Zhou, X. Yao, and J. Han, "Scene classification of high resolution remote sensing images using convolutional neural networks," in Proc. Geosci. Remote Sens. Symp., Jul. 2016, pp. 767–770.
[28] G. J. Scott, R. A. Marcum, C. H. Davis, and T. W. Nivin, "Fusion of deep convolutional neural networks for land cover classification of high-resolution imagery," IEEE Geosci. Remote Sens. Lett., vol. 14, no. 19, pp. 1638–1642, Sep. 2017.
[29] X. Lu, X. Zheng, and Y. Yuan, "Remote sensing scene classification by unsupervised representation learning," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 9, pp. 5148–5157, Sep. 2017.
[30] Y. Zhong, Q. Zhu, and L. Zhang, "Scene classification based on the multifeature fusion probabilistic topic model for high spatial resolution remote sensing imagery," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 11, pp. 6207–6222, Nov. 2015.
[31] B. Zhao, Y. Zhong, and L. Zhang, "Scene classification via latent Dirichlet allocation using a hybrid generative/discriminative strategy for high spatial resolution remote sensing imagery," Remote Sens. Lett., vol. 4, no. 12, pp. 1204–1213, Dec. 2013.
[32] B. Zhao, Y. Zhong, G.-S. Xia, and L. Zhang, "Dirichlet-derived multiple topic scene classification model for high spatial resolution remote sensing imagery," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 4, pp. 2108–2123, Apr. 2016.
[33] F. Zhang, B. Du, and L. Zhang, "Scene classification via a gradient boosting random convolutional network framework," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 3, pp. 1793–1802, Mar. 2016.
[34] J. Zou, W. Li, C. Chen, and Q. Du, "Scene classification using local and global features with collaborative representation fusion," Inf. Sci., vol. 348, pp. 209–226, Jun. 2016.
[35] L. Yan, R. Zhu, N. Mo, and Y. Liu, "Improved class-specific codebook with two-step classification for scene-level classification of high resolution remote sensing images," Remote Sens., vol. 9, no. 3, p. 223, 2017.
[36] P. P. Singh and R. D. Garg, "Classification of high resolution satellite images using spatial constraints-based fuzzy clustering," J. Appl. Remote Sens., vol. 8, no. 1, p. 083526, 2014.
[37] J. C. Platt, "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods," Adv. Large Margin Classifiers, vol. 10, no. 3, pp. 61–74, 1999.
[38] J. Qin and N. H. C. Yung, "Scene categorization with multiscale category-specific visual words," Opt. Eng., vol. 48, no. 4, p. 047203, 2009.
[39] X. Zheng, Y. Yuan, and X. Lu, "Dimensionality reduction by spatial–spectral preservation in selected bands," IEEE Trans. Geosci. Remote Sens., vol. 55, no. 9, pp. 5185–5197, Sep. 2017.
[40] P. Emerson, "The original Borda count and partial voting," Social Choice Welfare, vol. 40, no. 2, pp. 353–358, 2013.
[41] Y. Yang and S. Newsam, "Bag-of-visual-words and spatial extensions for land-use classification," in Proc. SIGSPATIAL Int. Conf. Adv. Geographic Inf. Syst., 2010, pp. 270–279.
[42] C. C. Chang and C. J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 1–27, 2011.

Li Yan (M’17) received the B.S., M.S., and Ph.D. degrees in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 1989, 1992, and 1999, respectively. He is currently a Luojia Distinguished Professor with the School of Geodesy and Geomatics, Wuhan University. His research interests include photogrammetry, remote sensing, and precise image measurement.


Ruixi Zhu received the B.S. degree in surveying and mapping engineering and the M.S. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 2014 and 2017, respectively, where he is currently pursuing the Ph.D. degree with the School of Geodesy and Geomatics. His research interests include high-resolution image classification and domain adaptation in remote sensing applications.

Yi Liu received the B.S. degree in computer science and technology from the China University of Geosciences, Wuhan, China, in 2003, and the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 2009. She is currently an Associate Professor with the School of Geodesy and Geomatics, Wuhan University. Her research interests include remote sensing image processing and deep learning.


Nan Mo received the B.S. degree in surveying and mapping engineering from the China University of Mining and Technology, Xuzhou, China, in 2014. She received the M.S. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 2017, where she is currently pursuing the Ph.D. degree with the School of Geodesy and Geomatics. Her research interests include high-resolution image classification and object recognition, image processing, and deshadowing of remote sensing images.
