Feature Selection via Cramer's V-Test Discretization for Remote-Sensing Image Classification

Bo Wu, Liangpei Zhang, Senior Member, IEEE, and Yindi Zhao
Abstract—A feature selection method based on Cramer's V-test (CV-test) discretization is presented to improve the classification accuracy of remotely sensed imagery. Three contributions are pursued in this paper. First of all, a Cramer's V-based discretization (CVD) algorithm is proposed to optimally partition continuous features into discrete ones. Two association-based feature selection indexes, the CVD-based association index (CVDAI) and the class-attribute interdependence maximization (CAIM)-based association index (CAIMAI), derived from the CV-test value, are then proposed to select the optimal feature subset. Finally, the benefit of using discretized features to improve the performance of the J48 decision tree (J48-DT) and naive Bayes (NB) classifiers is studied. To validate the proposed approaches, a high spatial resolution image and two hyperspectral data sets were used to evaluate the performances of CVD and the associated algorithms. The discretization tests comparing CVD with two other state-of-the-art methods, CAIM and equal width (EQW), show that the CVD-based technique has the better ability to generate a good discretization scheme. Furthermore, the feature selection indexes, CVDAI and CAIMAI, perform better than the other tested feature selection methods in terms of the overall accuracies achieved by the J48-DT, NB, and support vector machine classifiers. Our tests also show that the use of discretized features benefits the J48-DT and NB classifiers.

Index Terms—Association index, Cramer's V-test (CV-test), feature discretization, feature selection, image classification.
Manuscript received June 28, 2012; revised January 22, 2013 and April 14, 2013; accepted May 10, 2013. Date of publication June 26, 2013; date of current version February 28, 2014. This work was supported in part by the National Key Technology Research and Development Program of China under Grant 2013BAC08B01, by the National Basic Research Program of China (973 Program) under Grant 2011CB707105, and by the National Natural Science Foundation of China under Grants 41101336 and 41061130553.

B. Wu is with the Key Laboratory of Spatial Data Mining and Information Sharing of Ministry of Education, Fuzhou University, Fuzhou 350002, China (e-mail: [email protected]).

L. Zhang is with the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan 430079, China.

Y. Zhao is with the School of Environment Science and Spatial Informatics, China University of Mining and Technology, Xuzhou 221008, China.

Digital Object Identifier 10.1109/TGRS.2013.2263510

I. INTRODUCTION

CURRENTLY, the most commonly available remote sensing (RS) data usually provide fine resolutions in both the spatial and spectral domains. Data acquired from these sensors provide detailed and spectrally continuous spatial information about the land surface, thus opening a wide range of opportunities and challenges for RS image analysis. To accurately extract information from such data, prior researchers
have suggested extracting and using various potential features, such as spectral, structural, textural, and shape features, to improve the image interpretation [1]–[4]. Moreover, it is possible to add complementary data, such as zone planning and geographic positioning [5], in addition to the spectral features and their derived spatial information, which can result in a mixed data mode (both continuous and discrete). In this sense, it is reasonable to assume that the RS features to be handled will normally be many and complex. In light of this abundant information, developing efficient data processing techniques is an urgent requirement.

It is well known that the use of all possible features in a classification procedure results in information redundancy and can significantly reduce the overall accuracy [6], [7]. Employing feature selection techniques to obtain the most useful subset from the original set is therefore a critical step for successful image processing. As a consequence, dimensionality reduction techniques have long attracted attention from the RS community [7]–[12]. In general, two broad categories of dimensionality reduction techniques are encountered in RS applications: feature extraction methods and feature selection methods. The feature extraction methods provide a small set of new features which contain the vast majority of the original remotely sensed data set's information, based on a transformation of the original data set. A vast number of feature extraction approaches have been proposed in the past decades; some useful methods include projection pursuit [8], independent component analysis [9], decision boundary feature extraction, nonparametric weighted feature extraction [10], the tensor model [11], and so on.

More popular, and of particular interest in this paper, are the feature selection methods. Feature selection is related to the selection of a subset of the original features that captures the relevant properties of the entire data set. On the whole, feature selection algorithms fall into three categories: filter, wrapper, and embedded models. The filter-based techniques rely on the general characteristics of the data and evaluate features without involving any learning model. Some widely used algorithms, such as F-score [13], the maximal relevance and minimal redundancy criterion (mRMR) [14], [15], relief feature selection (ReliefF) [16], cluster-based feature selection [17], and correlation-based feature selection [5], are filter-based approaches. Wrapper-based models use the model's accuracy to select the subsets of features according to their predictive ability, implying a high computational cost. Various wrapper-based models, including sequential forward selection [18], genetic algorithms [19], and kernel methods [20], [21], have been found to be useful for RS classification. Differing from the wrapper-based methods, embedded methods allow interaction between the
feature selection procedure and the model. A typical embedded feature selector is recursive feature elimination, which uses the changes in the decision function of the support vector machine (SVM) as a criterion for the selection. Archibald and Fann [22] introduced an embedded feature selection algorithm designed to operate with SVM to simultaneously perform band selection and classification. In recent years, feature selection algorithms associated with SVM have been intensively studied. In [21] and [23], the extracted morphological profiles, together with the spectral features, were concatenated and then selected by SVM for image classification. Tuia et al. [20] proposed to learn the relevant features with SVM from high-dimensional data by optimizing a linear combination of kernels dedicated to different meaningful sets of features. Waske et al. [24] used a multiple classifier system based on SVM and random feature selection in the classification of hyperspectral data. Although these methods have achieved varying degrees of success, as wrapper-based models, the SVM-based methods are time consuming and are only efficient for specific problems and classifiers. Another disadvantage of wrappers is that the ranked features do not provide a feature weighting, since all the features are given the same weight. In contrast, filters can easily be scaled to very high-dimensional data sets, they are computationally more efficient, and, most of all, the selected features can be easily interpreted. Developing filter-based feature selection methods that can address a mixed data mode, optimal redundancy removal, and the exploitation of domain knowledge in high-dimensional data is therefore of interest.

We recently proposed a filter-based feature selection approach named the max–min-association index (MMAIQ) [25]. This method incrementally selects features that simultaneously satisfy the criteria of a statistically maximal association between target labels and features and a minimal association between the selected features, with respect to the Cramer's V-test (CV-test) value. MMAIQ has the ability to reduce the mutual information among features because it imposes a penalty on redundancy for the candidate features. Moreover, MMAIQ can address complex features, since the CV-test measures the variable association of mixed data modes (e.g., continuous, discrete, and nominal data). Our experiments confirmed that this method has a positive effect on the performance of RS image classification. However, MMAIQ adopts the simple equal width (EQW) algorithm to discretize continuous features, without taking the class labels into account in the process of calculating the CV-test value [25]. Since EQW can be badly affected by the data distribution and outliers, it may result in a nonoptimal discretization and accordingly reduce the effectiveness of the subsequent feature selection [26]. In [27], the impact of discretization methods on the rough set-based classification of remotely sensed data was reported, and the authors argued that a good discretization scheme could significantly reduce the large quantities of data and also enable image classification. Therefore, MMAIQ (in what follows, we refer to it as EQWAI to specify the discretization approach) could be further refined if a good discretization scheme were developed.

Discretization is a process used to partition continuous features into a finite set of adjacent intervals and to unify the
value of each interval; this can be regarded as a problem of searching for suitable cut points on the feature domains [26]. Data discretization is a useful technique for RS preprocessing. Its direct or indirect applications include data compression, feature extraction [28] and feature selection [29], image classification [25], and knowledge rule discovery [30], to name a few. Lei et al. [28] used the discrete rough set to extract texture information from RS data and improved the performance of image classification. In [30], Leung et al. obtained classification rules for spatial data with rough set theory. Furthermore, discretization usually acts as a feature selection method in that it can significantly impact the performance of classification [25]. Considering all these remarks, it is an important task to develop an effective discretization algorithm to improve association-based feature selection methods for RS applications.

This paper is an extension of previous approaches and develops two novel association-based feature selection indexes. Furthermore, performance evaluations with other classifiers are necessary to strengthen the robustness and effectiveness of the association-based indexes, considering that the properties of feature selection approaches generally depend on the data sets and classifiers used. Therefore, this paper focuses on the following three issues that have not yet been touched in earlier work pertaining to classification applications.

1) The development of a novel discretization algorithm. It has been shown that supervised techniques are more beneficial to discretization than unsupervised ones, since an unsupervised discretization scheme built with no knowledge of the class labels often produces intervals containing a mixture of class labels. Moreover, supervised methods will discretize a feature into a single interval if it has little or no correlation with the class labels, which effectively removes the variable as an input to the classification algorithm. To this end, a discretization method based on the CV-test criterion (CVD) is proposed to maximize the class–feature interdependence and to generate a minimal number of discrete intervals.

2) The development of novel feature selection indexes, according to the criteria of maximal and minimal association: the CVD-based association index (CVDAI) and the class-attribute interdependence maximization (CAIM)-based association index (CAIMAI). Given different discretization schemes as a built-in function for discretization, different feature selection association indexes are obtained. The performances of CVDAI, CAIMAI, and EQWAI, which respectively take the CVD, CAIM, and EQW discretization schemes as the built-in function, were investigated and compared. In addition, three methods, mRMR, ReliefF, and F-score, were chosen for further comparison.

3) A comparison of the performance of classification using discretized features against original features. Normally, both discretized and raw feature subsets can be used in the classification process. Several researchers have shown that the use of discretized features is beneficial for the subsequent classification task if a classifier performs well with discrete features [31], [32]. Since the importance of each feature is measured and selected by the
characteristics of the CV-test measurement, it is therefore of interest to investigate the benefits of CVD-based features in the context of RS classification. We selected the J48 decision tree (J48-DT) and naive Bayes (NB) classifiers to validate the classification performances because they are both well-known classifiers that perform well with both continuous and discrete variables.

The rest of this paper is organized as follows. Section II describes the feature discretization algorithm. Section III discusses the feature selection indexes with the CV-test. Section IV provides the experimental data sets. Section V details the experiments with the three RS data sets. Finally, Section VI concludes this paper.

II. FEATURE DISCRETIZATION BASED ON THE CV-TEST

A. CV-Test

We assume that there exists a discretization scheme D on a continuous feature, which discretizes the continuous feature F into n discrete intervals bounded by the pairs of numbers

{[d_0, d_1], (d_1, d_2], \ldots, (d_{n-1}, d_n]}    (1)

where d_0 is the minimal value of feature F, d_n is its maximal value, and the values in the boundary set {d_0, d_1, d_2, ..., d_n} for discretization D are arranged in ascending order. Given the discretization scheme D, a 2-D frequency matrix, called the cross-tabulation or quanta matrix, can be obtained by treating the target classes and the discretized feature as two random variables. An illustration of the s × n cross-tabulation is shown in Table I.

TABLE I
ILLUSTRATION OF THE s × n CROSS-TABULATION

In Table I, q_ir is the total number of continuous values belonging to the ith class that fall within the interval (d_{r-1}, d_r], M_i+ is the total number of objects belonging to the ith class, and M_+r is the total number of continuous values of feature F that fall within the interval (d_{r-1}, d_r].

To examine whether there exists interdependence between two random variables, the cross-tabulation is usually calculated so that different statistical measurements can be defined from it. The widely used χ² test is one such measurement for determining the dependence of two variables, and it has been demonstrated to be an effective measurement for feature selection [29]. However, the χ² test for dependence is very sensitive to the sample size [33]; therefore, the CV-test is adopted to measure the strength of the relationship between the variables. Given the cross-tabulation shown in Table I, the CV-test can be directly derived from the χ² statistical test:

V = \sqrt{\frac{\chi^2}{M \min\{(n-1), (s-1)\}}}    (2)

where χ² is defined as

\chi^2 = M \left( \sum_{i=1}^{s} \sum_{r=1}^{n} \frac{q_{ir}^2}{M_{i+} M_{+r}} - 1 \right).    (3)
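For concreteness, (2) and (3) can be evaluated directly from the quanta matrix of Table I. The following minimal sketch (in Python with NumPy; the function and variable names are ours, not from the paper) illustrates the computation, assuming every class and interval is non-empty:

```python
import numpy as np

def cramers_v(Q):
    """Cramer's V-test value from an s x n quanta matrix Q (Table I),
    where Q[i, r] = q_ir; a sketch of (2) and (3), not the authors' code."""
    Q = np.asarray(Q, dtype=float)
    s, n = Q.shape
    M = Q.sum()                       # total number of samples
    Mi = Q.sum(axis=1)                # class totals M_i+
    Mr = Q.sum(axis=0)                # interval totals M_+r
    # chi^2 = M * (sum_ir q_ir^2 / (M_i+ M_+r) - 1), per (3)
    chi2 = M * ((Q**2 / np.outer(Mi, Mr)).sum() - 1.0)
    # V = sqrt(chi^2 / (M * min(n-1, s-1))), per (2)
    return np.sqrt(chi2 / (M * min(n - 1, s - 1)))
```

As a sanity check, a diagonal quanta matrix (perfect class–interval association) yields V = 1, while an outer product of the marginals (independence) yields V = 0.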
Although the CV-test can be regarded as a normalized χ² test, it has two main advantages over the χ² test. The first is that the CV-test is divided by the sample number M, making it insensitive to the sample size; it is therefore very useful in situations where one suspects that a statistically significant χ² value is the result of a large sample size rather than any substantive relationship between the variables [33]. The other advantage is that it is easy to choose a threshold to determine the interdependence between two variables, because its value varies between 0 and 1. It has been suggested that, in practice, a CV-test value of 0.1 provides a good minimum threshold for suggesting a substantive relationship between two variables [34].

B. Discretization With the CV-Test

Discretization generally falls into two distinct categories: unsupervised methods, such as EQW and k-means [26], which do not use any information in the class variable, and supervised ones, for example, CAIM [35] and entropy/minimum descriptive length [32], which partition continuous attributes into discrete variables with the involvement of the class labels. Given the range of values of a continuous feature, a conceptually simple discretization approach is to place the splits in such a way that they maximize the purity of the intervals. In practice, however, such an approach requires a potentially arbitrary decision about the purity of an interval and the minimum size of an interval. Statistic-based approaches, which are often used to overcome these concerns, generally start with each attribute value as a separate interval and create larger intervals by merging the adjacent intervals that are similar, according to a statistical test. One such well-known statistical criterion is CAIM [35], defined over the cross-tabulation as follows:

\mathrm{CAIM}(C, D|F) = \frac{\sum_{r=1}^{n} \max_r^2 / M_{+r}}{n}    (4)
where n is the number of intervals, r iterates through all the intervals from 1 to n, and max_r is the maximum value among all the q_ir values within the rth interval. Although CAIM is a promising discretization algorithm that does not require users to provide any parameters and, on average, outperforms other discretization schemes [35], it still suffers from the following drawbacks with regard to our purpose. First, a CAIM discretization scheme is inconsistent with the subsequent feature selection based on the CV-test association criterion. Second, the CAIM algorithm gives a high factor
to the number of generated intervals, resulting in the number of intervals being very close to the number of target classes [36]. Finally, in each discretized interval, CAIM considers only the class with the largest number of samples while ignoring all the other target labels. Such a consideration will, in some cases, decrease the quality of the produced discretization scheme.

Due to these observations, this paper uses the CV-test discretization criterion to measure the dependence between the target class C and the discretization variable D for feature F, which is defined as follows:

\mathrm{CVD}(C, D|F) = \frac{\chi^2_{CV}}{M \cdot \min[(n-1), (s-1)]}    (5)

where \chi^2_{CV} = \chi^2 / \ln(n). We divide χ² by ln(n) for two main reasons: the first is to speed up the discretization process, and the other is to reduce the strong influence of n on the discretization scheme. A comparison of (4) and (5) shows that the CVD criterion takes the distribution of all the samples into account through the terms q_ir²/(M_i+ M_+r), while CAIM considers only the maximal samples. As a result, CVD is expected to have a more powerful ability to capture the interdependence between class and feature. Like CAIM, the CVD criterion is a heuristic measure quantifying the interdependence between the classes and the discretized feature: the larger the value of CVD, the higher the interdependence between the class labels and the discrete intervals.

C. CVD Algorithm

Inspired by CAIM [35], the CVD algorithm starts with each feature value as a separate interval and creates larger intervals by merging adjacent intervals that are similar, according to the CV-test. For each feature, it uses a greedy approach to search for the approximate optimal value of the CVD criterion by finding locally maximum values of the criterion. Given the input data F = {F_1, ..., F_i, ..., F_d}, each feature F_i has n distinct values F_i = {f_i1, ..., f_in}. The flowchart of the proposed CVD algorithm is shown in Fig. 1.

Fig. 1. Flowchart of the proposed CVD algorithm.

From Fig. 1, it can be seen that, for each feature, the CVD algorithm's time bound is dominated by the calculation of the CVD criterion. In the worst case, the CVD criterion is evaluated O(n · s) times, where n is the number of distinct values of the discretized feature and s is the number of classes. Since the expected number of intervals per feature is O(s), the time bound for a single calculation of the CVD value can be estimated as O(s²). Thus, the time bound of the CVD algorithm can be estimated as O(n · s³) in the worst case.

III. FEATURE SELECTION INDEXES

The CV-test provides a useful measurement with which to judge and select the most important features with respect to their interdependence with the target classes. However, it has long been recognized that the selection of individually good features does not necessarily lead to a good classification performance, because the selected set is likely to include redundant features if highly dependent features are not excluded. Therefore, we construct a feature selection criterion that maximizes the association between the target labels and the features and, at the same time, keeps a minimal association between the selected features, with respect to the CV-test [25]. This is mathematically expressed as

\max \omega(\Omega) = |\Omega| \frac{\sum_{F_i \in \Omega} V(F_i, C)}{\sum_{F_i, F_j \in \Omega} V(F_i, F_j)}    (6)

where |Ω| is the number of elements in the selected feature subset, \sum_{F_i \in \Omega} V(F_i, C) is the CV-test value between the individual features F_i and class C, and \sum_{F_i, F_j \in \Omega} V(F_i, F_j) is the CV-test coefficient between features F_i and F_j.

From (6), it is clear that employing different discretization schemes can produce different feature selection subsets, because different discretized results cause the CV-test values to vary. Specifically, the CVD, CAIM, and EQW discretization algorithms are used as built-in functions to construct the corresponding CVDAI, CAIMAI, and EQWAI indexes. Intuitively, all the indexes have maximal target-class association abilities and, at the same time, minimal redundancy between the pairs of selected features. As a result, they are better able to exclude redundant features in the feature selection process. To obtain a feature selection subset, an incremental manner is used to find the near-optimal features defined by (7). Supposing that we already have Ω_{p-1}, the feature set with p − 1 features, to select the pth feature, the incremental algorithm optimizes the following condition:

\max_{F_j \in F - \Omega_{p-1}} \left[ \frac{(p-1) V(F_j, C)}{\sum_{F_i \in \Omega_{p-1}} V(F_j, F_i)} \right].    (7)
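To illustrate how the incremental optimization of (6) and (7) proceeds, a minimal sketch is given below. Here, V_fc and V_ff denote precomputed feature–class and pairwise feature–feature CV-test values (e.g., obtained with a routine such as cramers_v above), and all identifiers are illustrative rather than the authors' implementation:

```python
import numpy as np

def association_index_rank(V_fc, V_ff, num_select):
    """Greedy incremental feature ranking per (6)-(7); a sketch.
    V_fc: (d,) CV-test values V(F_i, C) between each feature and the class.
    V_ff: (d, d) CV-test values V(F_i, F_j) between feature pairs."""
    d = len(V_fc)
    selected = [int(np.argmax(V_fc))]       # start from the single best feature
    while len(selected) < num_select:
        p = len(selected) + 1
        best, best_score = None, -np.inf
        for j in range(d):
            if j in selected:
                continue
            # (p-1) * V(F_j, C) / sum over the selected set of V(F_j, F_i), per (7)
            redundancy = sum(V_ff[j, i] for i in selected)
            score = (p - 1) * V_fc[j] / max(redundancy, 1e-12)
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

Each pass scans the unselected features once, which is consistent with the O(|Ω| · d) complexity discussed next.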
A greedy optimization is used to speed up the searching procedure, and the pseudocode of CVDAI for feature selection is shown in Fig. 2. It is clear from Fig. 2 that the feature selection can be carried out with O(|Ω| · d) complexity, where d is the number of features. As a result, the ranked features can be obtained rapidly, even if the dimension of the features is very high.

Fig. 2. Pseudocode of CVDAI.

IV. DATA COLLECTION

Experiments were conducted to test the performance of the proposed algorithms, using a high spatial resolution image (HSRI) and two hyperspectral RS images.

The first data set was a subset image with 3859 × 2806 pixels, chosen from a QuickBird image of Fuzhou, China; it is referred to as the FZD data set in what follows. This data set was acquired in June 2003 and comprises a panchromatic image with 0.6-m spatial resolution and multispectral imagery with 2.4-m resolution. In order to make full use of the spectral and spatial information, the panchromatic and multispectral images were fused, and a 0.6-m multispectral image was obtained. According to the ground-truth data, the test site contains typical parcels in urban areas, with different types of roads, vegetation patches, water, building areas, bare land, and shadow regions. The FZD image and the corresponding reference image are shown in Fig. 3(a) and (b), where the reference image was generated by human interpretation and validated by field investigation.

Because pixel-based classification schemes employing only spectral information often result in lower classification accuracy for HSRI [37], we adopted an object-based classification method to achieve relatively high classification accuracy. The image was first segmented into various objects using the fractal net evolution approach, with the aid of the Definiens Professional 5.0 tool. Various features, including spectral, shape, texture, and index features, were then generated from these segmented objects, adding up to 82 features in total. Detailed information about these 82 features can be found in [25]. The total number of segmented objects in the FZD image was 7141 [see Fig. 3(a)], of which 2981 samples with
ground investigation were employed for feature discretization and classification accuracy assessment.

Fig. 3. Information about the QuickBird image from Fuzhou, China (FZD): (a) natural-color image and segmented results and (b) reference image.

The second data set (XQD) was acquired in September 1999 from the Xiaqiao test site, a mixed agricultural area in Changzhou city, Jiangsu province, China, and consists of airborne pushbroom hyperspectral imagery (PHI). A subscene (346 × 350 pixels) of the PHI image with 80 bands was tested, with a spectral range from 417 to 854 nm. The ground-truth spectral data were collected in September 1999 with the SE590 field spectrometer. The image was classified into eight representative classes, i.e., corn, vegetable—sweet potato, vegetable—cabbage, soil, grass, road, water, and puddle/polluted water. A total of 4308 available samples were used for the subsequent performance evaluation. Fig. 4 shows the XQD image [see Fig. 4(a)], the average spectral curves of the classes [see Fig. 4(b)], and the corresponding test fields [see Fig. 4(c)].

Fig. 4. Information about the Xiaqiao hyperspectral image (XQD): (a) pseudocolor image with bands 65, 48, and 27; (b) corresponding spectral curves of the classes; and (c) identified classes based on field investigation.

The third data set used was the classical 220-band Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) image (see Fig. 5) taken over Indiana's Pine Site (IPS) in June 1992. The image is 145 × 145 pixels, containing 16 crop-type classes and a total of 10 366 labeled pixels. This image is a classical benchmark used to validate model accuracy, and it constitutes a very challenging classification problem because of the strong mixture of the classes' signatures and the unbalanced number of labeled pixels per class. Since all the pixels were labeled by fieldwork, many researchers have used these data to validate their algorithms from various aspects. In the preprocessing phase, some water-absorption and noise bands were excluded, so that the final set contained a total of 185 spectral bands for further analysis. Since the sample sizes of some classes were too small for training and testing, only ten classes were considered: Corn-min, Corn-notill, Grass/Pasture, Grass/Tree, Hay-windrowed, Soybeans-min, Building/grass, Soybeans-clean, Soybeans-notill, and Woods. The available labeled samples for these classes amounted to 9275 pixels. Among them, about 300 samples of each class were randomly selected for the experiment, and the spatial distribution of the samples for each class is shown in Fig. 5. Table II summarizes the number of samples for each class in the experiment.

Fig. 5. Information about the AVIRIS data set: (a) gray image with band 35; (b) corresponding spectral curves of the classes; and (c) available ground-truth map of the scene.

TABLE II
MAJOR PROPERTIES OF THE DATA SETS USED IN THE EXPERIMENTS

V. EXPERIMENTAL RESULTS

To explore the effectiveness of the proposed methodologies, both feature discretization and feature selection were performed on all the data sets. The primary objectives
of the experiments reported in this section were as follows:
1) evaluating the performance of CVD discretization;
2) testing the improvement in feature selection of CVDAI and CAIMAI, compared with EQWAI;
3) comparing the association-based feature selection methods, CVDAI, CAIMAI, and EQWAI, with three other methods, ReliefF, mRMR, and F-score;
4) analyzing the selected feature subsets in terms of class-separability information and remaining feature redundancy;
5) testing and comparing the performances of using the original features against the CVD-based discretized features.

A. Performance of the Discretization Schemes

As stated in Section I, a good discretization scheme usually has the property of maximal class–feature interdependence [36]. We therefore adopted the class–attribute interdependence redundancy (CAIR) value [38] to evaluate the quality of the discretization. The CAIR criterion can effectively represent the interdependence between the target classes and the discretized features and is widely used to measure the quality of a discretization scheme [35]:

\mathrm{CAIR} = \frac{\sum_{i=1}^{s} \sum_{r=1}^{n} p_{ir} \log_2 \left( p_{ir} / (p_{i+} p_{+r}) \right)}{-\sum_{i=1}^{s} \sum_{r=1}^{n} p_{ir} \log_2 p_{ir}}    (8)

where p_ir = q_ir/M, p_i+ = q_i+/M, and p_+r = q_+r/M, following the notation of Table I. It is known from (8) that the higher the value of CAIR, the higher the interdependence between the class labels and the discrete intervals.

For comparison, two other widely used discretization algorithms, namely, EQW and CAIM, were also implemented. These two algorithms were selected because CAIM is a state-of-the-art supervised discretization algorithm [35], while EQW is the simplest approach and an effective unsupervised method [26]. In the discretization procedures, the supervised CVD and CAIM algorithms apply their own criteria to adaptively generate an appropriate number of discrete intervals, while the unsupervised EQW algorithm requires the user to specify the number of discrete intervals. In our experiments, we used the CAIR value to determine the optimal number of intervals, between 2 and 50. Fig. 6 shows the CAIR values for different numbers of EQW intervals. It can be seen that the optimal
intervals for FZD, XQD, and IPS were 6, 9, and 7, respectively.

Fig. 6. Determination of the optimal intervals for EQWAI.

The discretized results with the CVD, CAIM, and EQW algorithms are shown in Table III, where the bold values indicate the best results.

TABLE III
COMPARISON OF THE THREE DISCRETIZATION SCHEMES WITH DIFFERENT MEASUREMENTS

As can be seen from Table III, the supervised techniques outperformed the unsupervised one. Specifically, the CVD algorithm achieved the highest CAIR values for the FZD, XQD, and IPS data sets (0.113, 0.305, and 0.217, respectively), which indicates the highest class–feature interdependence. This is an important characteristic for a discretization scheme, because a higher CAIR value usually leads to a higher classification accuracy. In contrast, the discretization scheme with EQW obtained the lowest values of 0.10, 0.297, and 0.209, respectively. This is reasonable in that EQW partitions features into a fixed number of equal intervals, without knowledge of the class information.

To further quantify the impact of the different discretization algorithms on the subsequent classification task performed by the machine learning algorithms, the discretized schemes were used to produce the respective classification accuracies with the J48-DT and NB classifiers, without feature selection; the variance in the results is therefore entirely due to the discretization algorithms. We selected the J48-DT and NB classifiers because they are both among the top ten voted classification methods, owing to their simplicity, elegance, and robustness, and both handle discrete variables [39]. The classifiers were implemented with the help of the WEKA toolbox [40], which was developed at the University of Waikato in New Zealand. In our experiments, the splitting criterion for J48-DT was the default impurity measure, since the properties of the resulting final decision tree are insensitive to the choice of the splitting criterion. Another critical factor in designing a decision tree is its size; we adopted the commonly used approach of growing a tree to a large size first and then pruning the nodes according to the pruning parameter. The other parameters of the decision tree were as follows: the confidence threshold for pruning was 0.25, the minimum number of instances per leaf was 2, and the number of folds for reduced-error pruning was 3. NB, on the other hand, requires the user to input few parameters and provides competitive prediction accuracies [41]. To reduce the deviation in our experiments, tenfold cross-validation (CV) tests and the average overall accuracy were used to estimate the performance.

TABLE IV
COMPARISON OF OVERALL ACCURACIES USING ENTIRE DISCRETE FEATURES BY THE J48-DT AND NB CLASSIFIERS FOR THE THREE DISCRETIZATION SCHEMES, WITH THE FZD, XQD, AND IPS DATA SETS

A direct comparison of the results can be made from Table IV by inspecting the third, fifth, and seventh columns, which show the average overall accuracies for the FZD, XQD, and IPS data sets, respectively. It is clear that the best accuracies for the two classifiers were both achieved by the CVD algorithm. The second best results were obtained by the CAIM algorithm, and EQW had the lowest accuracy. The best accuracies for FZD with the J48-DT and NB classifiers were 88.29% and 84.07%, respectively, increases of 1.8% and 2.5% over the EQW algorithm. The same pattern was observed for XQD, where the accuracies derived from the CVD-based discretization scheme were 88.93% and 89.78%, respectively, increases of 1.1% and 2.2% over EQW. For IPS, the classification accuracies of the J48-DT and NB classifiers were 68.83% and 60.39%, increases over EQW of 3.39% and 4.45%. This should not be a surprise, since an interval constructed with no knowledge of the class labels will often contain a mixture of class labels.

B. Improvement of the Proposed Algorithms

In this section, the feature selection performances of CVDAI, CAIMAI, and EQWAI are evaluated. In addition to examining the overall accuracies generated by the J48-DT and NB classifiers, SVM was also implemented to show the influence of the association-based indexes on the subsequent feature selection and classification. SVM is one of the most popular classifiers and is often used as a benchmark method for RS classification. Therefore, a multiclass one-versus-one SVM classifier was applied to the original image data using the Gaussian RBF kernel. Although the pixelwise SVM classifier usually gives good classification accuracies, it must be noted that it is a computationally demanding algorithm for high-dimensional data and/or large training sets. The parameters C and γ were therefore tuned by threefold CV to reduce the computational cost, whose searching
ranges were within {e⁻⁵, e⁻³, ..., e¹¹} and {e⁻⁸, e⁻⁶, ..., e⁴}, respectively.

To complete such a task, CVDAI, CAIMAI, and EQWAI were used to generate their respective feature subsets. These feature subsets of the FZD, XQD, and IPS data sets were then used incrementally, according to the respective feature ranks, to evaluate the performances. We use the overall accuracy to evaluate the effectiveness of the selected features because it is a popular measure of classification performance. However, to determine which feature subset is superior, we argue that it is insufficient to compare only the overall classification accuracy for a specific size of feature set, or the best accuracy over different sizes of feature sets. A better way is to observe which subset is recursively more characteristic over a reasonably large range. For this purpose, the feature subset space should first be characterized. Given two feature subsets Ω_k¹ and Ω_k², both containing k features and generated from two different algorithms, we say that the feature set Ω_k¹ is superior if the overall accuracy with Ω_k¹ is better than that with Ω_k². We assume that the first algorithm generates a series of feature subset spaces Ω_1¹ ⊂ Ω_2¹ ⊂ ... ⊂ Ω_k¹ and, similarly, that the second algorithm produces another series of subsets Ω_1² ⊂ Ω_2² ⊂ ... ⊂ Ω_k². We say that Ω_k¹ is recursively better than Ω_k² in the range Δ = [k₁, k₂] if, for every subset size in this range, the measurement with Ω_k¹ is consistently better than that with Ω_k². Since the main purpose of feature selection is to obtain the most informative features from all the candidates, we focus on a small number of features and therefore evaluate the range Δ = [1, h_f], where h_f is half the size of the respective feature dimension. Consequently, the overall accuracy as a function of the number of features for the different feature selection approaches is plotted in Fig. 7.

Fig. 7. Comparison of the overall classification accuracies of the different feature selection approaches.

From the overall classification curves in Fig. 7, it is clear that, in most cases, CVDAI achieved the best performance for all the data sets with the J48-DT, NB, and SVM classifiers. CAIMAI achieved the second best performance, and in most cases its curves were also higher than those of EQWAI for different feature numbers. These results
consistently demonstrate that the association-based methods with the supervised CVD and CAIM discretization schemes, which involve the class label information, generally lead to an improvement in classification accuracy compared with the unsupervised EQW method. These observations are in accordance with Tables III and IV. A closer examination shows that CVDAI and CAIMAI obtained relatively high accuracies with smaller feature sets. The overall accuracies of J48-DT, NB, and SVM for the FZD data set were 89.1%, 90.0%, and 92.3% with 12, 16, and 10 features, respectively. For XQD, the optimal feature turning points of the classification curves were 16, 9, and 10 features, where the overall accuracies were 89.9%, 91.3%, and 95.4%, respectively. The turning points of the curves for the IPS data set were relatively indistinct, except for the NB classifier, where the overall accuracy was 60.4% with 13 features.

C. Accuracy Validation of Association-Based Feature Selection

To further demonstrate the effectiveness of the association-based feature selection methods, CVDAI, CAIMAI, and EQWAI, three other widely used feature selection algorithms, F-score, ReliefF, and mRMR, were implemented for comparison purposes. These three algorithms were chosen because of their effectiveness, representativeness, and popularity.

F-score [13] is a simple and effective algorithm for measuring the discrimination between a feature and the class label; it selects features that assign similar values to samples from the same class and different values to samples from different classes. The criterion evaluated by F-score can be formulated as

S(F_i) = \frac{\sum_{j=1}^{C} n_j (\mu_{i,j} - \mu_i)^2}{\sum_{j=1}^{C} n_j \sigma_{i,j}^2}    (9)
where μ_i is the mean of the feature F_i, n_j is the number of samples in the jth class, and μ_{i,j} and σ_{i,j}² are the mean and variance of F_i on the jth class, respectively. A larger value of F-score indicates that the feature is more significant. A known deficiency of F-score is that it considers each feature separately and therefore cannot effectively reveal the mutual information among features.

The key idea of ReliefF is to estimate the quality of features according to how well their values distinguish between instances that are near to each other. For that purpose, for each sampled instance x_t, ReliefF searches for its k nearest neighbors from the same class, NH(F_t), and from each different class, NM(F_t, C), and adjusts the feature weighting vector to give more weight to features that discriminate the instance from the neighbors belonging to different classes [16]:

R(F_i) = \frac{1}{P} \sum_{t=1}^{P} \left[ -\frac{1}{k} \sum_{x_j \in NH(F_t)} \mathrm{diff}(f_{t,i}, f_{j,i}) + \frac{P(C)}{1 - P(C_t)} \cdot \frac{1}{k} \sum_{x_j \in NM(F_t, C)} \mathrm{diff}(f_{t,i}, f_{j,i}) \right]    (10)

where C_t is the class label of the sample x_t and P(C) is the probability of a sample being from class C. f_{t,i} denotes the value of x_t on feature F_i, and diff(·) is the function used to calculate the difference between f_{t,i} and f_{j,i}. Usually, the sizes of NM(F_t, C) and NH(F_t) are equal and prespecified by a constant k. In our experiments, the diff(·) function was specified by the Euclidean distance. Two parameters are required for the ReliefF application: the drawing size P, which is related to the problem complexity, and the neighborhood parameter k, which is related to the distance described earlier. In our experiments, the parameter P was set to the number of training samples M so that all possible samples were used. The parameter k is insensitive, and the default value of 10 gave remarkable results for all the data sets.

The mRMR algorithm is another supervised weighting algorithm, which can be regarded as an approximation of the maximum dependence measured by the conditional entropy between the joint distribution of the selected features and the classification target [14], [15]. The optimization criterion of mRMR is as follows:

\max R(\Omega) = |\Omega| \frac{\sum_{F_i \in \Omega} I(F_i, C)}{\sum_{F_i, F_j \in \Omega} I(F_i, F_j)}    (11)
where F_i and C are multidimensional random variables, I(F_i, C) is the mutual information between the individual features F_i and class C, and I(F_i, F_j) is the mutual information between features F_i and F_j; |Ω| is the number of selected features. Comparing (6) and (11), it can be seen that their definitions are analogous, but the criteria used to measure the feature–class and feature–feature correlations are completely different.

The results of overall accuracy versus the number of features are also plotted in Fig. 7. It can be seen that the classifications with CVDAI, CAIMAI, and EQWAI were more accurate than those with the F-score, mRMR, and ReliefF approaches, particularly when the number of features was relatively small. However, as can be observed from Fig. 7, the accuracy discrepancies were much reduced as the number of features increased. It also appears from Fig. 7 that our methods provide features that are relatively insensitive to the precise choice of feature size, since the accuracy-versus-dimensionality curves are relatively flat beyond the initial knee of the curve. In general, the proposed methods were able to generate a relatively high accuracy with a very small number of selected features.
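Among the compared baselines, the F-score criterion in (9) is the simplest to state concretely. The following sketch (assuming integer class labels; an illustration of (9) only, not code from the compared implementations) makes the criterion explicit:

```python
import numpy as np

def f_score(x, y):
    """F-score of a single feature x per (9): between-class scatter
    divided by within-class scatter. A sketch under our naming."""
    mu = x.mean()
    classes = np.unique(y)
    # numerator: sum_j n_j * (mu_ij - mu_i)^2
    num = sum(x[y == c].size * (x[y == c].mean() - mu) ** 2 for c in classes)
    # denominator: sum_j n_j * sigma_ij^2 (population variance per class)
    den = sum(x[y == c].size * x[y == c].var() for c in classes)
    return num / den
```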
D. Quantifying the Comparison of the Performances

To quantitatively evaluate the performances, a quantification metric is needed that measures the average performance under different feature numbers, since the curves corresponding to the various methods overlap each other. To this end, we introduced and improved the metric in [42] to combine the overall accuracy of the selected features and the corresponding information content, so as to assess the whole performance at each selection step.
TABLE V
BEST ACCURACIES OF THE TEST DATA SETS BETWEEN THE RESULTS USING ENTIRE FEATURES. IN THE TABLE, “acc” AND “macc” DENOTE THE AVERAGE OVERALL ACCURACY AND ENTIRE METRIC. “num” DENOTES THE NUMBER OF FEATURES CORRESPONDING TO THE HIGHEST ACCURACY
Let H(i) denote the information entropy, defined as

H(i) = -\frac{i}{|\Omega|} \log \frac{i}{|\Omega|}    (12)

where i represents the number of features that have been selected and |Ω| is the number of features being evaluated. The metric, denoted “macc,” is then defined as

\mathrm{macc} = \sum_{i=1}^{|\Omega|} \mathrm{acc}_i \cdot \bar{H}(i)    (13)
where acc_i is the overall accuracy obtained with i features and \bar{H}(i) = H(i) / \sum_{i=1}^{|\Omega|} H(i). Using the “macc” metric as a holistic measurement, the entire performance of the algorithms can be quantitatively evaluated. It is known from (13) that the metric assigns more weight to small feature subsets. As a result, “macc” can check the performance across all the dimensions and simultaneously focus on the small numbers of features that are preferred in classification applications.

The quantitative results of the aforementioned methods are reported in Table V. The highest overall accuracies and “macc” metric values are shown in bold. In addition, the numbers of features corresponding to the highest accuracies are also presented. It can be seen from Table V that all the association-based methods, CVDAI, CAIMAI, and EQWAI, generally outperformed the other methods. Notably, CVDAI achieved the highest overall accuracies with the J48-DT, NB, and SVM classifiers: 89.14%, 90.03%, and 92.38% for FZD; 89.81%, 92.31%, and 96.65% for XQD; and 69.5%, 60.78%, and 87.33% for IPS. Moreover, CVDAI obtained the best results, in all cases, in terms of the “macc” measurement, indicating the best performance under different feature numbers. These results demonstrate that CVDAI outperforms the other methods. It can also be observed that the association-based algorithms, CVDAI, CAIMAI, and EQWAI, achieved relatively high classification accuracies when compared with the ReliefF, mRMR, and F-score methods, in terms of the “macc” measurement.

E. Analysis of Selected Feature Subsets

To analyze and better understand the performance of the proposed methods, we assessed the quality of the selected features with two metrics: the class separability (the scatter metric), which measures the discrimination effectiveness of the selected features [17], and the redundancy rate, which is the remaining redundancy contained in the selected features. The class separability measurement needs class information to judge the quality of the selected features, while the redundancy rate does not. We assume that S_w is the within-class scatter matrix and S_b is the between-class scatter matrix. They are defined as follows:

S_w = \sum_{i=1}^{c} \pi_i E\left\{(F - \mu_i)(F - \mu_i)^T \mid \omega_i\right\} = \sum_{i=1}^{c} \pi_i \Sigma_i

S_b = \sum_{i=1}^{c} (\mu_i - M_0)(\mu_i - M_0)^T

M_0 = E\{F\} = \sum_{i=1}^{c} \pi_i \mu_i    (14)
where π_i is the a priori probability that a sample belongs to class c_i and F is the feature vector. μ_i denotes the sample mean vector of class c_i, and M_0 is the mean vector of all the data samples. Σ_i is the sample covariance matrix of class c_i, and E{·} represents the expectation operator. The class separability index J of a data set is defined by

J = \mathrm{trace}\left(S_w^{-1} S_b\right).    (15)

The class separability of the respective feature subsets generated by the aforementioned methods is reported in Fig. 8.

Fig. 8. Class separability in the selected subset features.

It can be seen from Fig. 8 that, in most cases, the J values of CVDAI and CAIMAI are higher than those of EQWAI, ReliefF, mRMR, and F-score for the XQD and IPS data sets. A higher value of the separability criterion usually ensures that the classes are well separated by their scatter means, and hence it benefits the subsequent classification performance. It can also be seen from Fig. 8 that, for the FZD data set, ReliefF, mRMR, and F-score obtained higher values when the feature number ranged from 20 to 36; however, the proposed methods obtained the highest J values for the other feature numbers.

Assuming that Ω is the set of the selected features and F_i denotes the data that only contain the features in Ω, the redundancy rate of Ω is defined as

R(\Omega) = \frac{1}{|\Omega| (|\Omega| - 1)} \sum_{F_i, F_j \in \Omega, i > j} c_{i,j}    (16)

where c_{i,j} is the correlation between two features F_i and F_j. This metric assesses the averaged correlation between all the feature pairs; a large value indicates that many selected features are strongly correlated, and redundancy is therefore expected to exist in Ω.

Fig. 9. Redundancy contained in the selected subset features.
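The redundancy rate in (16) is straightforward to evaluate once the correlation measure c_{i,j} is fixed. The sketch below assumes the absolute Pearson correlation, which is one reasonable choice; the paper does not pin down the specific measure, so this is an illustrative assumption:

```python
import numpy as np

def redundancy_rate(X):
    """Average pairwise correlation of the selected features per (16).
    X: (samples, |Omega|) matrix restricted to the selected features.
    Assumes c_ij = |Pearson correlation|; a sketch, not the authors' choice."""
    C = np.abs(np.corrcoef(X, rowvar=False))   # |Omega| x |Omega| correlations
    k = C.shape[0]
    iu = np.triu_indices(k, k=1)               # one entry per pair, as in (16)
    return C[iu].sum() / (k * (k - 1))         # 1/(|Omega|(|Omega|-1)) * sum
```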
The redundancy rates of the selected feature subsets, as the number of features increases, are plotted in Fig. 9 for the different approaches. It is clear that, for the FZD and IPS data sets, the feature subsets generated by CVDAI and CAIMAI have the lowest feature redundancy in terms of our measurement. The feature subsets from EQWAI have an average redundancy rate, while the feature subsets from ReliefF, mRMR, and F-score contain much redundancy, particularly when the number of selected features is relatively small. The same results can be observed for the XQD data set, except for EQWAI, which has the same redundancy rate as CVDAI and CAIMAI. It can also be seen that the feature redundancy of F-score is the highest for almost all the data sets; this is because F-score cannot effectively reveal the mutual information among features in the feature selection process. These observations indicate some of the reasons why the proposed CVDAI and CAIMAI methods outperform EQWAI, ReliefF, mRMR, and F-score.

In addition, since the NB classifier depends greatly on the assumption of independence among features, its performance inevitably improves significantly if the feature subset contains little redundancy. On the other hand, J48-DT can reduce feature redundancy through pruning in the classification procedure. Therefore, the improvement for NB is more significant than that for J48-DT. These results are in accordance with Fig. 7.

F. Benefit of the Discretized Features

As supported by empirical evidence [43], a good discretization can help to significantly improve the classification performance of algorithms such as NB and semi-NB, which are sensitive to
the dimensionality of the data. After the discretization and feature selection procedures, we had both original and discrete ranked feature data sets available. Therefore, the quantitative classification performances of the original (continuous) features were compared against those of the discretized features. SVM is generally insensitive to data dimensionality and, to our knowledge, there is no evidence that SVM performs better with discrete data. Therefore, in this section, we only examine the benefits of using the J48-DT and NB classifiers with the CVD-based discrete features. It must also be noted that the discretization processing possibly results in a loss of information, which may reduce the classification accuracy of the SVM classifier. As a result, benefits can only be obtained for algorithms that perform better with discrete variables. To this end, the feature rankings with the CVDAI, ReliefF, mRMR, and F-score methods were first computed, and then the CVD-based discretized features and the corresponding continuous features were respectively used to test their classification performances under different numbers of features. The discrepancies in classification accuracy are therefore caused only by the input data, because the same sequential numbers of features were used. Note that CAIMAI and EQWAI were not used for comparison, since they are identical to CVDAI when a CVD-based discretization scheme is used.

The average classification results generated from both the continuous features and the discretized features are reported in Fig. 10.

Fig. 10. Discrepancies in the classification accuracy between original and discretized features.

It can be observed from Fig. 10 that, in most cases, the overall accuracies generated by the J48-DT and NB classifiers with discretized features, for all the data sets, were higher than those with the original features. To further judge whether the discrepancies between them are statistically
significant, we utilized the nonparametric McNemar test [44] to assess the statistical significance of the accuracy improvement. This test is based on the standardized Z-test statistic

Z = \frac{c_{12} - c_{21}}{\sqrt{c_{12} + c_{21}}}    (17)

where c_{12} denotes the number of samples classified correctly by the continuous-feature-based model but incorrectly by the discrete-feature-based model, and c_{21} denotes the reverse; that is, c_{12} and c_{21} are the counts of the classified samples on which the two models disagree. The sign of Z identifies the model with the lower prediction error (higher accuracy): a negative sign indicates that the discrete-feature-based model is more accurate than the continuous-feature-based one. At the commonly used 5% level of significance, the difference in the accuracies between the two models is evaluated to be statistically significant if |Z| > 1.96. This test helps us determine whether the discrepancies in classification accuracy are significant.

The average Z-test values for the original features against the discretized ones, with the different classifiers and data sets, are shown in Table VI. From Table VI, it is clear that, in most cases, the discretized features significantly improve on the original features with the NB and J48-DT classifiers, except for ReliefF and F-score on XQD and CVDAI on IPS, where the Z-values are −1.82, −1.84, and −1.59, respectively; these are close to the statistical threshold at the commonly used 5% significance level, and all three cases are associated with the J48-DT classifier. It must be noted that the Z-values in Table VI are the average values over all the cases, from one feature to all the features. We can see from Fig. 10 that the discrepancies between the original features and the discretized features are significant for small numbers of features.
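The statistic in (17) requires only the two disagreement counts; a minimal sketch (y_true, pred_cont, and pred_disc are label vectors; the names are ours) is:

```python
import numpy as np

def mcnemar_z(y_true, pred_cont, pred_disc):
    """Standardized McNemar Z-test per (17). c12 counts samples the
    continuous-feature model gets right and the discrete-feature model
    gets wrong; c21 is the reverse. |Z| > 1.96 is significant at 5%."""
    right_cont = (pred_cont == y_true)
    right_disc = (pred_disc == y_true)
    c12 = int(np.sum(right_cont & ~right_disc))
    c21 = int(np.sum(~right_cont & right_disc))
    return (c12 - c21) / np.sqrt(c12 + c21)
```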
TABLE VI
AVERAGE Z-VALUES FOR THE ORIGINAL AND DISCRETIZED FEATURES WITH THE DIFFERENT CLASSIFIERS AND DATA SETS, IN TERMS OF THE McNemar TEST
Discretization benefits the J48-DT classifier, since discretized values normally shorten the decision rules, reduce the uncertainty of the input variables, and can lead to improved predictive accuracy [31]. However, as an embedded technique, J48-DT already discretizes the continuous features within the classification procedure; therefore, the benefit of using CVD-based discretized features with J48-DT is smaller than with NB. Discretization is more effective for NB because discretized data can provide more accurate probability density estimation than the original data when the forms of the classes' probability distribution functions are not known. Intuitively, a discretized feature usually takes a small number of values and tends to have sufficient representative data to approximate the class-conditional probability. A continuous feature, on the other hand, tends to have a large or even infinite number of values; accordingly, with limited samples, there are usually very few training instances for any one value, resulting in an unreliable estimate of the class probability.

VI. CONCLUSION AND FUTURE WORKS

It is well known that identifying the most characteristic features of the observed data set is critical to RS classification. In this paper, we investigated a novel feature discretization method (CVD) and two feature selection indexes (CVDAI and CAIMAI) for optimal feature subset selection. Taking CVD (CAIM) discretization as a built-in function, CVDAI (CAIMAI) obtains the optimal subsets by simultaneously satisfying the criteria of maximal association between the target labels and the features and minimal association between the selected features, with respect to the CV-test. The performance tests indicate that the CVD scheme has the ability to generate the highest class–feature interdependence. Moreover, taking CVD as a built-in function, CVDAI can significantly improve RS classification accuracies with the J48-DT, NB, and SVM classifiers, when compared with EQWAI. The experiments also demonstrate that CVDAI (CAIMAI) performs significantly better than mRMR and ReliefF in terms of overall accuracy. Moreover, our tests show that the use of CVD-based discretized features can further improve the classification accuracy of classifiers that perform well with discretized data (e.g., NB), in terms of the McNemar test.

Some further issues will be addressed in our future works. First of all, CVDAI and CAIMAI are computationally complex compared with the other methods. The discretization of continuous features with labeled information and the formation
Some further issues will be addressed in our future works. First of all, CVDAI and CAIMAI are computationally complex compared with the other methods; the discretization of the continuous features with the label information and the formation of the cross-tabulations used to obtain the joint associations between features are the two dominant calculation steps. A parallel implementation of our algorithm could help to relieve this problem. Second, the proposed scheme adopts an incremental forward selection strategy, which can make the selected feature subset unstable because a feature cannot be dropped once it has been selected. In our future work, we intend to improve the stability by introducing a mechanism that reselects features through a backward refinement of the already selected subset. Finally, we intend to identify the features that are common to different feature selection methods.
ACKNOWLEDGMENT
The authors would like to thank D. A. Landgrebe at Purdue University for providing the AVIRIS data. The authors would also like to thank the anonymous reviewers for their insightful comments, which have been very helpful in improving this paper.
REFERENCES
[1] J. A. Benediktsson, M. Pesaresi, and K. Arnason, "Classification and feature extraction for remote sensing images from urban areas based on morphological transformations," IEEE Trans. Geosci. Remote Sens., vol. 41, no. 9, pp. 1940–1949, Sep. 2003.
[2] Y. Zhao, L. Zhang, P. Li, and B. Huang, "Classification of high spatial resolution imagery using improved Gauss Markov random-field-based texture features," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 5, pp. 1458–1468, May 2007.
[3] A. Plaza, P. Martínez, J. Plaza, and R. Pérez, "Dimensionality reduction and classification of hyperspectral image data using sequences of extended morphological transformations," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 466–479, Mar. 2005.
[4] L. Zhang, X. Huang, B. Huang, and P. Li, "A pixel shape index coupled with spectral information for classification of high spatial resolution remotely sensed imagery," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 10, pp. 2950–2961, Oct. 2006.
[5] A. P. Jose, C. Manuel, and Z. W. James, "Feature selection in AVHRR ocean satellite images by means of filter methods," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 12, pp. 4193–4203, Dec. 2010.
[6] G. F. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Trans. Inf. Theory, vol. IT-14, no. 1, pp. 55–63, Jan. 1968.
[7] S. B. Serpico and L. Bruzzone, "A new search algorithm for feature selection in hyperspectral remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 39, no. 7, pp. 1360–1367, Jul. 2001.
[8] A. Ifarraguerri and C. Chang, "Unsupervised hyperspectral image analysis with projection pursuit," IEEE Trans. Geosci. Remote Sens., vol. 38, no. 6, pp. 2529–2538, Nov. 2000.
[9] J. Wang and C. Chang, "Independent component analysis-based dimensionality reduction with applications in hyperspectral image analysis," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 6, pp. 1586–1600, Jun. 2006.
[10] B. C. Kuo and D. A. Landgrebe, "Nonparametric weighted feature extraction for classification," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 5, pp. 1096–1105, May 2004.
[11] M. Renard and S. Bourennane, "Dimensionality reduction based on tensor modelling for classification methods," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 4, pp. 1123–1131, Apr. 2009.
[12] M. Pal, "Support vector machine-based feature selection for land cover classification: A case study with DAIS hyperspectral data," Int. J. Remote Sens., vol. 27, no. 14, pp. 2877–2894, Jul. 2006.
[13] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. New York, NY, USA: Wiley, 2001.
[14] B. Wu, Z. Xiong, Y. Chen, and Y. Zhao, "Classification of QuickBird image with maximal mutual information feature selection and support vector machine," Procedia Earth Planet. Sci., vol. 1, no. 1, pp. 1165–1172, Sep. 2009.
[15] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
[16] M. Robnik-Sikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Mach. Learn., vol. 53, no. 1/2, pp. 23–69, Oct. 2003.
[17] P. Mitra, C. A. Murthy, and S. K. Pal, "Unsupervised feature selection using feature similarity," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 301–312, Mar. 2002.
[18] A. Jain and D. Zongker, "Feature selection: Evaluation, application, and small sample performance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 2, pp. 153–158, Feb. 1997.
[19] F. M. B. Van Coillie, L. P. C. Verbeke, and R. R. De Wulf, "Feature selection by genetic algorithms in object-based classification of IKONOS imagery for forest mapping in Flanders, Belgium," Remote Sens. Environ., vol. 110, no. 4, pp. 476–487, Oct. 2007.
[20] D. Tuia, G. Camps-Valls, G. Matasci, and M. Kanevski, "Learning relevant image features with multiple-kernel classification," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 10, pp. 3780–3791, Oct. 2010.
[21] D. Tuia, F. Pacifici, M. Kanevski, and W. Emery, "Classification of very high spatial resolution imagery using mathematical morphology and support vector machines," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 11, pp. 3866–3879, Nov. 2009.
[22] R. Archibald and G. Fann, "Feature selection and classification of hyperspectral images with support vector machines," IEEE Geosci. Remote Sens. Lett., vol. 4, no. 4, pp. 674–677, Oct. 2007.
[23] M. Fauvel, J. Benediktsson, J. Chanussot, and J. Sveinsson, "Spectral and spatial classification of hyperspectral data using SVMs and morphological profiles," IEEE Trans. Geosci. Remote Sens., vol. 46, no. 11, pp. 3804–3814, Nov. 2008.
[24] B. Waske, S. van der Linden, J. Benediktsson, A. Rabe, and P. Hostert, "Sensitivity of support vector machines to random feature selection in classification of hyperspectral data," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 7, pp. 2880–2889, Jul. 2010.
[25] B. Wu, X. Wang, H. Shen, and X. Zhou, "Feature selection based on max–min-associated indexes for classification of remotely sensed imagery," Int. J. Remote Sens., vol. 33, no. 17, pp. 5492–5512, Sep. 2012.
[26] P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Upper Saddle River, NJ, USA: Pearson Education, 2006.
[27] Y. Ge, F. Cao, and R. F. Duan, "Impact of discretization methods on the rough set-based classification of remotely sensed images," Int. J. Digit. Earth, vol. 4, no. 4, pp. 330–346, Jul. 2011.
[28] C. Lei, S. Wan, and T. Y. Chou, "The comparison of PCA and discrete rough set for feature extraction of remotely sensed imagery classification—A case study on rice classification, Taiwan," Comput. Geosci., vol. 12, no. 1, pp. 1–14, Mar. 2008.
[29] H. Liu and R. Setiono, "Feature selection via discretization," IEEE Trans. Knowl. Data Eng., vol. 9, no. 4, pp. 642–645, Jul./Aug. 1997.
[30] Y. Leung, T. Fung, J. Mi, and W. Z. Wu, "A rough set approach to the discovery of classification rules in spatial data," Int. J. Geogr. Inf. Sci., vol. 21, no. 9, pp. 1033–1058, Oct. 2007.
[31] J. Catlett, "On changing continuous attributes into ordered discrete attributes," in Proc. Eur. Working Session Learn., 1991, pp. 164–178.
[32] U. M. Fayyad and K. B. Irani, "On the handling of continuous-valued attributes in decision tree generation," Mach. Learn., vol. 8, no. 1, pp. 87–102, Jan. 1992.
[33] A. Agresti and B. Finlay, Statistical Methods for the Social Sciences, 3rd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 1997, ch. 8.
[34] H. T. Reynolds, Analysis of Nominal Data (Quantitative Applications in the Social Sciences), 2nd ed. Thousand Oaks, CA, USA: Sage Publ., 1984, pp. 15–60.
[35] L. A. Kurgan and K. J. Cios, "CAIM discretization algorithm," IEEE Trans. Knowl. Data Eng., vol. 16, no. 2, pp. 145–153, Feb. 2004.
[36] C. J. Tsai, C. I. Lee, and W. P. Yang, "A discretization algorithm based on class-attribute contingency coefficient," Inf. Sci., vol. 178, no. 3, pp. 714–731, Feb. 2008.
[37] L. Bruzzone and L. Carlin, "A multilevel context-based system for classification of very high spatial resolution images," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 9, pp. 2587–2600, Sep. 2006.
[38] J. Y. Ching, A. K. C. Wong, and K. C. C. Chan, "Class-dependent discretization for inductive learning from continuous and mixed-mode data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 17, no. 7, pp. 641–651, Jul. 1995.
[39] G. I. Webb, J. R. Boughton, and Z. Wang, "Not so naive Bayes: Aggregating one-dependence estimators," Mach. Learn., vol. 58, no. 1, pp. 5–24, Jan. 2005.
[40] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: An update," SIGKDD Explorations, vol. 11, no. 1, pp. 10–18, Jun. 2009.
[41] A. K. C. Wong and D. K. Y. Chiu, "Synthesizing statistical knowledge from incomplete mixed-mode data," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-9, no. 6, pp. 796–805, Nov. 1987.
[42] X. Chen, T. Fang, H. Huo, and D. Li, "Graph-based feature selection for object-oriented classification in VHR airborne imagery," IEEE Trans. Geosci. Remote Sens., vol. 49, no. 1, pp. 353–365, Jan. 2011.
[43] Y. Yang and G. Webb, "On why discretization works for naive-Bayes classifiers," in Proc. Adv. Artif. Intell. (Lecture Notes in Computer Science, vol. 2903), 2003, pp. 440–452.
[44] G. M. Foody, "Thematic map comparison: Evaluating the statistical significance of differences in classification accuracy," Photogramm. Eng. Remote Sens., vol. 70, no. 5, pp. 627–633, May 2004.
Bo Wu received the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 2006. From 2007 to 2008, he was a postdoctoral research fellow at The Chinese University of Hong Kong, Shatin, Hong Kong. In September 2008, he joined the Key Laboratory of Spatial Data Mining and Information Sharing of Ministry of Education, Fuzhou University, Fuzhou, China, as an Associate Professor. His current research interests include image processing, spatiotemporal statistics, and machine learning, with applications in remote sensing.
Liangpei Zhang (M’06–SM’08) received the B.S. degree in physics from Hunan Normal University, Changsha, China, in 1982, the M.S. degree in optics from the Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an, China, in 1988, and the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 1998. He is currently the Head of the Remote Sensing Division, State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University. He is also a “Chang-Jiang Scholar” Chair Professor appointed by the Ministry of Education of China and a Principal Scientist for the China State Key Basic Research Project (2011–2016) appointed by the Ministry of National Science and Technology of China to lead the remote sensing program in China. He has published more than 260 research papers and is the holder of five patents. His research interests include hyperspectral remote sensing, high-resolution remote sensing, image processing, and artificial intelligence. Dr. Zhang is a Fellow of the Institution of Engineering and Technology, an Executive Member (Board of Governors) of the China National Committee of the International Geosphere–Biosphere Programme, and an Executive Member of the China Society of Image and Graphics, among other bodies. He regularly serves as a Co-Chair of the series of SPIE Conferences on Multispectral Image Processing and Pattern Recognition, the Conference on Asia Remote Sensing, and many other conferences. He has edited several conference proceedings, special issues, and geoinformatics symposia. He also serves as an Associate Editor of the International Journal of Ambient Computing and Intelligence, the International Journal of Image and Graphics, the International Journal of Digital Multimedia Broadcasting, the Journal of Geo-spatial Information Science, and the Journal of Remote Sensing, and is an Associate Editor of the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING.
Yindi Zhao received the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 2006. She is currently an Associate Professor with the School of Environment Science and Spatial Informatics, China University of Mining and Technology, Xuzhou, China. Her research interests include high spatial resolution imagery analysis, hyperspectral data processing, and pattern recognition.