Supervised Classification of Remotely Sensed Imagery Using a Modified k-NN Technique

Luis Samaniego, András Bárdossy, and Karsten Schulz

L. Samaniego and K. Schulz are with the UFZ - Helmholtz Centre for Environmental Research, Leipzig, Germany. A. Bárdossy is with the Institute of Hydraulic Engineering, University of Stuttgart, Germany.
Abstract: Nearest neighbor (NN) techniques are commonly used in remote sensing, pattern recognition, and statistics to classify objects into a predefined number of categories based on a given set of predictors. These techniques are especially useful in cases exhibiting a highly nonlinear relationship between variables. In most studies the distance measure is adopted a priori. In contrast, we propose a general procedure to find Euclidean metrics in a low-dimensional space (i.e. one in which the number of dimensions is less than the number of predictor variables) whose main characteristic is to minimize the variance of a given class label of all those pairs of points whose distance is less than a predefined value. k-nearest neighbor (k-NN) is used in each embedded space to determine the possibility that a query belongs to a given class label. The class estimation is carried out by an ensemble of predictions. To illustrate the application of this technique, a typical land cover classification using a Landsat 5-TM scene is presented. Experimental results indicate a substantial improvement in classification accuracy compared with approaches such as maximum likelihood (ML), linear discriminant analysis (LDA), standard k-NN, and adaptive quasiconformal kernel k-nearest neighbor (AQK).

Index Terms: Land cover classification, k-nearest neighbors, simulated annealing, dimensionality reduction, ensemble prediction.
I. INTRODUCTION

Products derived from remotely sensed imagery have become a cost-effective source of high-resolution spatiotemporal information, which is currently required by most environmental and climatic models as well as for diverse planning and monitoring activities [1]. A very common task in remote sensing applications is image classification, whose most common products include land cover maps, assessments of deforestation and burned forest areas, crop acreage and production estimates, and pollution monitoring [2]. Image classification is also commonly applied in other contexts such as optical pattern and object recognition [3]. As a result, a number of supervised classification algorithms have been developed in recent years to cope
with both the increasing demand for these products and the specific characteristics of a variety of scientific and industrial problems. In remote sensing, in particular, supervised classification algorithms are based on statistical and computational intelligence frameworks [4], [5]. Classical examples include the Gaussian maximum likelihood (ML) classifier [6], fuzzy-rule based techniques [7], [8], fuzzy decision trees [9], Bayesian and artificial neural networks [5], [10], [11], support vector machines [12], [13], and the k-nearest neighbor (k-NN) algorithm [14]-[18].

In general, a supervised classification algorithm consists of two phases: 1) the learning phase, in which the algorithm identifies a classification scheme based on spectral signatures obtained from "training" sites having known class labels (e.g. land cover types), and 2) the prediction phase, in which the classification scheme is applied to other locations with unknown class membership. The main distinction between the algorithms mentioned above is the procedure used to find relationships (e.g. "rules", "networks", or "likelihood" measures) between the input or predictor space and the output or class label, so that either an appropriate discriminant function is maximized [3] or a cost function accounting for misclassified observations is minimized [19]-[23]. In other words, they follow the traditional modeling paradigm that attempts to find an "optimal" parameter set which closes the distance between the observed attributes and the classification response [7], [24].

In this paper, we do not propose a new supervised classification algorithm but rather a technique to ameliorate some of the known shortcomings of k-NN and, as a result, increase the classification accuracy. Hereafter, this technique will be referred to as the Modified Nearest Neighbor (MNN) algorithm. Our approach is based on the k-nearest neighbor rule because it is an extremely flexible and parsimonious (perhaps the simplest [14]) supervised classification technique which requires only one parameter, namely the number of nearest neighbors k [2]. It is also very attractive for operational purposes because it requires neither preprocessing of the data nor assumptions with respect to the distribution of the training data [16]. Moreover, it is a quite robust technique because the single 1-NN rule guarantees an asymptotic error rate of at most twice the Bayes probability of error [14]. k-NN has, however, several shortcomings that have already been addressed in the literature. For instance, 1) its performance depends strongly on the selection of k; 2) pooling nearest neighbors from training data that contain overlapping classes is considered unsuitable; 3) the so-called curse of dimensionality can severely hurt its performance in finite samples [3], [16], [25]; and finally 4) the selection of the distance metric is crucial in determining the outcome of the nearest neighbor classification [25].

Several analytical frameworks have been conceived to address these issues, which affect the performance not only of k-NN but of all supervised classification algorithms. Let us start with a framework which has received a lot of attention in recent years. It comprises a number of algorithms whose common characteristic is the use of a kernel function to map the predictor space into a much higher dimensional feature space in which a new adaptive metric can be defined.
The main purpose of this transformation is to obtain neighborhoods with homogeneous conditional class probabilities [25]. At first glance, this mapping to a high-dimensional space appears to lead to intractable problems. Fortunately, this is not the case for feature spaces that fulfil Mercer's theorem, which states that "any kernel function can be expressed as a dot product in a high dimensional feature space only if this function is continuous, symmetric, and positive semi-definite" [26].
This kind of mapping has practical applications; for instance, supervised classification algorithms such as the kernel Fisher discriminant [27] or support vector machines [28] use these properties of the kernel function to transform a nonlinear classification problem in the predictor space into a linear classification problem in the feature space [3]. Other algorithms such as adaptive quasiconformal kernel nearest neighbor classification (AQK) [25] also take advantage of Mercer's theorem. Possible disadvantages of all those methods seeking a higher-dimensional feature space are the significant increase in time complexity and the risk of overfitting.

Another group of algorithms emphasizes the importance of an adaptive local metric whose main goal is to shrink the search for nearest neighbors in those directions in which the class centroids do not coincide while stretching it along the true decision boundary. The local metric is normally found in a transformed space having the same dimensionality as the predictor space. A classical example of this kind of algorithm is adaptive nearest neighbor classification [16].

Recently, algorithms seeking to reconstruct the manifold coordinate system in a lower dimensional space have caught the attention of researchers in several fields due to the widespread availability of high-dimensional data. Classical algorithms belonging to this kind of dimensionality reduction framework are principal component analysis (PCA) [29] and multidimensional scaling (MDS) [30]. Both algorithms aim for a low-dimensional embedding space but differ in the way they obtain it. The former finds an embedding space that best preserves the predictor variance as measured in the high-dimensional input space, while MDS finds an embedding that preserves the interpoint distances. These two methods, however, may fail to uncover nonlinear features present in the data set and consequently tend to overestimate the dimensionality of the data set [31]. Isometric mapping (ISOMAP) [22], [31], on the contrary, was proposed to amend these shortcomings. Asymptotically, it is able to recover the true dimensionality of the embedded nonlinear manifold as well as its geometric structure. This algorithm, however, may become computationally intractable for large data sets [22]. Feature and band extraction methods [23], in turn, search for a lower dimensional space in which each component is obtained as the average of a given set of predictors.

The MNN algorithm proposed in this study is a hybrid algorithm that combines algorithmic features of a dimensionality reduction algorithm and the standard k-NN. The main difference between MNN and the algorithms mentioned above stems from the condition imposed to find embedding spaces and their corresponding metrics. In MNN, the basic condition to find an embedding space is that the cumulative variance of a given class label for a given portion of the closest pairs of observations should be minimum. As a result, observations that share a given class label up to a certain range (say up to 10% of the maximum distance) are very likely to be grouped together while the rest tend to be farther away. In this way, MNN intends to produce neighborhoods where the class conditional probabilities tend to be homogeneous [25]. Thus, the embedding condition in MNN is class specific and not global as in the other methods described above.
In the following sections, we present a detailed description of the proposed method as well as results from its application in a test site in Germany.
II. PROPOSED MNN METHOD

A. Motivation for finding an embedded space

An important characteristic of the proposed method, in contrast to conventional nearest-neighbor approaches, originates in the way the distance between observations is estimated. Traditionally, k-NN requires the a priori definition of a metric in the predictor space x [14]. A variety of reasonable distances is usually tested because the selection of the distance affects the performance of the classification algorithm [16], [32]. Common examples from the literature are the L1 norm (or Manhattan distance), the L2 norm (or Euclidean distance), the L-infinity norm (or Chebychev distance), the Mahalanobis distance [33], and similarity measures such as the Pearson and the Spearman rank correlation. Split-sample comparisons may help during the selection of an appropriate distance, but there is no systematic procedure available for finding an appropriate distance metric. Moreover, all these norms are sensitive to scaling in the x-space.

In MNN, on the contrary, we search for a metric in a lower dimensional space, or embedding space, which is suitable for separating a given class. In essence, MNN finds a nonlinear transformation (or mapping) that minimizes the variability of a given class membership of all those pairs of points whose cumulative distance is less than a predefined value (D). It is worth noting that this condition does not imply finding clusters in the predictor space that minimize the total intra-cluster variance, which is the usual premise in cluster analysis (e.g. in the k-means algorithm) [6]. A mapping fulfilling this condition is schematically depicted in Fig. 1. In this example, a hypothetical classification problem consisting of two classes in two dimensions (say {x_1, x_2}), one of which almost completely surrounds the other, was used [16].

A lower dimensional embedding space is preferred in MNN, if possible, because of the empirical evidence that the intrinsic manifold dimension of multivariate data is substantially smaller than that of the inputs [16], [22]. A straightforward way to verify this hypothesis is to compute the eigenvalues of the covariance matrix of the standardized predictors and to count the number of dominant eigenvectors (for more details on this technique, the reader is referred to [29]). A fundamental issue to be considered during the derivation of an embedding space is the effect of the intrinsic nonlinearities present in multivariate data sets (e.g. hyperspectral imagery). It is reported in the literature that various sources of nonlinearity severely affect land cover classification products [22], [34]. This in turn implies that dissimilar class labels might depend differently upon the input vectors. Consequently, it is better to find class-specific embeddings and their respective metrics rather than a global embedding.

B. Problem formulation

Supervised classification algorithms aim at predicting the class label h ∈ {1, . . . , l} (from among l predetermined classes) that corresponds to a query x_0 composed of m characteristics, i.e. x_0 = {x_1, . . . , x_m}_0 ∈ R^m. To perform this task, a classification scheme c "ascertained" from the training set T = {(x_i, e_i) : i = 1, . . . , n} is required. Here, to avoid any influence from the ordering, the membership of observation i to class h is denoted by a unit vector e_i ∈ {e_h : h = 1, . . . , l} [e.g. class 2 is coded in a three-class case as e_2 = {0, 1, 0}]. Moreover,
Fig. 1. The left panel shows the distribution of the observations in the predictor space {x_1, x_2}; the right panel depicts the mapping x → B_1[x] into the space {u_1, u_2} found to separate class 1 from class 2 as much as possible. The main characteristic of this mapping is that it minimizes the cumulative variance of the class label e_1 for all those pairs (i, j) having at least one observation belonging to class 1 and whose distance d_ij is less than D. The class label e_i1 = 1 if observation i belongs to class 1, otherwise e_i1 = 0.
each class should have at least one observation, i.e. n_h > 0, and

\sum_{h=1}^{l} n_h = n

where n is the sample size of the training set.

In general, the proposed MNN algorithm estimates the classification response y_0 = c(x_0) = {y_1, . . . , y_l}_0 as a combination of selected membership vectors e_i = {e_i1, . . . , e_il} whose respective "transformed" inputs B_h[x_i] are the k nearest neighbors of the transformed value of the query B_h[x_0]. Since there are l possible embedding spaces in this case, l possible neighborhoods of the query have to be taken into account in the estimation (Fig. 1). Here B_h denotes a class-specific transformation function. Based on the considerations explained above, a stepwise algorithm is suggested for the estimation of the classification response:

1) An attribute h is selected.
2) An optimal distance d_h for the binary classification problem is identified.
3) The binary classification is performed using k-NN with the distance d_h, yielding for each query x_0 a partial membership vector {y_h}_0, h = 1, . . . , l.
4) Steps 1) to 3) are repeated for each attribute h.
5) The partial memberships are aggregated into the final classification response y_0.
6) The class exhibiting the highest class membership value is assigned to x_0.
Fig. 2. Schematic description of the estimation of the classification response y_0 using a simple example with three neighbors in each transformed space B_1 and B_2, respectively. The memberships of the selected neighbors (a to f) are aggregated into the soft response y_0 = (5/6, 1/6), which yields the crisp class 1. Note that the neighborhood of the query u_0 = B_h[x_0] depends on the transformation.
A schematic representation of the algorithm is depicted in Fig. 2. To apply this algorithm, however, the optimal metrics have to be identified subject to some conditions. A distance measure d_h is considered suitable for the binary classification of a given attribute h if the following discriminating conditions are met:

(a) the distances in the transformed space between pairs of points in which both correspond to the selected attribute are small, and
(b) the distances in the transformed space between pairs of points in which only one of them corresponds to the selected attribute are large.
These conditions ensure that the k-NN classifier works well at least for a given attribute in its corresponding embedded space. In other words, it is expected that the transformation would produce neighborhoods where the class conditional probabilities are homogeneous [14], [25]. This effect can be visualized in Fig. 1. In this respect, MNN has a strong similarity with adaptive kernel based NN algorithms, e.g. the adaptive quasiconformal kernel nearest neighbor algorithm (AQK) [25] or discriminant adaptive nearest neighbor classification [16]. The procedure to find such a homogeneous neighborhood, however, differentiates MNN from these types of algorithms, as will become apparent below. The distance between two points in MNN is defined as the Euclidean distance between their transformed coordinates [16]

d_{B_h}(u_i, u_j) = \| B_h[x_i] - B_h[x_j] \|    (1)

where B_h is a transformation (possibly nonlinear) of the m-dimensional space x into another κ-dimensional space u_h, usually with κ ≤ m:

u_h = B_h[x]    (2)
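For the linear transformations introduced later in Section II-C, (1) and (2) amount to the Euclidean norm of a difference of matrix-vector products. A minimal sketch follows; the function names and the toy data are ours, not part of the paper:

```python
import numpy as np

def transform(B_h, x):
    """Eq. (2): map a predictor vector x (length m) into the embedding space u_h = B_h x."""
    return B_h @ x

def d_Bh(B_h, x_i, x_j):
    """Eq. (1): Euclidean distance between the transformed coordinates."""
    return np.linalg.norm(transform(B_h, x_i) - transform(B_h, x_j))

# toy usage with kappa = 3 and m = 6 predictors
rng = np.random.default_rng(0)
B = rng.uniform(-1.0, 1.0, size=(3, 6))
print(d_Bh(B, rng.normal(size=6), rng.normal(size=6)))
```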
As a result, MNN requires finding a set of transformations B_h so that the corresponding distance d_h performs well according to conditions (a) and (b) mentioned above. In the present study, this was attained by a local variance function G_{B_h}(p) [35] that accounts, in a nonparametric form, for the increase in the variability of the class membership e_h with respect to the increase in the distance of the nearest neighbors. Formally, G_{B_h}(p) can be defined for 0 < p ≤ 1 as [24]

G_{B_h}(p) = \frac{1}{N(p)} \sum_{\substack{d_{B_h}(i,j) < D_B(p) \\ (i,j) \in C_h}} (e_{ih} - e_{jh})^2    (3)

where C_h is the set of pairs of observations in which at least one member belongs to class h,

C_h = \{ (i,j) : (e_i = e_h \vee e_j = e_h) \wedge j > i \}, \quad \forall h \in \{1, \dots, l\}    (4)

Here N_h denotes the cardinality of the set C_h, given by

N_h = |C_h| = \frac{n_h (2n - n_h + 1)}{2}    (5)

and N(p) refers to the number of pairs (i, j) ∈ C_h that have a distance d_{B_h}(i, j) smaller than the limiting distance D_B(p), formally defined as

N(p) = \left| \{ (i,j) : d_{B_h}(i,j) < D_B(p) \wedge (i,j) \in C_h \} \right|    (6)
where |{·}| denotes the cardinality of a given set. The transformation B_h can be identified as the one which minimizes the integral of the curve G_{B_h} versus p up to a given threshold proportion p*, subject to the condition that each G_{B_h} is a monotonically increasing curve (see Fig. 3). In this case, the estimator is justifiable considering that the larger the number of neighbors, the greater the distances between pairs, and hence the larger the cumulated variability of the class memberships. Put differently, close neighbors should have the same class membership, whereas further away ones should belong to different classes. Formally, this can be written as follows: find B_h, h = 1, . . . , l, so that

\min \int_0^{p^*} G_{B_h}(p)\, dp    (7)

subject to

G_{B_h}(p) \le G_{B_h}(p + \delta) \quad \forall\, p, \delta > 0    (8)
where δ is a given interval along the axis of p. It is worth noting that the objective function denoted by (7) will be barely affected by outliers (i.e. those points whose closest neighbors are relatively far away), which appear often in unevenly distributed data.

Fig. 3. Effect of the transformation B_1 on the values of the curve G_{B_1}(p) (cumulative variance G(p) versus cumulative distance p × 100 [%]) using the hypothetical data shown in Fig. 1. It is worth noting how this transformation homogenizes the neighborhoods (zero variance) of those pairs separated by small distances (lower left corner of the plot).

For computational reasons, (7) can be further simplified to

\sum_{q=1}^{Q} G_{B_h}(p_q) \rightarrow \min    (9)

where {p_q : q = 1, . . . , Q} is a set of Q proportions such that p_{q-1} < p_q < 1, ∀q, and p_Q = p* (see Fig. 3). The total number of pairs within class h (see Fig. 3) can be defined as N_h^* = |{(i, j) : (e_i = e_h) ∧ (e_j = e_h) ∧ j > i}|, ∀h; thus the threshold for class h can be estimated by

p_h = \frac{N_h^*}{N_h} = \frac{n_h - 1}{2n - n_h + 1}    (10)
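The quantities in (3), (9), and (10) can be evaluated for a candidate transformation as in the sketch below. It is only an illustration under our own assumptions: pairs are enumerated by brute force, D_B(p) is read as the distance below which a fraction p of the pairs in C_h falls, and the helper names are not from the paper.

```python
import numpy as np
from itertools import combinations

def G_curve(B_h, X, e_h, p_values):
    """Evaluate G_Bh(p_q), eq. (3), for the proportions p_q.

    X   : (n, m) training predictors
    e_h : (n,) binary membership in class h (the component e_ih)
    """
    U = X @ B_h.T                                     # u_i = B_h x_i, eq. (2)
    # C_h, eq. (4): pairs with at least one member in class h, j > i
    pairs = [(i, j) for i, j in combinations(range(len(X)), 2)
             if e_h[i] == 1 or e_h[j] == 1]
    d = np.array([np.linalg.norm(U[i] - U[j]) for i, j in pairs])
    v = np.array([(e_h[i] - e_h[j]) ** 2 for i, j in pairs])
    order = np.argsort(d)                             # pairs sorted by distance
    G = []
    for p in p_values:
        N_p = max(1, int(np.ceil(p * len(pairs))))    # pairs closer than D_B(p)
        G.append(v[order[:N_p]].mean())               # eq. (3)
    return np.array(G)

def objective(G):
    """Discretized objective (9) with the multiplicative monotonicity penalty W_h of Section II-C."""
    W = np.prod(np.maximum(1.0, G[:-1] / np.maximum(G[1:], 1e-12)))
    return W * G.sum()
```

The threshold p_h of (10) follows directly from the class counts, e.g. `p_h = (n_h - 1) / (2 * n - n_h + 1)`.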
In summary, this method seeks different metrics, one for each class h, in their respective embedding spaces defined by the transformations B_h, so that the cumulative variance of class attributes for a given portion of the closest pairs of observations is minimized. Consequently, these metrics are not global. An example of the effects of the optimization is shown in Fig. 3. The approach of finding class-specific metrics in different embedding spaces is what clearly differentiates MNN from other algorithms such as the standard k-NN or AQK. In contrast to MNN, k-NN employs a global metric defined in the input space [14], whereas AQK uses a kernel function to find ellipsoidal neighborhoods [16]. The kernel, in turn, is applied to find implicitly the distance between points in a feature space whose dimension is larger than that of the input space x [25].

C. Selection of an appropriate embedding space

There are infinitely many possibilities for selecting a transformation B_h. Among them, the most parsimonious ones are linear transformations [24]. Here, for the sake of simplicity, l matrices of size [κ × m] are used:
u_h = B_h x, \quad h = 1, \dots, l    (11)
It is worth mentioning that the k-NN method is similar to the proposed MNN if B_h = diag(1, . . . , 1). Appropriate transformations B_h have to be found using an optimization method. The generalized reduced gradient (GRG) method might be an option to solve this optimization problem, but due to the combinatorial nature of the problem [see (9)], it tends to find suboptimal solutions as a result of the large variability of the Hessian. Therefore, the following algorithm based on simulated annealing [36] is proposed:

1) Select a set of threshold proportions p_1 < p_2 < . . . < p_Q < 1. The value of p_Q can be chosen to be equal to k/n, with k being the number of nearest neighbors to be used for the estimation. For the other proportions, p_q = (q/Q) p_Q is a reasonable choice.
2) Select h = 1, . . . , l.
3) Select the dimension κ of the u_h space.
4) Fix an initial annealing temperature t.
5) Randomly fill the matrix B_h.
6) Check the monotonicity condition using W_h = \prod_{q=2}^{Q} \max\left(1, \frac{G_{B_h}(p_{q-1})}{G_{B_h}(p_q)}\right).
7) Calculate the objective function O = W_h \sum_{q=1}^{Q} G_{B_h}(p_q).
8) Randomly select a pair of indices (i_1, i_2) with 1 ≤ i_1 ≤ κ and 1 ≤ i_2 ≤ m.
9) Randomly modify the element b_{i_1 i_2} of the matrix B_h and formulate a new matrix B_h^*.
10) Calculate the penalty for non-monotonic behavior W_h^* = \prod_{q=2}^{Q} \max\left(1, \frac{G_{B_h^*}(p_{q-1})}{G_{B_h^*}(p_q)}\right).
11) Calculate O^* = W_h^* \sum_{q=1}^{Q} G_{B_h^*}(p_q).
12) If O^* ≤ O, then replace B_h by B_h^*. Else calculate R = \exp\left(\frac{O - O^*}{t}\right) and, with probability R, replace B_h by B_h^*.
13) Repeat steps 8) to 12) M times (with M being the length of the Markov chain [36]).
14) Reduce the annealing temperature t and repeat steps 8) to 13) until the objective function O reaches a minimum.
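A compact sketch of steps 4) to 14) for one class is given below; `score` is meant to wrap the G_curve/objective helpers sketched after (10). The Markov-chain length of 150 and the 2.5% geometric cooling mirror the values reported in Section V, while the number of chains and the initial temperature are placeholders rather than the authors' settings.

```python
import numpy as np

def anneal_B(score, kappa, m, t=1.0, cooling=0.975, markov_len=150,
             n_chains=200, rng=None):
    """Simulated-annealing search for one transformation matrix B_h (steps 4-14).

    score : callable returning the objective O of steps 6-7 (or 10-11) for a candidate matrix.
    """
    rng = rng or np.random.default_rng(0)
    B = rng.uniform(-1.0, 1.0, size=(kappa, m))   # step 5: random initial matrix, elements in [-1, 1]
    O = score(B)                                  # steps 6-7
    best_B, best_O = B.copy(), O
    for _ in range(n_chains):                     # step 14: repeat while cooling the temperature
        for _ in range(markov_len):               # step 13: Markov chain of length M
            i1, i2 = rng.integers(kappa), rng.integers(m)   # step 8: pick one element
            B_new = B.copy()
            B_new[i1, i2] = rng.uniform(-1.0, 1.0)          # step 9: perturb it
            O_new = score(B_new)                            # steps 10-11
            # step 12: keep improvements, accept worse moves with probability R = exp((O - O*)/t)
            if O_new <= O or rng.random() < np.exp((O - O_new) / t):
                B, O = B_new, O_new
                if O < best_O:
                    best_B, best_O = B.copy(), O
        t *= cooling                              # step 14: geometric temperature decrement
    return best_B, best_O
```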
In case a nonlinear function B_h is required, the proposed algorithm is, in general, not affected, because the coefficients of any nonlinear function can be associated with a matrix B_h = [b_{ητ}]_h. As a result, the generation of new solutions can be done as proposed in steps 8) and 9). The size of the matrix, however, might have to be slightly modified to meet the new requirements. Depending on the nonlinear function, the evaluation of the components u_{hι} may not be possible by matrix algebra but only by explicit algebraic evaluation. For example, if a polynomial equation, such as that used for the transformation in Fig. 1, is used, then the ι-th component of the embedding space for a given class h can be written as

u_{\iota h} = \sum_{\eta=1}^{m} \sum_{\tau=1}^{m} b_{\eta\tau h}\, x_\eta x_\tau + \sum_{\tau=1}^{m} b_{(m+1)\tau h}\, x_\tau    (12)

where h = 1, . . . , l, ι = 1, . . . , κ, and η, τ = 1, . . . , m. (The index of the observation is not included to ease the notation.)
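As an illustration of (12), the ι-th component of such a quadratic embedding can be evaluated directly; the layout of the coefficient block (an (m+1) × m array per output component) is our reading of the equation, not code from the paper.

```python
import numpy as np

def quadratic_component(b, x):
    """Eq. (12) for one embedding component.

    b : (m + 1, m) coefficients; rows 0..m-1 hold b_eta_tau, row m holds b_(m+1)_tau
    x : (m,) predictor vector
    """
    m = x.size
    quad = x @ b[:m, :] @ x     # sum over eta, tau of b_eta_tau * x_eta * x_tau
    lin = b[m, :] @ x           # sum over tau of b_(m+1)_tau * x_tau
    return quad + lin
```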
D. Algorithm complexity

The computational complexity of the proposed algorithm is determined by two main processes: 1) the estimation of the function defined in (3), and 2) the minimization problem stated in (7). The first process consists mainly of calculating the distances among all points belonging to the set C_h and then finding a permutation that rearranges them in ascending order. Using an efficient sorting algorithm (e.g. Singleton's algorithm [37]), the combined computational complexity scales as O(N_h log(N_h)), which in the worst case can be simplified to O(n^2 log(n)). For comparison, the complexity of k-NN scales as O(κ N_h). The second component is an optimization problem, which produces all elements of the matrix B_h in (2) for a given class h. In the present study, this problem is solved with simulated annealing, whose processing time scales as O(M log(n)), where n denotes the size of the training set and M the length of the Markov chain (see the algorithm, step 13) [38].

E. Definition of the estimator

The classification response y_0 corresponding to a query x_0 is found here with an ensemble of predictions (one for each transformation B_h) to ensure interdependency between all considered classes and to utilize the respective merits of each single estimation [12]. This characteristic of the classifier was already introduced during the selection of the matrices B_h by means of the composition of the set C_h defined in (4). In other words, the classification response is estimated as the mean of the respective class memberships of the k nearest neighbors of the query found in each transformed space. Formally, it can be written as

y_{0h} = \frac{1}{l\,k} \sum_{h'=1}^{l} \sum_{j_{h'}} e_{j_{h'} h}, \qquad j_{h'} \in \{ j : d_{B_{h'}}(0, j) < D_{h'}(k),\; j = 1, \dots, n \}, \quad h = 1, \dots, l    (13)
where D_{h'}(k) is the distance to the k-th nearest neighbor of the query x_0 in the transformed space u_{h'}. The modus operandi of the proposed estimation procedure is depicted in Fig. 2 using part of the example shown in Fig. 1. Other estimators are also possible [7]. The advantage of this approach is that one can always find an estimate for a given observation. Its main disadvantage, however, occurs for those points that do not have close enough neighbors (i.e. false neighbors), which, in general, leads to an erroneous estimation. As a result of (13), the classification vector y_0 has values in the interval 0 ≤ y_{h0} ≤ 1, h = 1, . . . , l, with \sum_{h=1}^{l} y_{h0} = 1; in other words, this algorithm delivers a soft classification that can be understood as the grade of membership of the observation x_0 in the class h [7]. If a hard classification is required, then the class having the highest y value is assigned to the given observation x_0.
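A minimal sketch of the estimator (13): the k nearest training observations are collected in every embedding space and their membership vectors are averaged. The brute-force neighbor search and the variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mnn_predict(x0, X, E, B_list, k):
    """Soft classification response y_0 of eq. (13).

    x0     : (m,) query
    X      : (n, m) training predictors
    E      : (n, l) training membership vectors e_i
    B_list : list of the l transformation matrices B_h, one per class
    """
    l = E.shape[1]
    y0 = np.zeros(l)
    for B_h in B_list:                    # one neighborhood per embedding space u_h
        d = np.linalg.norm(X @ B_h.T - B_h @ x0, axis=1)
        nn = np.argsort(d)[:k]            # the j with d_Bh(0, j) < D_h(k)
        y0 += E[nn].sum(axis=0)
    return y0 / (l * k)                   # soft memberships; argmax gives the crisp class
```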
Fig. 4. Location of the study area (the Upper Neckar basin near Stuttgart, Germany) and the land cover map (classes 1 to 4, plus the Neckar River) obtained with the MNN classifier using k = 10 neighbors. The size of the training set was n_h = 270 per class. Background: Landsat 5 TM, band 4, July 2003.
III. DATA AND CLASSIFICATION METHODS

For this study, a scene acquired by the Landsat 5-TM sensor in July 2003 (path 195, row 26) was used to derive a land cover map for the upper catchment of the Neckar River, near Stuttgart in Germany, with an area of approximately 4000 km² (Fig. 4). The upper Neckar catchment is bounded by the north-western edge of the Swabian Jura on the right bank of the Neckar River and by the Black Forest on its left bank. A land cover map of this area is required for assessing the hydrological consequences of land use/cover change. Four land cover classes exhibiting different hydrological behavior could be identified from aerial photos and topographic maps, namely: a) Class 1 (h = 1), composed of permeable areas covered by coniferous, deciduous, and mixed forest; b) Class 2 (h = 2), mainly composed of impervious areas with land uses such as settlements, industrial parks, highways, airport runways, and railway tracks; c) Class 3 (h = 3), mainly composed of permeable areas such as wetlands, fallow lands, or surfaces covered by crops, grass, and orchards; and d) Class 4 (h = 4), composed of areas covered by rivers, lakes, and reservoirs.

The TM sensor is a multispectral scanning instrument whose wavelengths range from the visible, through the mid-infrared, into the thermal-infrared portion of the electromagnetic spectrum, with a spatial resolution of 30 m. Using principal component analysis, it was determined that a combination of three eigenvectors of the covariance matrix of the standardized spectral wavebands is enough to explain 95% of the total variance of the training sample.
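The eigenvalue check described above can be reproduced with a few lines; the 95% threshold is the figure quoted in the text, while the function name and the expected input layout are our own assumptions.

```python
import numpy as np

def dominant_components(bands, var_threshold=0.95):
    """Number of principal components explaining `var_threshold` of the total variance.

    bands : (n_samples, n_bands) array of spectral values, e.g. the six TM wavebands
    """
    Z = (bands - bands.mean(axis=0)) / bands.std(axis=0)          # standardize each waveband
    eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]   # eigenvalues, descending
    explained = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(explained, var_threshold) + 1)
```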
For the sake of simplicity, six wavebands denoted as {x_1, . . . , x_m}, m = 6, were used for the subsequent analysis. They comprise band 1 [0.45-0.52 μm], band 2 [0.52-0.60 μm], band 3 [0.63-0.69 μm], band 4 [0.76-0.90 μm], band 5 [1.55-1.75 μm], and band 7 [2.08-2.35 μm]. The thermal band (band 6) was not included. Atmospheric correction of the surface reflectance of each waveband was not required in this study because only a mono-temporal scene was used [39].

Various researchers using empirical data have found that, regardless of the classifier, the overall accuracy is positively correlated with the size of the training sample n [12], [40]-[42]. Due to this fact, a number of guidelines have been proposed in the literature to ensure that a classification algorithm does not lead to a biased estimation of class labels. One of these rules relates the sample size to the dimensionality of the predictors and suggests that a typical sample size should be between 10 and 30 observations per waveband and per class [12], [43], [44]. This heuristic is particularly relevant when the classifier is based on the assumption that the classes are normally distributed, as is the case with ML. Here, several disjoint sets were sampled without replacement from the data mentioned above using stratified random sampling. Sampling on the boundaries of the respective classes was avoided to ensure that the class labels of the targeted classes are discrete. The largest sample size of the training set was fixed at n_h = 270 observations per class, to be on the safe side of the heuristic mentioned above. Additionally, four further calibration sets were sampled to perform a sensitivity analysis of the impact of the sample size on the efficiency of the classifiers; thus, the number of observations per class was fixed at n_h ∈ {30, 60, 90, 180, 270}. The validation set was unique for all split-sampling experiments and had n_h = 300 observations per class.

In this study, four other classifiers were used to compare the performance of MNN: the standard k-nearest neighbors (k-NN), maximum likelihood (ML), adaptive quasiconformal kernel nearest neighbors (AQK), and Fisher's linear discriminant analysis (LDA). k-NN, ML, AQK, and LDA were applied according to their standard formulations found in the literature [6], [25], [29] using the six predictors mentioned before. AQK was set up with a radial kernel function, and the conditional class probabilities were estimated with Gaussian kernels as suggested in [25]. For the application of the MNN algorithm presented in Section II-C, some parameters have to be set. The most important of them is the maximum threshold proportion p_Q required in (9). This parameter was set to p_Q = 0.3 for all training samples so that the cumulative variability reached at least 80% of the total.

IV. EVALUATION OF THE CLASSIFICATION ACCURACY AND COMPARISON

A thorough assessment of the classification accuracy is an essential element that should be provided along with any product derived from remotely sensed imagery (e.g. a land cover map). In the present case, considering that the parameters of each transformation function B_h are estimated from a random sample, it is also necessary to validate their performance with an independent set of observations. Possible validation methods are leave-one-out cross-validation and split-sample testing. The former is often used when only a small sample is available.
In this case, x_0 ∈ T and the calibration of the parameters is carried out with the remaining observations of the training set. The latter, on the contrary, requires a disjoint set (V, T ∩ V ≡ ∅), usually called the validation set. In this case,
the query x_0 ∈ V. The cardinality of V is denoted by v = |V| and could be greater than that of the training set.

There are many classification accuracy measures reported in the literature; the most widely used ones are derived from the error or confusion matrix [6], [45], [46]. This matrix is a contingency table that tabulates the predicted and the actual class memberships contained in the validation set. The elements on the principal diagonal of this matrix represent the correct matches, and all the remaining elements constitute the mismatches. Several measures can be estimated from this matrix, for instance: 1) the overall accuracy (OA), which represents the proportion of the total number of predictions that were correctly classified; 2) the producer's accuracy (PA) (1 minus the error of omission, in percent), which estimates the probability that a given "ground truth" reference is correctly classified; and 3) the user's accuracy (UA) (1 minus the error of commission, in percent), which, in contrast to the latter, estimates the probability that a classified observation matches its "ground truth" reference. For more details on this technique, the reader is referred to [20], [45], [47].

An advantage of the confusion matrix method is that it provides single accuracy indexes which can be used for further evaluation and comparison. These statistics, however, are subject to a number of sampling errors, mainly originating from defects in the reference data, its geographical position, and the size of the mapping unit. Consequently, it is necessary in a statement of classification accuracy to perform a rigorous statistical analysis of the uncertainty of these measures [48], for example by providing confidence intervals (say at the 95% level) [6] or other statistics such as the obtained maximum and minimum values [12]. Confidence intervals can be based on parametric statistical theory as suggested in [6], [42], [43] or obtained by nonparametric techniques such as the bootstrap. The latter was used in this study because it does not require any assumption with respect to the probability density function of the statistic of interest, which is unknown a priori and hence cannot be assumed Gaussian.

The main disadvantage of these measures stems from the fact that they are obtained from a confusion matrix based on a "hard" classification. It is acknowledged by several authors [20], [47] that this prerequisite affects the quality of the accuracy measures, especially in situations where there is known uncertainty in the ground data (i.e. the classes are fuzzy) and the classification is hard, or when the classifier delivers a "soft" classification and the classes have crisp membership values (e.g. MNN). A number of approaches have been suggested to compensate for this deficiency of the error matrix, among them the Shannon entropy [20] and a measure of ambiguity [7]. The Shannon entropy H is a dimensionless measure of uncertainty based on probability theory, which indicates the degree to which the class membership probabilities are partitioned between the defined classes [20]. Consequently, entropy is maximized when the membership is evenly distributed among all defined classes, and minimized when the membership is entirely associated with a unique class. The Shannon entropy of a discrete random variable e can be estimated based on the class membership probabilities P(y_h) by

H(e) = - \sum_{h} P(y_h) \log_2 P(y_h)    (14)
Additionally, the magnitude or seriousness of the error can be estimated by the measure of ambiguity b(y 0 ) [7], [47] which indicates whether the classifier c has yielded a crisp response. In other words, the smaller the ambiguity
index obtained for a query, the clearer the response of the classifier. Formally, this ambiguity index is estimated by

b(y_0) = 1 - \max_{h} y_{h0}    (15)
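Both uncertainty measures follow directly from a soft classification vector; a small sketch with our own helper names:

```python
import numpy as np

def shannon_entropy(y0):
    """Eq. (14): H = -sum_h P(y_h) log2 P(y_h); zero memberships contribute nothing."""
    p = np.asarray(y0, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def ambiguity(y0):
    """Eq. (15): b(y_0) = 1 - max_h y_h0."""
    return 1.0 - float(np.max(y0))

# example with the soft response y_0 = (5/6, 1/6) from Fig. 2
print(shannon_entropy([5/6, 1/6]), ambiguity([5/6, 1/6]))
```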
Considering that the present study introduces a new technique to improve the performance of the k-NN algorithm, it is necessary to perform a rigorous statistical test to establish whether the classification accuracy obtained by MNN is significantly different from those obtained with other standard algorithms. McNemar's test (1947) is a suitable non-parametric test to perform this statistical analysis because it was conceived to test the null hypothesis of "marginal homogeneity" of a 2 × 2 contingency table with a dichotomous attribute partitioning the inter-classifier agreement into correct and incorrect allocations. This test is, therefore, focused on those cases correctly classified by one algorithm but misclassified by the other [42], here denoted by b and c respectively. The McNemar statistic (without correction for continuity) can be estimated as

Z = \frac{b - c}{\sqrt{b + c}}    (16)

The square of the McNemar statistic is χ²-distributed with one degree of freedom. The null hypothesis can be rejected if |Z| ≥ 1.95, with the resulting conclusion that there is a statistically significant difference between two classifiers at the 95% confidence level. A positive value of this statistic indicates that the first named classifier had the higher accuracy.
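Given the per-observation correctness of two classifiers on the same validation set, (16) reduces to a few lines; the array-based formulation below is our own sketch.

```python
import numpy as np

def mcnemar_z(correct_a, correct_b):
    """Eq. (16): Z = (b - c) / sqrt(b + c), without continuity correction.

    correct_a, correct_b : boolean arrays, True where classifiers A and B label a
    validation observation correctly. b counts cases only A got right, c cases only B got right.
    """
    a = np.asarray(correct_a, dtype=bool)
    bb = np.asarray(correct_b, dtype=bool)
    b = int(np.sum(a & ~bb))
    c = int(np.sum(~a & bb))
    return (b - c) / np.sqrt(b + c) if (b + c) > 0 else 0.0
```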
V. RESULTS AND DISCUSSION

The boxplots shown in Fig. 5 are a convenient summary of the classification accuracy obtained with 100 independent training sets, each composed of n_h = 270 observations per class. Based on these simulations, it was established that the highest classification accuracy was achieved with MNN using linear transformations of size [3 × 6]. The classification accuracy of MNN was 97.8 ± 0.6% at the 95% level of confidence. These simulations were carried out to verify that the results obtained were not optimistically biased due to the parametrization adopted in a particular classifier and/or due to possible artifacts present in the data. Moreover, this graph shows that MNN is quite resistant to outliers and, consequently, the range of variation of its classification accuracy is significantly smaller than that of the other methods employed, at the 95% level of confidence. LDA and AQK also achieved very good classification accuracies, but they appeared to be more susceptible than MNN to outliers present in the training set. k-NN and ML produced the worst results in these experiments and, in general, appear to be quite sensitive to the composition of the training set. The error matrices obtained with each algorithm for the classification of the validation set are reported in Table I. To produce this table, all classifiers were trained with the sample whose classification accuracy was the closest to the median accuracy of the simulations.

All classification methods showed a positive correlation between accuracy and the size of the training set, as depicted in Tables II and III. These results are in agreement with the consulted literature (e.g. see [12]).
Fig. 5. Boxplots of the classification accuracy [%] obtained for each classifier (ML, k-NN, LDA, AQK, and MNN) over 100 simulations. The training sample size per class was n_h = 270.
The classification accuracy of k-NN, for instance, varied from 89.2% to 93.7% as the training sample increased from 30 to 270 observations per class, which corresponded to an increase of 5.0%. MNN was, on the other hand, the algorithm exhibiting the smallest sensitivity to the size of the training set with regard to both the producer's and overall accuracies, because only a small increase of 0.6% was found. It is worth noting that the maximum accuracy obtained with MNN during the simulations appeared not to be sensitive to sample size, as occurred with the other classifiers employed. Minimum values exhibited, however, a clear statistically significant relationship (p-value < 0.05). Moreover, the ranges of variation in accuracy were significantly reduced with MNN, as shown in Table II. Being less sensitive to the size of the training set is an advantage of MNN, considering that a small reduction in the training set by using an "intelligent training scheme" [42] would help to reduce considerably the costs of data collection without jeopardizing the quality of the results. Nevertheless, a generalized use of very small training sets (say n_h < 10m) is not encouraged here, because this might produce optimistically biased classifications as well as a clear increase in the ambiguity of the classification (see Fig. 8). Here m is the number of wavebands employed in the classification. All differences of accuracy with respect to training size shown in these tables are statistically significant (p-value < 0.05). In general, all experiments indicated that MNN, AQK, LDA, and k-NN performed better than ML (in this order) when the sample size per class (n_h) was at least 10m. This rule, however, was not always true for small training samples (see Table II).

In addition to the sensitivity of the classifiers with respect to the size of the training set (n_h), two more sensitivity analyses were carried out in this study. Firstly, we investigated how the dimension of the embedding space κ affects the classification accuracy of MNN. Because no heuristic is available in the literature for the determination of κ, a trial-and-error approach was followed.
Fig. 6. Relationship between the dimension of the embedding space (κ = 2, . . . , 4, i.e. transformation matrices B of size [2 × 6], [3 × 6], and [4 × 6]), the number of nearest neighbors (k), and the classification accuracy [%] obtained with MNN. The sample size in all cases was n_h = 270 per class.
κ was varied from 2 to 4 using the same sample size (n_h = 270). The results of this analysis, tabulated according to the number of nearest neighbors as well as the dimension of the transformation matrix B, are shown in Fig. 6. For the sake of simplicity, only a linear transformation was used in this analysis; however, a similar procedure can be followed with nonlinear transformations. This figure shows that the parameter κ played a very important role in the accuracy of MNN. In the present case the best classification accuracy was obtained for κ = 3. The other two embedding possibilities (κ = 2, 4) did not perform as well as the former, and hence corroborated that κ = 3 is the best choice for the dimension of a linear embedding space using six predictors. This parameter also coincided with the dimensionality of the covariance matrix of the predictors estimated as explained in Section II. It should also be noted that the marginal increase in the overall efficiency as κ approaches m decreases rapidly and that increasing κ by one increases the number of degrees of freedom l times. Thus, there is a trade-off between efficiency and the size of the embedding space. In general, one should seek lower dimensional spaces that produce robust classifications and use parsimonious transformations. A modification of the Mallows Cp [49] statistic [24] might also be used to determine the most efficient embedding dimension.

Secondly, we investigated how the NN-based classifiers (MNN, AQK, and k-NN) responded to the selection of the number of nearest neighbors k required for the estimation of the class label of a query. These sensitivities are shown in Fig. 7 and were all based on the largest training sample. This graph shows that both MNN and AQK are almost independent of the number of nearest neighbors, while the accuracy obtained with k-NN exhibited a clear inverse relationship with this parameter.

The pairwise comparison of the accuracy statements obtained with each method and for different training set
Fig. 7. Relationship between the number of nearest neighbors (k) and the classification accuracy in percent for the three NN-based classifiers (k-NN, AQK, and MNN). The sample size per class was n_h = 270.
sizes are shown in Table IV. Based on these results, it was possible to reject the null hypothesis (Section IV) in favor of its alternative in all cases. Consequently, it could be concluded that the classification results obtained with MNN did not appear to be similar to those obtained with the other classifiers (i.e. AQK, LDA, k-NN, and ML) at the 5% level of significance. This table also shows that the McNemar statistic is somewhat sensitive to sample size, as reported in the literature (e.g. [12]). In general, the McNemar statistic between the MNN classifier and any of the other classifiers employed (Table IV) tends to decrease when the sample size increases. This tendency is not quite as clear in the case of MNN and ML. These results also corroborate the capability of the MNN classifier to perform well with small sample sizes, as mentioned earlier and documented in Table II.

The uncertainty associated with a given class membership was assessed with the Shannon entropy H and the ambiguity index b, whose results for the largest sample are shown in Table V. In this table it can be appreciated, for instance, that both uncertainty measures are positively related. Additionally, both measures showed that classes 1 and 3 were the most heterogeneous ones, followed by classes 2 and 4, respectively. It can also be seen that ML performed quite poorly with respect to class 4, while k-NN and MNN produced very crisp responses. These measures also confirm the existence of considerable interclass heterogeneity in classes 1 and 3, which is a consequence of their definitions (see Section III). The ambiguity index can also be used to evaluate the quality of the final land cover map and to determine those places where the classification algorithm has not rendered a crisp response [say where b(y_0) > 0.5], as proposed in [7]. Furthermore, by plotting the density distribution of the ambiguity index and adopting a threshold [e.g. P(b ≤ 0.5) = 0.95], one can obtain a very reliable measure of the uncertainty of the classification (see Fig. 8), which in turn can be used to estimate the minimum sample size per class required to reduce the ambiguity of the classification present in the final product.

Finally, the land cover map of the part of the Neckar catchment contained within the Landsat scene mentioned above was produced.
Fig. 8. Probability distribution functions (pdf) of the ambiguity index b(y_0) as a function of the sample size of the training set n_h (panels for n_h = 5, 15, 30, 90, 180, and 270). All pdfs were obtained with the MNN classifier using k = 3 neighbors.
For the final product shown in Fig. 4, MNN using ten neighbors (k = 10) from the largest training sample (the same set used to calculate Table I) was applied to estimate the class label of all pixels within the study area. Compared with previous classifications done for this area [7], a remarkable increase in class 2 (i.e. built-up and transportation areas) can be appreciated, which corresponds quite well with the data obtained from the statistical yearbook of the State of Baden-Württemberg (r² ≈ 0.85) at the district level. A careful analysis of the spatial distribution of the uncertainty measures presented above should still be carried out to identify critical areas where the classifier performed poorly. In those situations, additional effort in ground-truth collection should be undertaken. This, however, is outside the scope of this study.

Each of the transformations required by MNN for each land cover class was found using the algorithm described in Section II-C. The effect of the proposed optimization algorithm is depicted in Fig. 9, where class 2 (i.e. built-up and transportation areas) is clearly separated from the others in the embedding space u. The optimized parameters found for this class were
B_2 = \begin{pmatrix} 0.08 & 0.03 & 0.03 & -0.20 & -0.07 & -0.14 \\ 0.01 & 0.10 & 0.10 & -0.09 & -0.44 & -0.02 \\ 0.20 & 0.28 & -0.12 & 0.10 & 0.04 & 0.07 \end{pmatrix}    (17)
It is worth noting that each element of B_2 was constrained during the optimization to the interval [-1, 1] to improve the convergence of simulated annealing. The parameters shown in (17) can be interpreted as the degree of importance of the original predictors (i.e. wavebands) for separating a given class label. In the previous example, for instance, wavebands 2 and 3 (x_2 and x_3, respectively) were the most relevant ones for separating class 2 from the rest.

As proposed by [12], classifiers should not be considered as competing options but rather as complementary sources of information that may help to achieve a higher classification accuracy, for instance by setting up an ensemble-based approach to classification. Likewise, MNN has the advantage of having specific metrics and predictors for a given class, whose contributions are aggregated to determine the most likely class label of a query. As a result, the number of false positives is considerably reduced.
TABLE I
ERROR MATRICES DERIVED FROM THE ML, k-NN, LDA, AQK, AND MNN CLASSIFICATIONS TRAINED WITH THE LARGEST TRAINING SET (n = 1200 OBSERVATIONS).

ML (overall accuracy = 90.6%)
                       Actual class
Predicted class      1     2     3     4   Total
      1            285     0    69     0     354
      2              0   290    10     3     303
      3             15    10   221     6     252
      4              0     0     0   291     291
      Total        300   300   300   300    1200

k-NN (overall accuracy = 93.7%)
                       Actual class
Predicted class      1     2     3     4   Total
      1            276     0    21     0     297
      2              0   289    14     0     303
      3             24    10   265     6     305
      4              0     1     0   294     295
      Total        300   300   300   300    1200

LDA (overall accuracy = 95.6%)
                       Actual class
Predicted class      1     2     3     4   Total
      1            277     8    13     0     298
      2              0   289     6     0     295
      3             23     0   281     0     304
      4              0     3     0   300     303
      Total        300   300   300   300    1200

AQK (overall accuracy = 97.1%)
                       Actual class
Predicted class      1     2     3     4   Total
      1            292     2    10     6     310
      2              0   291     2     0     293
      3              8     4   288     0     300
      4              0     3     0   294     297
      Total        300   300   300   300    1200

MNN (overall accuracy = 97.8%)
                       Actual class
Predicted class      1     2     3     4   Total
      1            293     0     9     0     302
      2              1   294     3     0     298
      3              6     5   288     1     300
      4              0     1     0   299     300
      Total        300   300   300   300    1200
In the case of the MNN classifier, a false positive occurs when the majority of the observations within the neighborhood of a given query belong to other classes. Consequently, finding an embedding space where one class is well separated from the others helps to minimize the number of false positives. The success of the MNN classifier stems from achieving this separation in the majority of the cases (Fig. 9). This, in turn, has led it to exhibit an average classification accuracy greater than that obtained with the other four classifiers employed in this study (Fig. 5).
TABLE II
MAXIMUM AND MINIMUM OVERALL ACCURACY IN PERCENT OBTAINED WITH EACH METHOD USING TRAINING SETS OF DIFFERENT SIZE PER CLASS n_h.

Training        ML           k-NN          LDA           AQK           MNN
set size    Min    Max    Min    Max    Min    Max    Min    Max    Min    Max
   30      86.1   92.8   85.8   91.9   91.7   94.4   91.8   96.1   95.1   98.2
   60      87.0   92.6   88.9   92.6   92.9   95.2   94.1   96.6   97.0   98.4
   90      87.8   92.3   90.1   93.0   93.3   95.7   95.7   96.9   97.5   98.6
  180      87.7   92.4   91.7   93.3   93.4   97.3   96.3   97.0   97.4   98.8
  270      88.3   91.9   90.9   93.8   93.7   97.7   96.2   98.1   97.6   98.7
Fig. 9. Location of the validation data (classes 1 to 4) in the predictor space (left; wavebands x_2, x_3, and x_7) and in the embedding space (right; components u_1, u_2, and u_3).
This characteristic of MNN might be of particular interest for any of the current and future global or regional observation activities, where large areas have to be (frequently) explored under limited resources. Good examples are the Global Observation of Forest and Land Cover Dynamics (GOFC-GOLD) and the Monitoring Agriculture by Remote Sensing and Supporting Agricultural Policy (MARS-CAP) projects, among many others.

Finally, we would like to present a short summary of the computational requirements of MNN and some details of the optimization algorithm. As mentioned in Section II-D, the most expensive process of MNN is the optimization using simulated annealing (SA). The computational time required to find one transformation matrix with the largest training set was on average 120 min using a single node of a Linux cluster with 32 AMD Dual Core Opteron 885 64-bit processors. Parallelization of the code, written in Fortran 95 and compiled with the PGI Visual Fortran compiler, scales the computational burden in proportion to the number of nodes employed.
TABLE III
OVERALL AND PRODUCER ACCURACIES IN PERCENT FOR INDIVIDUAL CLASSES OBTAINED WITH EACH CLASSIFIER CALIBRATED USING TRAINING SETS OF DIFFERENT SIZE PER CLASS n_h.

ML
Training set    Class 1   Class 2   Class 3   Class 4   Overall
    30            93.7      96.6      76.6      95.2      89.6
    60            94.4      96.7      76.2      96.5      89.8
    90            94.6      96.7      75.8      96.3      89.8
   180            94.9      96.7      74.0      96.6      89.4
   270            95.1      96.8      73.5      96.9      90.6

k-NN
Training set    Class 1   Class 2   Class 3   Class 4   Overall
    30            88.8      95.3      80.7      98.0      89.2
    60            90.3      95.5      85.1      98.0      91.1
    90            91.1      96.2      85.9      98.0      91.7
   180            91.7      96.3      87.7      98.0      92.5
   270            92.0      96.4      88.2      98.0      93.7

LDA
Training set    Class 1   Class 2   Class 3   Class 4   Overall
    30            88.9      94.4      92.2      97.8      93.3
    60            90.0      95.6      91.1      98.9      93.9
    90            91.1      96.7      92.2      98.9      94.7
   180            92.8      96.7      92.8      99.4      95.4
   270            92.3      96.3      93.7     100.0      95.6

AQK
Training set    Class 1   Class 2   Class 3   Class 4   Overall
    30            96.6      95.3      89.0      98.0      94.1
    60            96.5      97.4      93.7      98.2      96.1
    90            96.4      97.5      95.2      98.3      96.6
   180            96.8      97.8      96.7      98.8      97.2
   270            97.3      97.0      96.0      98.0      97.1

MNN
Training set    Class 1   Class 2   Class 3   Class 4   Overall
    30            97.4      98.5      94.8      99.9      97.2
    60            97.7      98.6      96.6     100.0      97.9
    90            97.7      98.5      97.1     100.0      98.0
   180            97.6      98.8      97.5     100.0      98.2
   270            97.7      98.0      96.0      99.7      97.8
In this respect, MNN was the slowest algorithm during the training phase. The computational time required to carry out the classification of the final map shown in Fig. 4 using the same system mentioned above was approximately 2.5 min.

Key parameters for finding "good solutions" with SA are the rate of reduction of the so-called temperature (t) and the length of the Markov chain M. The former should be selected high enough so that the acceptance ratio obtained within a Markov chain is at least 95%. Here, the acceptance ratio is defined as the ratio of the total number of accepted transitions to the total number of proposed transitions (i.e. M). The latter, on the contrary, should not be too small, because this would tend to compromise the quality of the solution (the so-called quenching). Additionally, the number of Markov chains should be large enough to ensure an acceptable convergence.
TABLE IV
McNEMAR'S STATISTIC Z OBTAINED FOR ALL PAIRWISE COMPARISONS OF THE ACCURACY STATEMENTS OBTAINED WITH EACH METHOD AND FOR EACH TRAINING SET SIZE PER CLASS nh. VALUES |Z| ≥ 1.95 DENOTE THAT THE DIFFERENCE IN ACCURACY OBTAINED BY TWO CLASSIFIERS IS SIGNIFICANT AT THE 95% CONFIDENCE LEVEL. A POSITIVE VALUE INDICATES THAT THE FIRST-NAMED CLASSIFIER HAD THE HIGHER ACCURACY.

Training   MNN v   MNN v   MNN v   MNN v   AQK v   AQK v   AQK v   LDA v   LDA v   k-NN v
set size   AQK     LDA     k-NN    ML      LDA     k-NN    ML      k-NN    ML      ML
30         3.68    6.27    6.36    4.52    4.75    2.22   -1.31    3.04    5.49   -2.36
60         4.13    5.33    7.92    8.61    3.32    2.87    4.33    1.06    4.30    1.29
90         4.74    6.34    9.28    8.86    4.08    4.49    4.18    1.56    5.16   -0.56
180        0.54    5.01    6.05    7.97    3.07    5.06    7.06    0.89    3.01    3.06
270        2.56    5.08    5.58    8.56    2.99    3.84    7.12    0.65    3.98    4.33
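For reference, a McNemar statistic such as the one reported in Table IV can be computed from the paired correctness of two classifiers on the same validation samples. The sketch below uses the standard normal approximation Z = (f_ab - f_ba) / sqrt(f_ab + f_ba) without continuity correction; the function name and the toy data are illustrative only and are not taken from the original study.

```python
import numpy as np

def mcnemar_z(y_true, pred_a, pred_b):
    """Z statistic comparing classifier A against classifier B.

    f_ab: samples classified correctly by A but incorrectly by B.
    f_ba: samples classified correctly by B but incorrectly by A.
    A positive Z indicates that A achieved the higher accuracy.
    """
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    ok_a = pred_a == y_true
    ok_b = pred_b == y_true
    f_ab = np.sum(ok_a & ~ok_b)
    f_ba = np.sum(~ok_a & ok_b)
    return (f_ab - f_ba) / np.sqrt(f_ab + f_ba)

# Toy example with four class labels; |Z| >= 1.95 would be significant
# at the 95% confidence level, as in Table IV.
rng = np.random.default_rng(0)
truth = rng.integers(1, 5, size=1000)
# Force a wrong (shifted) label for a small fraction of the samples.
pred_a = np.where(rng.random(1000) < 0.97, truth, (truth % 4) + 1)  # ~97% accurate
pred_b = np.where(rng.random(1000) < 0.92, truth, (truth % 4) + 1)  # ~92% accurate
print(f"Z(A vs B) = {mcnemar_z(truth, pred_a, pred_b):.2f}")
```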
TABLE V
MEAN ENTROPY AND AMBIGUITY INDEX FOR INDIVIDUAL CLASSES OBTAINED WITH THE FIVE CLASSIFIERS CALIBRATED WITH THE LARGEST TRAINING SET.

           Entropy per class               Ambiguity index per class
Method     1       2       3       4       1       2       3       4
ML         0.279   0.107   0.136   0.194   0.640   0.559   0.635   0.536
k-NN       0.180   0.120   0.149   0.000   0.046   0.037   0.089   0.002
LDA        0.665   0.260   0.419   0.099   0.165   0.037   0.081   0.012
AQK        0.219   0.199   0.378   0.141   0.039   0.071   0.118   0.019
MNN        0.164   0.099   0.179   0.002   0.052   0.030   0.080   0.006
In the present case, the maximum number of iterations was set to 150 000 and M = 150. The annealing temperature was decreased with a geometric decrement rule that reduced its current value by 2.5% every time this parameter was updated. In addition, if the stopping criteria (e.g., an acceptance ratio of not less than 1% and the percentage of reduction of the best solution after a given number of Markov chains) were not fulfilled when the iteration counter reached its maximum limit, a reheating strategy was put in place. For more details on this topic, the interested reader is referred to [36]. This approach allowed us to find solutions that were, ceteris paribus, better than those found by the other algorithms employed in this study. In this respect, SA applied to MNN behaves very similarly to other discrete optimization techniques used during the calibration/learning phase, for instance in the band extraction method [23] or in the selection of a fuzzy rule system as proposed in [7]. Once the optimal transformations have been found, the execution time of the MNN method is similar to that of k-NN. In the case of a large classification problem like the one presented here, the computational burden of estimating k-NN with standard sorting techniques becomes impractical. In such cases we selected the nearest neighbors of a query with a very efficient kd-tree algorithm [50], [51] implemented in the IMSL libraries [52].
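To make the cooling schedule concrete, the following sketch shows a generic simulated-annealing loop with a geometric temperature decrement of 2.5% per update, Markov chains of length M, and a simple reheating step when acceptance collapses. It is a schematic Python illustration under assumed interfaces (the objective `cost`, the move generator `perturb`, and all parameter names are hypothetical); it is not the authors' implementation, which is written in Fortran 95 and follows [36].

```python
import math
import random

def anneal(x0, cost, perturb, t0=1.0, alpha=0.975, chain_len=150,
           max_iter=150_000, min_accept_ratio=0.01, reheat_factor=10.0):
    """Generic SA loop: geometric cooling (2.5% decrement) with reheating."""
    x, fx = x0, cost(x0)
    best, fbest = x, fx
    t, iters = t0, 0
    while iters < max_iter:
        accepted = 0
        for _ in range(chain_len):                  # one Markov chain
            y = perturb(x)
            fy = cost(y)
            # Metropolis acceptance criterion
            if fy <= fx or random.random() < math.exp((fx - fy) / t):
                x, fx = y, fy
                accepted += 1
                if fx < fbest:
                    best, fbest = x, fx
            iters += 1
        if accepted / chain_len < min_accept_ratio:
            t *= reheat_factor                      # reheating strategy
        else:
            t *= alpha                              # geometric decrement (2.5%)
    return best, fbest

# Toy usage: minimize a 1-D quadratic (placeholder for the MNN objective).
best, fbest = anneal(x0=5.0, cost=lambda x: (x - 2.0) ** 2,
                     perturb=lambda x: x + random.uniform(-0.5, 0.5))
print(f"best x = {best:.3f}, cost = {fbest:.4f}")
```

For the prediction phase, the nearest neighbors of each query can be retrieved with a kd-tree; an open-source counterpart to the IMSL routine used here is, for instance, scipy.spatial.cKDTree.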
VI. CONCLUSIONS
In this paper we have introduced a modified k-NN technique (MNN) based on a combination of a local variance reducing technique and a linear embedding of the observation space into an appropriate Euclidean space. This method was applied to a supervised land use/cover (LUC) classification example using a Landsat scene. Our results showed that MNN classifications were significantly more accurate than those derived from k-NN, ML, AQK, and LDA (p-value < 0.05), as shown in Fig. 5. Further aspects of this method (besides optimizing the processing time) still have to be investigated before it becomes operational. So far, we have limited our analysis to the classification of agricultural (not reported here) and hydrological land use/cover types using Landsat images with a spatial resolution of 30 m. This needs to be extended to a broader spectrum of LUC classes that might also be aggregated to different informational levels. Moreover, data from different sensors and platforms, providing different spatial and spectral resolutions, need to be analyzed to explore the sensitivity and efficiency of our method. Analyzing in detail the robustness and transferability of the optimized transformation of the observation space to other geographic locations will help to keep the cost of image calibration to a minimum. In other words, the scale and temporal invariance of the transformation is to be further investigated. The MNN technique might be further improved by considering non-linear transformations and/or an appropriate formulation for their variation within the predictor space, for instance by incorporating features of the locally linear embedding technique [53]. While beyond the scope of this paper, these issues will direct our future research activities.

ACKNOWLEDGMENT
The authors appreciate the helpful comments from four anonymous reviewers and would also like to thank K. Hannemann for the preparation of the data.

REFERENCES
[1] L. Bruzzone, "An approach to feature selection and classification of remote-sensing images based on the Bayes rule for minimum cost," IEEE Transactions on Geoscience and Remote Sensing, vol. 38, no. 1, pp. 429–438, 2000.
[2] H. W. Zhu and O. Basir, "An adaptive fuzzy evidential nearest neighbor formulation for classifying remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 43, no. 8, pp. 1874–1889, 2005.
[3] K. R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An introduction to kernel-based learning algorithms," IEEE Transactions on Neural Networks, vol. 12, no. 2, pp. 181–201, 2001.
[4] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag, 2001.
[5] D. Stathakis and A. Vasilakos, "Comparison of computational intelligence based classification techniques for remotely sensed optical image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 44, no. 8, pp. 2305–2318, 2006.
[6] J. A. Richards and X. Jia, Remote Sensing Digital Image Analysis: An Introduction, 4th ed. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[7] A. Bárdossy and L. Samaniego, "Fuzzy rule-based classification of remotely sensed imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 40, no. 2, pp. 362–374, 2002, doi: 10.1109/36.992798.
[8] P. Deer and P. Eklund, "A study of parameter values for a Mahalanobis distance fuzzy classifier," Fuzzy Sets and Systems, vol. 137, no. 2, pp. 191–213, 2003, doi: 10.1016/S0165-0114(02)00220-8.
[9] D. E. van de Vlag and A. Stein, "Incorporating uncertainty via hierarchical classification using fuzzy decision trees," IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 1, pp. 237–245, 2007.
[10] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2–3, pp. 131–163, 1997, doi: 10.1023/A:1007465528199.
[11] P. M. Atkinson and A. R. L. Tatnall, "Introduction: neural networks in remote sensing," International Journal of Remote Sensing, vol. 18, no. 4, pp. 699–709, 1997.
[12] G. M. Foody and A. Mathur, "A relative evaluation of multiclass image classification by support vector machines," IEEE Transactions on Geoscience and Remote Sensing, vol. 42, no. 6, pp. 1335–1343, 2004.
[13] A. Massa, A. Boni, and M. Donelli, "A classification approach based on SVM for electromagnetic subsurface sensing," IEEE Transactions on Geoscience and Remote Sensing, vol. 43, no. 9, pp. 2084–2093, 2005.
[14] T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[15] W. S. Cleveland and S. J. Devlin, "Locally weighted regression: An approach to regression analysis by local fitting," Journal of the American Statistical Association, vol. 83, no. 403, pp. 596–610, 1988.
[16] T. Hastie and R. Tibshirani, "Discriminant adaptive nearest neighbor classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 6, pp. 607–615, 1996.
[17] H. Franco-Lopez, A. R. Ek, and M. E. Bauer, "Estimation and mapping of forest stand density, volume, and cover type using the k-nearest neighbors method," Remote Sensing of Environment, vol. 77, no. 3, pp. 251–274, 2001.
[18] W. Q. Shang, H. K. Huang, H. B. Zhu, Y. M. Lin, Y. L. Qu, and H. B. Dong, "An adaptive fuzzy kNN text classifier," in Computational Science – ICCS 2006, Part 3, Proceedings, ser. Lecture Notes in Computer Science. Berlin: Springer-Verlag, 2006, vol. 3993, pp. 216–223.
[19] G. J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. New York: Wiley, 1992.
[20] G. M. Foody, "Approaches for the production and evaluation of fuzzy land cover classifications from remotely-sensed data," International Journal of Remote Sensing, vol. 17, no. 7, pp. 1317–1340, 1996.
[21] G. Poggi, G. Scarpa, and J. B. Zerubia, "Supervised segmentation of remote sensing images based on a tree-structured MRF model," IEEE Transactions on Geoscience and Remote Sensing, vol. 43, no. 8, pp. 1901–1911, 2005.
[22] C. M. Bachmann, T. L. Ainsworth, and R. A. Fusina, "Exploiting manifold geometry in hyperspectral imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 43, no. 3, pp. 441–454, 2005.
[23] S. B. Serpico and G. Moser, "Extraction of spectral channels from hyperspectral images for classification purposes," IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 2, pp. 484–495, 2007.
[24] A. Bárdossy, G. S. Pegram, and L. Samaniego, "Modeling data relationships with a local variance reducing technique: Applications in hydrology," Water Resources Research, vol. 41, no. W08404, pp. 1–13, 2005, doi: 10.1029/2004WR003851.
[25] J. Peng, D. R. Heisterkamp, and H. K. Dai, "Adaptive quasiconformal kernel nearest neighbor classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 5, pp. 656–661, 2004.
[26] J. Mercer, "Functions of positive and negative type, and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London, vol. 209, no. 1, pp. 415–446, 1909.
[27] G. Baudat and F. Anouar, "Generalized discriminant analysis using a kernel approach," Neural Computation, vol. 12, no. 10, pp. 2385–2404, 2000.
[28] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett, "New support vector algorithms," Neural Computation, vol. 12, no. 5, pp. 1207–1245, 2000.
[29] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. San Diego, CA, USA: Academic Press Professional, Inc., 1990.
[30] K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis. London: Academic Press, 1979.
[31] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[32] D. G. Lowe, "Similarity metric learning for a variable-kernel classifier," Neural Computation, vol. 7, no. 1, pp. 72–85, 1995.
[33] P. C. Mahalanobis, "On the generalised distance in statistics," Proceedings of the National Institute of Science of India, vol. 2, no. 1, pp. 49–55, 1936.
[34] D. G. Goodin, J. Gao, and G. M. Henebry, "The effect of solar illumination angle and sensor view angle on observed patterns of spatial structure in tallgrass prairie," IEEE Transactions on Geoscience and Remote Sensing, vol. 42, no. 1, pp. 154–165, 2004.
[35] E. H. Isaaks and R. M. Srivastava, An Introduction to Applied Geostatistics. New York: Oxford University Press, 1989.
[36] E. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing. Chichester: John Wiley and Sons, 1989.
[37] R. C. Singleton, "Algorithm 347: An efficient algorithm for sorting with minimal storage [M1]," Communications of the ACM, vol. 12, no. 3, pp. 185–186, 1969.
[38] P. van Laarhoven and E. Aarts, Simulated Annealing: Theory and Applications. Dordrecht: Kluwer, 1992.
[39] C. Song, C. E. Woodcock, K. C. Seto, M. P. Lenney, and S. A. Macomber, "Classification and change detection using Landsat TM data: When and how to correct for atmospheric effects?" Remote Sensing of Environment, vol. 75, no. 2, pp. 230–244, 2001.
[40] X. Zhuang, B. A. Engel, D. F. Lozano-Garcia, R. N. Fernandez, and C. J. Johannsen, "Optimisation of training data required for neuro-classification," International Journal of Remote Sensing, vol. 15, no. 16, pp. 3271–3277, 1994.
[41] T. G. van Niel, T. R. McVicar, and B. Datt, "On the relationship between training sample size and data dimensionality of broadband multi-temporal classification," Remote Sensing of Environment, vol. 98, no. 4, pp. 468–480, 2005.
[42] G. M. Foody, A. Mathur, C. Sanchez-Hernandez, and D. S. Boyd, "Training set size requirements for the classification of a specific class," Remote Sensing of Environment, vol. 104, no. 1, pp. 1–14, 2006.
[43] J. L. van Genderen, "Remote sensing: Statistical testing of thematic map accuracy," Remote Sensing of Environment, vol. 7, no. 11, pp. 3–14, 1978.
[44] P. M. Mather, Computer Processing of Remotely-Sensed Images, 3rd ed. Chichester: Wiley, 2004.
[45] R. G. Congalton, "A review of assessing the accuracy of classification of remotely sensed data," Remote Sensing of Environment, vol. 37, no. 1, pp. 35–46, 1991.
[46] G. M. Foody, "Status of land cover classification accuracy assessment," Remote Sensing of Environment, vol. 80, no. 1, pp. 185–201, 2002.
[47] S. Gopal and C. Woodcock, "Theory and methods for accuracy assessment of thematic maps using fuzzy sets," Photogrammetric Engineering and Remote Sensing, vol. 60, no. 2, pp. 181–188, 1994.
[48] G. M. Foody, "Local characterization of thematic classification accuracy through spatially constrained confusion matrices," International Journal of Remote Sensing, vol. 26, no. 6, pp. 1217–1228, 2005.
[49] C. L. Mallows, "Some comments on Cp," Technometrics, vol. 15, no. 4, pp. 661–667, 1973.
[50] J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An algorithm for finding best matches in logarithmic expected time," ACM Transactions on Mathematical Software, vol. 3, no. 3, pp. 209–226, 1977.
[51] S. M. Omohundro, "Efficient algorithms with neural network behaviour," Journal of Complex Systems, vol. 1, no. 2, pp. 273–347, 1987.
[52] IMSL, Fortran Subroutines for Mathematical Applications, Visual Numerics, Inc., Houston, 1997, http://www.vni.com.
[53] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
Luis Samaniego received the B.S. degree (summa cum laude) in civil engineering from the Escuela Politécnica Nacional, Quito, Ecuador, in 1991. He received an M.S. degree in infrastructure planning in 1997 and a Ph.D. degree in civil engineering (summa cum laude) in 2003, both from the University of Stuttgart, Germany. He has fifteen years of experience in numerical analysis and water resources planning. He is currently a Senior Researcher at the Helmholtz-Centre for Environmental Research-UFZ, Leipzig, Germany. His research and work activities are in stochastic hydrology, applications of systems theory in integrated water resources planning, geostatistics, and remote sensing.
András Bárdossy received his diploma in mathematics in 1979 and a Ph.D. degree in mathematics in 1981, both from the Eötvös Loránd Tudományegyetem, Budapest, Hungary. In 1993 he received a Ph.D. degree in civil engineering from the University of Karlsruhe, Germany. Since 1995 he has been Professor of Hydrology and Geohydrology at the University of Stuttgart. His research activities are in stochastic hydrology, hydrometeorology, geostatistics, and remote sensing.
Karsten Schulz received a diploma degree in Geoecology in 1993 and a Ph.D. degree in soil physics in 1997, both from the University of Bayreuth, Germany. He is currently deputy head of the Department of Landscape Ecology at the Helmholtz-Centre for Environmental Research-UFZ, Leipzig. His main research activities are in model development of land surface processes, uncertainty estimation, remote sensing and data assimilation.