© SPIE Remote Sensing 2010, Toulouse (F) - downloaded from kernelcd.org
Unsupervised Change Detection by Kernel Clustering

Michele Volpi(a), Devis Tuia(a,b), Gustavo Camps-Valls(b) and Mikhail Kanevski(a)

(a) Institute of Geomatics and Analysis of Risk, Université de Lausanne, Quartier UNIL-Sorge, 1015 Lausanne, Switzerland
{michele.volpi,devis.tuia,mikhail.kanevski}@unil.ch
(b) Image Processing Laboratory, Universitat de València, Catedrático A. Escardino, 46980 Paterna, València, Spain
[email protected]

ABSTRACT
This paper presents a novel unsupervised clustering scheme to find changes in two or more coregistered remote sensing images acquired at different times. The method is able to find nonlinear decision boundaries for the change detection problem by exploiting a kernel-based clustering algorithm. The kernel k-means algorithm is used to cluster the two groups of pixels belonging to the ‘change’ and ‘no change’ classes (binary mapping). In this paper, we provide an effective way to solve the two main challenges of such approaches: i) the initialization of the clustering scheme and ii) the estimation of the kernel function hyperparameter(s) without an explicit training set. The former is solved by initializing the algorithm on the basis of the Spectral Change Vector (SCV) magnitude, and the latter is addressed by minimizing a cost function inspired by the geometrical properties of the clustering algorithm. Experiments on VHR optical imagery confirm the consistency of the proposed approach. Keywords: Unsupervised change detection, Kernel k-means, Clustering, Remote sensing, VHR imagery
1. INTRODUCTION
In recent years, the increasing number of Earth Observation satellites and the growing resolution of the acquired optical images have increased the interest of the remote sensing community in change detection. Satellites with enhanced spatial (fine-scale detection) and temporal (near real-time monitoring) resolution provide images particularly suited to studying the evolution of the ground cover: the detection of changes between images acquired at different times over the same geographical area has become a major research area. The analysis of multitemporal images can be addressed by two main paradigms: supervised and unsupervised (or clustering). The former requires a labeled set of examples provided by the user. It is particularly well suited when many classes of land cover evolution have to be detected and summarized in a map. The latter does not require labeled information: it generally provides binary maps and is particularly adapted to real-life problems, where the influence of the user must be minimal (i.e. no fitting of parameters, no manual thresholding and no training set definition).1–3
In the literature, many unsupervised change detection algorithms can be found. Several studies have addressed the automatic analysis of the difference image.4 An example is the comparison of the scale-invariant Mahalanobis distance between the pixels of the difference image, in order to map a specific typology of change.5 The advent of high resolution images with a short revisit time made it necessary to study the statistics of the multitemporal difference image accurately for these methods to be effective.
Bayes decision rule and Markov random fields were introduced to deal with the automatic selection of thresholds (exploiting the expectation-maximization algorithm) and to take contextual information into account.6 Similar principles of distribution estimation are nowadays adopted in Change Vector Analysis (CVA),7–9 where Spectral Change Vectors (SCV) are computed by subtracting the corresponding multidimensional pixels at different times and studying their magnitude (discriminating radiometric changes) and angles (discriminating ground cover classes).
Further author information: Michele Volpi, IGAR, Bâtiment Amphipôle, Quartier UNIL-Sorge, CH-1015 Lausanne. +41 21 692 3546
In parallel to unsupervised techniques, advanced machine learning techniques were introduced in the remote sensing community. In particular, kernel methods10 have shown accurate and robust behavior when applied to remote sensing data.11–13 Supervised change detection techniques exploiting these paradigms have shown their relevance in many studies,14–16 thus opening interesting fundamental research areas between pattern recognition and remote sensing image processing.
The rationale of this paper is to study the flexibility of kernel methods with respect to nonlinearities in the context of unsupervised change detection. Kernel methods build linear models in a (high dimensional) feature space into which the data are mapped. The resulting solution in the input space is nonlinear. Classical unsupervised partitioning methods are suboptimal at detecting binary changes because of the nonlinear nature of the change: however, if the feature space spanned by the kernel function maximizes separability, a linear partitioning algorithm can discover the correct partitioning. In order to exploit this intuition, the well known k-means algorithm is adapted to find clusters in that higher dimensional space by using its kernel counterpart.10, 17, 18 On the one hand the results are improved with respect to the classical explicitly linear algorithms, but on the other hand some problems arise. Issues related to the initialization of the kernel k-means and to the optimization of the kernel hyperparameter(s) are discussed, and effective ways to overcome these problems are proposed.
The rest of the paper is organized as follows. Section 2 introduces the kernel k-means algorithm. In Section 3 the change detection setting is introduced, discussing key problems and proposed solutions. Section 4 evaluates the effectiveness of the proposed approach on a QuickBird pansharpened image.
Section 5 concludes the paper and discusses some future perspectives.
2. THE KERNEL K-MEANS
This section presents the kernel k-means algorithm starting from the well known k-means clustering technique.19 This approach is very useful to discover a natural partitioning of the input patterns X in their input space X into k groups. The algorithm assigns a cluster membership k to the elements x_i ∈ X that minimizes the distance from the gravity centers m_k:

d²(x_i, m_k) = ||x_i − m_k||²,  (1)

where m_k = (1/|π_k|) Σ_{j∈π_k} x_j, the π_k are the elements assigned to cluster k, and |π_k| is their number. When all the patterns are assigned to their corresponding clusters, the mean vectors m_k are updated by averaging the coordinates of the elements of the cluster, thus providing a new gravity center. The process is then iterated until the centers stabilize and the algorithm converges to a minimum of d²(x_i, m_k), ∀ i, k. Standard k-means is particularly adapted to solve linear problems, i.e. input spaces organized in spherical clusters.
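The alternating assign/update iteration described above can be sketched as follows; this is a minimal illustrative implementation, not the exact code used in the paper (the function name and the convergence test on the centers are our own choices):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternately assign each x_i to its nearest
    center m_k (Eq. 1) and recompute m_k as the cluster mean."""
    rng = np.random.default_rng(seed)
    # initialize the centers with k distinct samples
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance of every sample to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # update step: new gravity centers as cluster means
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Note that this sketch assumes no cluster becomes empty during the iterations, which a good initialization (Section 3.1) makes unlikely.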
The kernel version of k-means relies on the same principles, but instead of working in the input space X , it works in a higher dimensional feature space H, in which non-spherical clusters in the input space are mapped into spherical ones, and can consequently be detected correctly. This higher dimensional space is usually induced by a mapping function ϕ(·), whose images ϕ(xi ) correspond to mapped samples in H. Using mapped samples, the k-means becomes:
d²(ϕ(x_i), m_k) = ||ϕ(x_i) − m_k||²,  (2)

where

m_k = (1/|π_k|) Σ_{j∈π_k} ϕ(x_j).  (3)

This is equivalent to

d²(ϕ(x_i), m_k) = ⟨ϕ(x_i), ϕ(x_i)⟩ + ⟨m_k, m_k⟩ − 2⟨ϕ(x_i), m_k⟩.  (4)
By plugging (3) into (4), and replacing the dot product ⟨ϕ(·), ϕ(·)⟩ by a proper kernel function k(·, ·), the kernel k-means formulation10, 17, 18 is obtained as:

d²(ϕ(x_i), m_k) = ⟨ϕ(x_i), ϕ(x_i)⟩ + (1/|π_k|²) Σ_{j,m∈π_k} ⟨ϕ(x_j), ϕ(x_m)⟩ − (2/|π_k|) Σ_{j∈π_k} ⟨ϕ(x_i), ϕ(x_j)⟩
                 = k(x_i, x_i) + (1/|π_k|²) Σ_{j,m∈π_k} k(x_j, x_m) − (2/|π_k|) Σ_{j∈π_k} k(x_i, x_j).  (5)
Kernel functions are applied to overcome the problems related to the explicit computation of the mapping function, which can be costly and difficult. With kernels, the value of the dot product in the feature space is evaluated directly from the values of the samples in the input space. The kernel values can be interpreted as a similarity measure between samples, and thus the kernel k-means can be seen as a clustering algorithm that groups similar points and separates dissimilar ones by working linearly in a higher dimensional feature space. As for the linear version, the process is iterated until convergence, by assigning the cluster membership k and solving the following minimization problem:
arg min_{m_k} {d²(ϕ(x_i), m_k)} = arg min_{m_k} { k(x_i, x_i) + (1/|π_k|²) Σ_{j,m∈π_k} k(x_j, x_m) − (2/|π_k|) Σ_{j∈π_k} k(x_i, x_j) }.  (6)
Note that, since the mapping is not explicitly known, the exact coordinates of the cluster centers in H cannot be computed explicitly. However, the explicit center coordinates are not needed to assign a pattern to its cluster. When needed, the pixel closest to the center (the medoid) is considered to be the center. In terms of complexity, kernel k-means scales as O(n²(ε + m)), where n is the number of samples, ε is the number of iterations and m is the dimensionality. The classical k-means algorithm, on the other hand, is less demanding, scaling as O(εnmk), where k is the number of clusters.
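Equations (5)–(6) can be evaluated from the kernel matrix alone, without ever touching H. The sketch below illustrates this, assuming a Gaussian RBF kernel and an initial assignment already available (Section 3.1); function names and the fixed-point stopping rule are ours:

```python
import numpy as np

def rbf_kernel(X, sigma):
    """Gaussian RBF kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_kmeans(K, labels, n_clusters=2, n_iter=100):
    """Kernel k-means: distances to the implicit centers in H are
    computed from the kernel matrix via Eq. (5); `labels` is the
    initial cluster assignment (assumed to leave no cluster empty)."""
    n = len(K)
    labels = labels.copy()
    for _ in range(n_iter):
        d2 = np.empty((n, n_clusters))
        for k in range(n_clusters):
            mask = labels == k
            nk = mask.sum()
            # Eq. (5): k(x_i,x_i) + (1/|pi_k|^2) sum_{j,m} k(x_j,x_m)
            #          - (2/|pi_k|) sum_j k(x_i,x_j)
            d2[:, k] = (np.diag(K)
                        + K[np.ix_(mask, mask)].sum() / nk ** 2
                        - 2 * K[:, mask].sum(axis=1) / nk)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

The first term of Eq. (5) is constant per sample, so it could be dropped from the argmin; it is kept here to match the equation term by term.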
3. THE CHANGE DETECTION SETTING
As mentioned above, two main issues have to be solved in order to apply this clustering algorithm in a completely unsupervised way. In this section, the problems of initialization and of kernel parameter estimation are detailed.
3.1 Overcoming bad initializations
The main issue of unsupervised algorithms is to find a proper initialization allowing the method to converge to the global minimum (‘true’ clusters) or to a sufficiently low local minimum. This issue can be greatly alleviated by choosing a near-optimal initialization, i.e. finding centers within or close enough to the correct clusters. In this case, the idea is to initialize the kernel k-means with two subsets that belong with high probability to their respective clusters. In order to estimate the ‘change’ and ‘no change’ class distributions from which the centroids are computed, the Spectral Change Vector7 magnitude is exploited. Change Vector Analysis (CVA) has been widely used and, after [Bovolo and Bruzzone, 2006], a wide range of applications has been reported (as initialization,20 as a change detector itself8 or for exploratory data analysis21), and its behavior is now largely understood. SCV analysis consists in computing the difference image and analyzing the distribution of magnitudes and angles in order to discriminate changes. In this paper we exploit the magnitude computed as δ = ||x_i^{t2} − x_i^{t1}||, where the x_i^{t1,t2} are the multidimensional pixels at the two times. This distribution can be seen as a mixture of two Gaussians, one for the unchanged pixels and another for the changed pixels. The interested reader can find more details in the aforementioned papers.
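The magnitude computation and the two-Gaussian view can be sketched as follows. This is an illustrative NumPy-only sketch: the paper does not specify its mixture-fitting procedure, so a tiny 1-D EM with percentile-based initialization (our assumption) stands in for it:

```python
import numpy as np

def scv_magnitude(img_t1, img_t2):
    """Magnitude of the Spectral Change Vectors, delta_i = ||x_i^t2 - x_i^t1||,
    with images given as (rows, cols, bands) arrays."""
    diff = img_t2.astype(float) - img_t1.astype(float)
    return np.sqrt((diff ** 2).sum(axis=-1)).ravel()

def fit_two_gaussians(delta, n_iter=200):
    """Tiny 1-D EM for the two-component Gaussian mixture of Figure 1:
    one mode for 'no change' (low magnitude), one for 'change' (high)."""
    mu = np.array([np.percentile(delta, 25), np.percentile(delta, 90)])
    var = np.array([delta.var(), delta.var()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        p = w / np.sqrt(2 * np.pi * var) * np.exp(-(delta[:, None] - mu) ** 2 / (2 * var))
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: update weights, means and variances
        nk = r.sum(axis=0)
        w = nk / len(delta)
        mu = (r * delta[:, None]).sum(axis=0) / nk
        var = (r * (delta[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var
```

The fitted means and variances give the threshold region of Figure 1, from which the pseudo training set is drawn.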
This principle is illustrated in Figure 1, where two subsets can be randomly initialized from each distribution, according to a threshold selected by the minimum-error decision rule. It is worth mentioning that, at this point, a change map could already be produced by assigning pixels according to the threshold on the magnitude distribution (the T in Figure 1). This solution is not optimal for many reasons: as pointed out in many studies,21 such an approach suffers greatly from residual registration misalignments and noise. Moreover, the separation of the ‘change’ and ‘no change’ clusters should be addressed by a nonlinear approach, due to the strong overlap between the two class distributions. This is particularly true for high / very high geometrical resolution images, where the class distributions strongly overlap and the images are affected by high variances.
[Figure 1 shows the magnitude density f(δ): the ‘no change’ region at low magnitude, the ‘change’ region at high magnitude, and the overlapping zone [T − t, T + t] around the threshold T.]
In the approach proposed in this paper, the near-Gaussian distributions of Figure 1 are exploited to estimate the cluster centroids (as a pseudo training set for the kernel k-means). Once a good initialization is obtained by correct thresholding, convergence is also favored (within the limits dictated by possible sensor noise or outliers in the pixel magnitude values). It is worth mentioning that the number of samples needed for the estimation of the kernel parameter(s) is only marginally important, while the description of the distribution should be complete in order to reproduce the variability of the data (i.e. the extent of the clusters).
Figure 1. Mixture of two Gaussians describing the two classes. The initial centers are randomly picked from the ‘change’ and ‘no change’ regions. T corresponds to a near optimal threshold separating the ‘no change’ distribution (left) from the ‘change’ one (right).
3.2 Learning the kernel parameter in an unsupervised way
The second big challenge is related to the fitting of the kernel parameters. Usually, such parameters are chosen by evaluating the algorithm on some labeled examples (e.g. leave-one-out or cross-validation) and retaining the parameter set Θ that minimizes some predefined cost function. In this paper we propose an unsupervised and geometrically inspired cost function that automatically chooses a correct parameter set for the dataset at hand. This cost function is formulated as:

arg min_Θ { [ Σ_k (1/|π_k|) Σ_{i∈π_k} d²(ϕ(x_i), m_k) ] / [ Σ_{k≠p} d²(m_k, m_p) ] },  (7)
where Θ is the set of parameters of the kernel function to be learned. The optimal geometrical distribution of the samples is formulated in terms of intra-cluster and inter-cluster distances. The distances induced in the feature space are used as an index to achieve the best possible description for kernel k-means. The minimization in Eq. (7) can be seen as a maximization of cluster separability: the minimization of the numerator favors clusters that are compact in terms of distances to their centers, while the maximization of the denominator favors a kernel that maps samples into two clusters with distant centers. Any search algorithm (e.g. line/grid search, simulated annealing and others) can be used to evaluate the cost generated by the elements of a given set of parameters Θ.
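Both terms of Eq. (7) can be computed from the kernel matrix via the kernel trick (the inter-center distance expands as ⟨m_k,m_k⟩ + ⟨m_p,m_p⟩ − 2⟨m_k,m_p⟩, each term a normalized block sum of K). A sketch, with an RBF kernel as in the experiments; the helper names and the way the line search is driven are our own choices:

```python
import numpy as np

def rbf_kernel(X, sigma):
    """Gaussian RBF kernel matrix for the samples in X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def clustering_cost(K, labels):
    """Cost of Eq. (7): averaged intra-cluster distances in H divided by
    the sum of squared inter-center distances, all from the kernel matrix."""
    ks = np.unique(labels)
    intra = 0.0
    for k in ks:
        m = labels == k
        nk = m.sum()
        # d^2(phi(x_i), m_k) for every i in pi_k, from Eq. (5)
        d2 = (np.diag(K)[m]
              + K[np.ix_(m, m)].sum() / nk ** 2
              - 2 * K[np.ix_(m, m)].sum(axis=1) / nk)
        intra += d2.sum() / nk
    inter = 0.0
    for k in ks:
        for p in ks:
            if k == p:
                continue
            mk, mp = labels == k, labels == p
            nk, npp = mk.sum(), mp.sum()
            # squared distance between implicit centers m_k and m_p in H
            inter += (K[np.ix_(mk, mk)].sum() / nk ** 2
                      + K[np.ix_(mp, mp)].sum() / npp ** 2
                      - 2 * K[np.ix_(mk, mp)].sum() / (nk * npp))
    return intra / inter
```

A line search then evaluates `clustering_cost` over a grid of σ values (after re-clustering the pseudo training set at each σ) and keeps the minimizer.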
3.3 The change detection algorithm
Starting with two coregistered and equalized images, the proposed algorithm can be summarized in four steps, illustrated in Figure 2.
Figure 2. The workflow of the proposed approach: images t1 and t2, initialization on the SCV magnitude, parameter estimation, centers computation, cluster assignment, and binary change map.

1) Initialization: in order to apply image differencing, the scenes must first be preprocessed in terms of histogram matching and normalization of their values. Then the initialization based on the thresholding of the SCV magnitude can be applied. The images are subtracted, and the difference vectors (the SCV) are analyzed considering their norm. The threshold and the confidence interval ([T − t, T + t] in Figure 1) indicate where the pixels are mixed in terms of magnitude: thus, outside this interval, the samples are more likely to belong to either group and a pseudo training set can be extracted.
2) Parameter estimation: once the correct threshold is found, the kernel k-means algorithm is exploited as a wrapper to choose the best parameter optimizing Eq. (7): the pseudo training set is clustered with different parameters until a minimum of the cost function is found.
3) Centroids computation: the algorithm returns the centroids and the cluster assignment that correspond to the best parameter. It is worth mentioning that the choice of computing the centroids only on a subset of pixels, and not on the whole image, is justified by two criteria. First, by the strong overlap of the classes: this way, unbiased centers of the two classes are computed, and the pixels in the overlapping part of the distributions are assigned to the corresponding cluster (the closest in H). Secondly, estimating the centers only on a proper subset of the image reduces both the computational time (in terms of algorithm convergence) and the computational complexity of the single iterations of kernel k-means. This is an important issue, especially taking into account the computational cost of the partitioning algorithm.
4) Change detection: once the centroids are computed, each pixel in the difference image is assigned to the cluster whose center is closest in H. To do that, kernel k-means with the optimized parameters is applied to the entire difference image.
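Step 4 only needs the cross-kernel between new pixels and the clustered pseudo training set, since each implicit center is defined by its members. A sketch of this assignment rule, assuming an RBF kernel (function and argument names are ours):

```python
import numpy as np

def assign_to_centers(Z, Xp, labels, sigma):
    """Assign every pixel of the difference image (rows of Z) to the
    closest cluster center in H, where the centers are defined by the
    clustered pseudo training set Xp with assignment `labels`."""
    def rbf(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))
    Kzx = rbf(Z, Xp)  # cross-kernel between new pixels and pseudo set
    d2 = []
    for k in np.unique(labels):
        m = labels == k
        nk = m.sum()
        Skk = rbf(Xp[m], Xp[m]).sum()
        # k(z, z) = 1 for the RBF kernel; a per-pixel constant anyway
        d2.append(1.0 + Skk / nk ** 2 - 2 * Kzx[:, m].sum(axis=1) / nk)
    return np.argmin(np.stack(d2, axis=1), axis=1)
```

Because only the cross-kernel block is needed, the full image can be assigned in chunks, keeping memory bounded by the pseudo-training-set size.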
4. DATA AND EXPERIMENTAL RESULTS
In this section, the proposed approach to unsupervised change detection is validated on a pansharpened QuickBird image of the city of Zurich (Switzerland). The available images are shown in Figure 3. The results of the proposed method are compared to simple thresholding of the histogram and to the linear k-means. Accuracies are evaluated in terms of AUC (Area Under the ROC Curve) estimated on the basis of the available ground truth. Additionally, binary confusion matrices are provided for a single experiment, together with basic accuracy metrics. A total of 15 experiments (corresponding to different initializations of the pseudo training set) were carried out for the kernel k-means approach (with a Gaussian RBF kernel function) and for the linear k-means. The centers are evaluated on the pseudo training set extracted on the basis of the given regions of the magnitude histogram: at each iteration, a balanced pseudo training set of 500 samples is extracted and used for computing the centroids for both clustering approaches. In order to have a deterministic term of comparison, the CVA was carried out by thresholding the magnitude distribution.

Figure 3. Images acquired in (a) 2002 and (b) 2006.
4.1 Results and discussion
The AUC for the three approaches is reported in Table 1. The nonlinear solution improves on its linear counterpart and on the unidimensional thresholding, reaching globally higher accuracies. It is interesting to see that the proposed approach greatly reduces the false alarms produced by the k-means clustering on the difference image (cf. ROC curves in Figure 4 and confusion matrices in Table 2). Regarding the true changes (the true positives for the ‘change’ class) the algorithms are not far apart in terms of performance, given the simplicity of the difference image. On the other hand, the false positive rate is greatly reduced by the proposed approach. The averaged ROC curves illustrated in Figure 4 and the AUC (for the k-means and the kernel k-means approaches) show strong performance in terms of detection of true changes for all the algorithms, with a better performance for the kernel approach.

    CVA: 0.912        k-means: 0.923        kernel k-means: 0.974

Table 1. Mean Area Under the ROC Curve (AUC). The averages are based on 15 independent experiments for the k-means and for the kernel k-means; the CVA was carried out only once.
Predicted (P) vs. actual labels:

                  CVA                  k-means              kernel k-means
    P             C         NC         C         NC         C         NC
    C             11031     13241      12242     12987      12160     7766
    NC            12778     190926     67        191180     149       196401
    OA (%)        93.25                93.96                96.34
    κ             0.57                 0.62                 0.74

Table 2. Confusion matrices and accuracy metrics (Overall Accuracy - OA; Cohen’s Kappa - κ) for three models (randomly chosen). ‘C’ corresponds to the ‘change’ class and ‘NC’ to ‘no change’.
In Figure 5, the final binary change detection maps are illustrated. The black color corresponds to the ‘change’ class while the white color corresponds to the ‘no change’ class. Note that for the kernel k-means and the k-means approaches, the maps represent the number of hits of the clustering algorithms. Thanks to the proper initialization, both algorithms converge to the correct solution in most of the iterations; only the k-means clustered unwanted pixels in one experiment (the light gray regions in Figure 5).
Figure 4. ROC curves (true positive rate vs. false positive rate) for the CVA, k-means and kernel k-means approaches.
Observing the clustering results in Figure 5, the kernel approach shows fewer false alarms, greatly reducing the effect of the shadows on the change detection. The CVA approach is affected by both shadowed pixels and remarkable differences in the reflective response of the ground, but its true positive ratio is high. The k-means approach reduces the effect of the shadows, but is greatly affected by the differences in the reflectance of the images and shows potential instability even if the centers are initialized on the magnitude. The kernel k-means, finally, shows a reduced effect of both principal sources of error. The shadows and the shadow-related changes are rarely assigned to the ‘change’ cluster. The radiometric differences between the images, although less than with the k-means scheme, still influence the false positive rate. Globally, in terms of true positive detection, the k-means and the kernel k-means perform similarly, but the most noticeable difference is found in terms
Figure 5. (a) CVA, (b) k-means and (c) kernel k-means. For (b) and (c), white corresponds to 0 hits (100% hits for the class ‘no change’) and black corresponds to 15 hits (100% hits for the class ‘change’). Here the term ‘hits’ refers to the total number of times that a given pixel is assigned to a given cluster.
of false alarms. These observations are summarized by the accuracy metrics: the κ values in Table 2 support this intuition, growing for the kernel k-means, which greatly reduces the false alarm rate. The Gaussian RBF kernel parameter was tuned by line search in the range σ ∈ [0.01, 0.1, . . . , 6]. The minimum of the cost function in (7) suggested average parameters in the interval [2.5, 3], corresponding to the mean distance between the pixels of the pseudo training set (2.9 on average).
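The observation that the optimal σ lands near the mean pairwise distance of the pseudo training set suggests that distance as a starting point (or grid center) for the line search. A small sketch of this heuristic (our own helper, not part of the paper's procedure):

```python
import numpy as np

def mean_pairwise_distance(X):
    """Heuristic starting point for the RBF bandwidth sigma: the mean
    Euclidean distance between the samples of the pseudo training set
    (the optimum reported above fell near this value, ~2.9 on average)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    d = np.sqrt(sq)
    # average over the strict upper triangle: each unordered pair once
    return d[np.triu_indices(len(X), k=1)].mean()
```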
5. CONCLUSIONS AND FUTURE WORK
The kernel clustering method shows great flexibility for the problem of change detection, finding nonlinear solutions to the problem. The main issues of such an approach have been discussed and solved: first, the initialization was addressed by finding a threshold on the magnitude distribution; second, a geometrically inspired cost function (which represents the ideal cluster geometry in the kernel-induced feature space) has been proposed to estimate the optimal kernel parameters (if any). Finally, the computational cost is kept low by controlling the number of samples needed for estimating the centers (the label assignment step costs O(n²m) for the kernel matrix computation, where n is the number of pixels and m the number of variables). The proposed approach shows improvements with respect to classical clustering techniques. Moreover, unsupervised kernel clustering offers great potential in terms of flexibility (e.g. introducing kernels adapted to the data, or using composite kernels for the fusion of information12) and thus seems to be a candidate for future research in unsupervised (and semi-supervised, or even active) change detection approaches.
ACKNOWLEDGMENTS This work has been partly supported by the Swiss National Science Foundation projects no. 200021-126505/1 and PBLAP2-127713/1 and by the Spanish Ministry of Science and Innovation under projects AYA2008-05965C04-03 and CSD2007-00018.
REFERENCES
[1] Singh, A., “Digital change detection techniques using remote sensing data,” Int. J. Remote Sens. 10(6), 989–1003 (1989).
[2] Coppin, P., Jonckheere, I., Nackaerts, K., Muys, B., and Lambin, E., “Digital change detection methods in ecosystem monitoring: a review,” Int. J. Remote Sens. 25(9), 1565–1596 (2004).
[3] Radke, R. J., Andra, S., Al-Kofahi, O., and Roysam, B., “Image change detection algorithms: A systematic survey,” IEEE Trans. Image Process. 14(3), 294–307 (2005).
[4] Fung, T., “An assessment of TM imagery for land-cover change detection,” IEEE Trans. Geosci. Remote Sens. 28(4), 681–684 (1990).
[5] Bruzzone, L. and Serpico, S. B., “Detection of changes in remotely-sensed images by the selective use of multi-spectral information,” Int. J. Remote Sens. 18(18), 3883–3888 (1997).
[6] Bruzzone, L. and Prieto, D. F., “Automatic analysis of the difference image for unsupervised change detection,” IEEE Trans. Geosci. Remote Sens. 38(3), 1171–1182 (2000).
[7] Malila, W. A., “Change vector analysis: An approach for detecting forest changes with Landsat,” in [Proc. LARS Mach. Process. Remotely Sensed Data Symp.], 326–335 (1980).
[8] Bovolo, F. and Bruzzone, L., “A theoretical framework for unsupervised change detection based on change vector analysis in polar domain,” IEEE Trans. Geosci. Remote Sens. 45(1), 218–236 (2006).
[9] Bovolo, F. and Bruzzone, L., “A split-based approach to unsupervised change detection in large size multitemporal images: application to Tsunami-damage assessment,” IEEE Trans. Geosci. Remote Sens. 45(6), 1658–1671 (2007).
[10] Shawe-Taylor, J. and Cristianini, N., [Kernel Methods for Pattern Analysis], Cambridge University Press (2004).
[11] Camps-Valls, G. and Bruzzone, L., [Kernel Methods for Remote Sensing Data Analysis], J. Wiley & Sons (2009).
[12] Camps-Valls, G., Gómez-Chova, L., Muñoz-Marí, J., Rojo-Álvarez, J. L., and Martínez-Ramón, M., “Kernel-based framework for multi-temporal and multi-source remote sensing data classification and change detection,” IEEE Trans. Geosci. Remote Sens. 46(6), 1822–1835 (2008).
[13] Camps-Valls, G. and Bruzzone, L., “Kernel-based methods for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens. 43(3), 1–12 (2005).
[14] Nemmour, H. and Chibani, Y., “Multiple support vector machines for land cover change detection: an application for mapping urban extensions,” ISPRS J. Photogramm. Remote Sens. 61, 125–133 (2006).
[15] Bovolo, F., Camps-Valls, G., and Bruzzone, L., “A support vector domain method for change detection in multitemporal images,” Pattern Recogn. Lett. 31(10), 1148–1154 (2010).
[16] Volpi, M., Tuia, D., Kanevski, M., Bovolo, F., and Bruzzone, L., “Supervised change detection in VHR images: a comparative analysis,” in [IEEE International Workshop on Machine Learning for Signal Processing], (2009).
[17] Girolami, M., “Mercer kernel-based clustering in feature space,” IEEE Trans. Neural Netw. 13(3), 780–784 (2002).
[18] Dhillon, I., Guan, Y., and Kulis, B., “A unified view of kernel k-means, spectral clustering and graph cuts,” Tech. Rep. TR-04-25, University of Texas at Austin, Department of Computer Science (2005).
[19] MacQueen, J., “Some methods for classification and analysis of multivariate observations,” in [Proc. 5th Berkeley Symp. on Math. Statist. and Prob.], 281–297 (1967).
[20] Bovolo, F., Bruzzone, L., and Marconcini, M., “A novel approach to unsupervised change detection based on a semisupervised SVM and a similarity measure,” IEEE Trans. Geosci. Remote Sens. 46(7), 2070–2082 (2008).
[21] Bovolo, F., Bruzzone, L., and Marchesi, S., “Analysis of the effects of registration noise in multitemporal VHR images,” in [ESA-EUSC, ESRIN], (2008).