Classification of multi-spectral satellite image data using improved NRBF neural networks

Xiaoli Tao* and Howard E. Michelξ
University of Massachusetts Dartmouth, Dartmouth MA 02747

ABSTRACT

This paper describes a novel classification technique: an NRBF (Normalized Radial Basis Function) neural network classifier based on the spectral clustering method. The spectral method is used in the unsupervised learning stage of the NRBF neural network. Compared with other clustering methods commonly used in NRBF neural networks, such as K-Means, the spectral method avoids the local minima problem, so multiple restarts are not needed to obtain a good solution. The classifier was tested on multi-spectral satellite image data of New England acquired by the Landsat 7 ETM+ sensor. Classification results show that this new neural network model is more accurate and robust than the conventional RBF model. Furthermore, we analyze how the number of hidden units affects training and testing accuracy. These results suggest that the new model may be an effective method for the classification of multi-spectral satellite image data.

Keywords: Normalized Radial Basis Function (NRBF), spectral clustering method, linear least squares, classification, satellite image processing
1. INTRODUCTION

An efficient technique for improving the classification accuracy of multi-spectral satellite image data is essential for producing reliable products that supply sufficient information for both environmental protection and natural resource development. Methods for classifying a multi-spectral satellite image fall into two main categories: unsupervised classification and supervised classification. Unsupervised methods have no training process and can therefore produce classification results directly and quickly, but the results are not always as good as expected. Supervised methods, on the other hand, use available samples to train the proposed models, and their classification results are usually superior to those obtained with unsupervised methods. Many artificial neural network models have been applied to the classification of remote sensing data since the end of the 1980's. Among these models, the MLP (Multi-Layer Perceptron) has been the most popular. However, remote sensing data always involve many samples (due to the multiple bands), and the MLP is computationally inefficient because of its multiple layers and back-propagation training. Compared with the MLP, an RBF neural network has only a single hidden layer, which greatly reduces the computational complexity 1. In this paper, we propose an efficient supervised classification method, an NRBF (Normalized RBF) neural network based on the spectral clustering method, to process satellite image data of the New England area.

Radial basis functions were first introduced for the solution of real interpolation problems. The early work on this subject is surveyed by Powell 2. However, their roots are entrenched in much older pattern recognition techniques 3. Radial basis function neural networks have been applied in many research fields since they were proposed, especially in pattern recognition, function approximation, and time series prediction 4,5. In our experiments, we try different band combinations to explore the effect of each band on the ability of the NRBF network to perform classification. Moreover, we compare the results of the NRBF network based on the spectral method with those of one based on the K-Means method. The experimental results show that the former is superior in most cases. In order to set up a more efficient model, we compare the results for different numbers of hidden units in the NRBF neural network and find the optimal number according to both the training and testing samples. Finally, we extract one part of the original image with the help of the MultiSpecW32 software, based on the combination of bands 2, 3, and 4. Classification is performed on the extracted part by the proposed models, and the results are compared and analyzed.

* [email protected]; phone 401-273-7692; FAX 508-999-8489; Electrical and Computer Engineering Department, University of Massachusetts Dartmouth, 285 Old Westport Road, North Dartmouth, MA 02747
ξ [email protected]; phone 508-910-6465; FAX 508-999-8489; Electrical and Computer Engineering Department, University of Massachusetts Dartmouth, 285 Old Westport Road, North Dartmouth, MA 02747
2. CLASSIFICATION METHOD

RBF neural networks have a strong biological motivation: neurons in the cerebral cortex respond to stimuli through locally tuned, overlapping receptive fields. Based on this characteristic, Moody and Darken 6,7 proposed a new neural network structure, referred to as the RBF neural network. Figure 1 shows the basic topological structure of an RBF neural network.
Figure 1: Topological Structure of RBF Neural Networks
2.1 RBF neural network

In an RBF neural network, radial basis functions are embedded in a two-layer feed-forward neural network. The network has a set of inputs and a set of outputs; between them there is a layer of processing units referred to as hidden units, each implemented with a radial basis function. The nodes of the hidden layer generate a local response to the input through the radial basis functions, and the output layer of the network forms a linear weighted combination of the outputs of the hidden basis functions. There is a large class of radial basis functions; the following are of particular interest in the study of RBF networks:

1. Multiquadrics: $\phi(r) = (r^2 + c^2)^{1/2}$ for some $c > 0$ and $r \in \mathbb{R}$

2. Inverse multiquadrics: $\phi(r) = \dfrac{1}{(r^2 + c^2)^{1/2}}$ for some $c > 0$ and $r \in \mathbb{R}$

3. Gaussian functions: $\phi(r) = \exp\!\left(-\dfrac{r^2}{2\sigma^2}\right)$ for some $\sigma > 0$ and $r \in \mathbb{R}$
In pattern classification applications the Gaussian function is preferred because it is a localized basis function, meaning that $\phi \to 0$ as $r \to \infty$. In this paper, Gaussian functions are used as the radial basis functions of the hidden layer. Based on the RBF network structure and the chosen radial basis function, if $f_l(X_j)$ is the $l$th output of the output layer and $\phi_i(X_j)$ is the output of the $i$th radial basis function, then the whole network forms the mapping

$$f_l(X_j) = \sum_{i=1}^{N_r} \lambda_{il}\,\phi_i(X_j),$$   (1)

where $X_j$ is an M-dimensional feature vector, $Y_j$ is the actual output vector corresponding to the input vector $X_j$, $N_r$ is the number of hidden units, and $\lambda_{il}$ is the connection weight between the $i$th hidden unit and the $l$th output unit. This weight shows the contribution of the hidden unit to the corresponding output unit.

2.2 NRBF neural network

In some approaches, the output of the hidden layer is normalized by the sum of all the radial basis function components, just as in the Gaussian-mixture estimation model 8. The following equation shows the normalized radial basis functions:

$$\Phi_i(X) = \frac{\phi_i(\|X - c_i\|)}{\sum_{k=1}^{N_r} \phi_k(\|X - c_k\|)}$$   (2)
A network obtained using the normalized form of the basis functions is called a normalized RBF (NRBF) neural network and has some important properties. The normalization bounds the hidden outputs between 0 and 1, so that in classification applications they can be interpreted as probability values indicating which hidden unit is most activated 10. Moreover, the normalization changes the localized behavior of the basis functions into non-localized behavior, which provides a sensible decision for every input vector. Using the normalized form, let $f_l(X_j)$ be the $l$th output of the output layer and $\Phi_i(X_j)$ the normalized radial basis functions. Equation (1) then becomes

$$f_l(X_j) = \sum_{i=1}^{N_r} \lambda_{il}\,\Phi_i(X_j)$$   (3)
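As a concrete illustration, the following minimal sketch in Python with NumPy computes the normalized hidden outputs of equation (2) and the network outputs of equation (3) for a batch of inputs. The function name, array shapes, and batch convention are our own assumptions for illustration, not from the original implementation.

import numpy as np

def nrbf_forward(X, centers, widths, W):
    """Forward pass of a normalized RBF network.

    X       : (P, M)  input vectors, one sample per row
    centers : (Nr, M) basis-function centers c_i
    widths  : (Nr,)   basis-function widths sigma_i
    W       : (K, Nr) output-layer weights lambda_il
    Returns (P, K) network outputs f_l(X_j).
    """
    # Squared distances ||X_j - c_i||^2, shape (P, Nr)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Gaussian basis functions phi_i(X_j)
    phi = np.exp(-d2 / (2.0 * widths[None, :] ** 2))
    # Normalized basis functions Phi_i(X_j), equation (2)
    Phi = phi / phi.sum(axis=1, keepdims=True)
    # Linear output layer, equation (3)
    return Phi @ W.T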
2.3 NRBF neural network learning

From the above description, we know that RBF neural networks realize a nonlinear mapping $\mathbb{R}^n \to \mathbb{R}^n$ through a linear combination of nonlinear basis functions. If we fix the number of hidden units, two sets of parameters need to be determined: one is the centers $c_i$ and widths $\sigma_i$ of each basis function; the other is the connection weights $W$ between the output layer and the hidden layer. Consequently, learning in the NRBF neural network can be divided into two separate stages. Once $c_i$ and $\sigma_i$ are known, the output is obtained simply through the linear weighted combination of the hidden layer outputs; thus estimating the centers $c_i$ and widths $\sigma_i$ is the key to constructing RBF neural networks. The K-Means clustering method, which is simple and fast, has been the most popular method for estimating the centers. However, the clustering result of K-Means is very sensitive to its initial estimate, and convergence often occurs at local minima, so multiple restarts are usually needed to obtain an acceptable clustering solution. Here, we consider a new clustering approach, a spectral clustering method, for estimating the centers and widths in order to deal with these problems.

2.3.1 Spectral clustering method for centers and widths

Spectral methods have emerged as a promising alternative for clustering in many areas 11,12. Compared with general clustering methods such as K-Means or MLE (Maximum Likelihood Estimation), spectral methods avoid the local minima problem and, to some degree, find a globally optimal solution; hence, multiple restarts are not required. The key idea of the spectral method is to reduce the input space by transforming the input data set into a set of orthogonal eigenvectors derived from the distances between the input data. Spectral clustering has been applied in many areas, such as image processing and the initialization of cluster centers 12,13. The key steps of the spectral clustering algorithm 13 are as follows (a code sketch is given after the list):

1. The input data set $S = \{s_1, s_2, \ldots, s_P\}$ is given in a high-dimensional space. The aim is to cluster it into $N_r$ subsets.

2. Compute the Euclidean distance between every pair of distinct points and form the distance matrix $D \in \mathbb{R}^{P \times P}$ defined by $D_{ij} = \|s_i - s_j\|$.

3. Form the affinity matrix $A \in \mathbb{R}^{P \times P}$ from the distance matrix $D$, defined by $A_{ij} = \exp(-D_{ij}^2 / 2\sigma^2)$ if $i \neq j$ and $A_{ij} = 0$ if $i = j$. The free parameter $\sigma^2$ controls the rate at which the affinity falls off with distance.

4. Define the diagonal matrix $M$ whose $(i, i)$ element is the sum of the $i$th row of $A$, and construct the matrix $L = M^{-1/2} A M^{-1/2}$.

5. Find the $N_r$ largest orthogonal eigenvectors of $L$ and form the eigenvector matrix $X \in \mathbb{R}^{P \times N_r}$ by stacking them in columns.

6. Each row of $X$ is now a point in $\mathbb{R}^{N_r}$. Normalize each row of $X$ to unit length and cluster the rows of the normalized matrix into $N_r$ clusters via K-Means.

7. Each original input point $s_i$ belongs to cluster $j$ if and only if row $i$ of the normalized matrix $X$ belongs to cluster $j$.
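A compact NumPy sketch of these seven steps follows. It assumes scikit-learn is available for the K-Means step, and the default value of the free parameter sigma2 is an illustrative assumption; this is a sketch, not the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(S, n_clusters, sigma2=1.0):
    """Cluster the rows of S (P points in R^M) into n_clusters subsets
    following steps 1-7 above."""
    # Step 2: pairwise Euclidean distance matrix D (P x P)
    diff = S[:, None, :] - S[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    # Step 3: affinity matrix A with zero diagonal
    A = np.exp(-D ** 2 / (2.0 * sigma2))
    np.fill_diagonal(A, 0.0)
    # Step 4: L = M^{-1/2} A M^{-1/2}, with M the diagonal row-sum matrix
    m_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * m_inv_sqrt[:, None] * m_inv_sqrt[None, :]
    # Step 5: eigenvectors of the n_clusters largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    X = eigvecs[:, -n_clusters:]           # P x n_clusters
    # Step 6: normalize rows to unit length and run K-Means on them
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    # Step 7: each input point inherits the label of its row
    return labels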
Compared with the K-Means method, the spectral method considers the "connectedness" of the data, while the K-Means method considers their "compactness". After the spectral clustering, we know which cluster each input point belongs to and how many points belong to each cluster. The next step is to obtain the parameters of the radial basis functions from the clustering result. As mentioned above, we use the Gaussian function as the radial basis function in the model. In most cases, the input data lie in a high-dimensional space. For simplicity, we assume that each dimension is independent of the others and that the variance is the same over all dimensions; equivalently, the Gaussian of each hidden unit is a hyper-sphere. We can use maximum likelihood techniques to obtain the centers and widths of the Gaussian functions 14; the detailed procedure is given in the appendix. Here, we just show the resulting estimates of the center and width of each cluster:

$$c_i = \sum_{n=1}^{N} S_{in} / N$$

$$\sigma_i^2 = \frac{1}{NM} \sum_{n=1}^{N} \|S_{in} - c_i\|^2$$

where $S_{in}$ is the $n$th data point belonging to cluster $i$, $c_i$ is the center of cluster $i$, $\sigma_i$ is the width of cluster $i$, $N$ is the number of data points belonging to cluster $i$, and $M$ is the number of dimensions of the input vector.

2.3.2 Linear least squares method for the output layer weights

The estimation process above yields the parameters of the radial basis functions: the centers and widths. For the training data, the output of the hidden layer corresponding to each sample is then known, so the remaining work can be viewed as training a simple single-layer network whose output is a linear combination of its input. The weights are estimated to minimize the following sum-squared error function:

$$R = \frac{1}{2} \sum_{j=1}^{P} (Y_j - f(X_j))^T (Y_j - f(X_j))$$   (4)
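Given the cluster labels, the closed-form center and width estimates above reduce to a few lines of NumPy. This is a sketch under the same hyper-spherical Gaussian assumption; the function and variable names are ours.

import numpy as np

def estimate_rbf_params(S, labels, n_clusters):
    """Estimate Gaussian centers c_i and widths sigma_i per cluster.

    S      : (P, M) input data
    labels : (P,)   cluster index of each point (e.g. from spectral clustering)
    Returns centers (n_clusters, M) and widths (n_clusters,).
    """
    M = S.shape[1]
    centers = np.empty((n_clusters, M))
    widths = np.empty(n_clusters)
    for i in range(n_clusters):
        Si = S[labels == i]              # the N points in cluster i
        N = len(Si)
        centers[i] = Si.mean(axis=0)     # c_i = sum_n S_in / N
        # sigma_i^2 = (1 / (N * M)) * sum_n ||S_in - c_i||^2
        widths[i] = np.sqrt(((Si - centers[i]) ** 2).sum() / (N * M))
    return centers, widths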
where $P$ is the number of samples. Here, we apply the linear least squares (LLS) method to obtain the weights instead of the gradient descent method. In our notation, the number of features of each sample is $M$, the number of output units corresponding to each sample is $K$, and the number of RBF units of the neural network is $N_r$. If $W \in \mathbb{R}^{K \times N_r}$ is the matrix of all connection weights, $\Phi \in \mathbb{R}^{N_r \times P}$ is the hidden layer output matrix, and $Y = (Y_1, Y_2, \ldots, Y_P)^T \in \mathbb{R}^{K \times P}$, the LLS method yields the least-squares solution for the weights if $\Phi$ has full rank 9,15,16:

$$W^T = \Phi^+ Y,$$

where $\Phi^+ = (\Phi^T \Phi)^{-1} \Phi^T$ is the pseudo-inverse of $\Phi$. If $\Phi$ is not of full rank but rank-deficient, singular value decomposition must be used instead. In practice, the input patterns are sufficiently numerous and varied that $\Phi$ almost always has full rank. Compared with the gradient descent method, this method gives an exact solution and needs no iterative optimization. We have now described the whole training procedure for an NRBF neural network with a fixed number of hidden units; how to choose the number of hidden units remains a difficult problem. In the experiments, we try different numbers of hidden units and find the optimal one to be used in the final model.
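The least-squares solution for the output weights can be obtained in one call. In this sketch, Phi holds the normalized hidden outputs with one training sample per column, matching the notation above; the shape conventions are our assumption.

import numpy as np

def solve_output_weights(Phi, Y):
    """Least-squares output weights for the NRBF network.

    Phi : (Nr, P) normalized hidden outputs, one training sample per column
    Y   : (K, P)  target output vectors, one sample per column
    Returns W (K, Nr) minimizing the sum-squared error of equation (4).
    """
    # lstsq solves Phi.T @ W.T = Y.T via SVD, so it also covers the
    # rank-deficient case mentioned above.
    W_T, *_ = np.linalg.lstsq(Phi.T, Y.T, rcond=None)
    return W_T.T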
3. MULTI-CLASS CLASSIFICATION

The data processing involves multi-class classification, for which many efficient methods have been recommended 17. In our NRBF neural network approach, the output of each sample must be initialized. Here, we adopt a simple but efficient encoding for each class: if all the training and testing samples are to be separated into $K$ classes, then the dimension of the output vector is also defined as $K$. The output vector is a unit vector with one entry equal to "1" and the others equal to "0", where the "1" indicates the class of the input. After the neural network is trained according to the above methods, the output for each testing sample can be computed. Finally, the following decision function determines which class the testing sample belongs to:

$$f_l(X_j) = \begin{cases} 1 & \text{if } f_l(X_j) = \max_{1 \le i \le K} f_i(X_j) \\ 0 & \text{otherwise} \end{cases}$$

For example, if the output is (0.9, 0.3, 0.4, 0.5, 0.7), then the decided output of this sample is (1, 0, 0, 0, 0), and the sample belongs to the first class.
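In code, this decision rule is a single argmax over the network outputs (a sketch; the function name is ours):

import numpy as np

def decide(outputs):
    """Convert raw network outputs (P, K) to one-hot class decisions."""
    one_hot = np.zeros_like(outputs)
    one_hot[np.arange(len(outputs)), outputs.argmax(axis=1)] = 1.0
    return one_hot

# Example from the text: (0.9, 0.3, 0.4, 0.5, 0.7) -> (1, 0, 0, 0, 0)
print(decide(np.array([[0.9, 0.3, 0.4, 0.5, 0.7]])))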
4. EXPERIMENTAL RESULTS

Landsat TM records data in seven bandwidths covering the visible, infrared, and thermal infrared spectral regions. Table 1 gives the band descriptions.

Band   Wavelength (µm)   Resolution (m)   Spectral Region
1      0.45-0.52         30               Visible Blue
2      0.52-0.60         30               Visible Green
3      0.63-0.69         30               Visible Red
4      0.76-0.90         30               Reflective Infrared
5      1.55-1.75         30               Mid-Infrared
6      10.40-12.50       60               Thermal Infrared
7      2.08-2.35         30               Mid-Infrared

Table 1 TM band description
Since 1972, satellites have provided high-resolution multispectral imagery. The TM bands were selected to maximize their capabilities for detecting different types of Earth resources: TM band 1 can penetrate water and map coastal areas, and is useful for analyzing soil-vegetation differentiation and for distinguishing forest types; TM band 2 detects the green reflectance of healthy vegetation and can be used for vegetation discrimination; TM band 3 is designed for vegetation discrimination through the detection of chlorophyll absorption; TM band 4 captures the near-IR reflectance peak of healthy green vegetation and aids crop identification and the contrast between land and water; the two mid-IR bands on TM (bands 5 and 7) are useful for vegetation and soil moisture studies and for discriminating between rock and mineral types; and the thermal-IR band on TM (band 6) is designed for the discrimination of rock formations 18.

The study area used in this experiment is a satellite image of New England, acquired by the Landsat 7 ETM+ sensor on July 7th, 1999. The data set has 6600*6000 pixels for bands 1-7, except band 6, which has 3300*3000 pixels because of its different resolution. In the following experiments, we use all of the bands except band 6. We extracted the training and testing samples from the data set based on remote sensing expert knowledge 18. The number of training samples used is 520; the number of testing samples used is 448. In order to avoid over-fitting, these samples were extracted from different positions for each cluster. Following the encoding defined for multi-class classification, the output of each class is listed in the following table:

Class                       Output
Vegetation                  1 0 0 0 0
Soil (sparse vegetation)    0 1 0 0 0
Sand                        0 0 1 0 0
Deep water                  0 0 0 1 0
Shallow water               0 0 0 0 1

Table 2 Output vector of each class
The number of input units is equal to the number of bands used for training. In order to bound each input in the range from 0 to 1, we normalize the inputs by the following formula:

$$X_{jk} = \frac{X_{jk} - \min_{1 \le i \le P} X_{ik}}{\max_{1 \le i \le P} X_{ik} - \min_{1 \le i \le P} X_{ik}}$$
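A one-function NumPy equivalent of this per-band min-max scaling follows (a sketch; it assumes the sample matrix has one pixel per row and one band per column):

import numpy as np

def normalize_bands(X):
    """Scale each feature (band) of X (P, M) into [0, 1], column by column."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    return (X - lo) / (hi - lo)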
We compare the results of the spectral clustering method with those of the K-Means clustering method for the unsupervised stage of the RBF neural network. Both the training accuracy and the testing accuracy are listed in Table 3, where each row corresponds to one of eighteen band combinations drawn from bands 1, 2, 3, 4, 5, and 7; the different combinations show the role of each band in the classification. In this experiment, the number of hidden units was fixed at 20.

                  Spectral Accuracy (%)     K-Means Accuracy (%)
Combination       Training    Testing       Training    Testing
1                 94.35       90.47         Unclear     Unclear
2                 96.66       88.46         97.56       89.13
3                 98.52       97.61         98.65       95.16
4                 97.11       92.18         97.50       91.44
5                 98.65       96.72         99.55       97.99
6                 98.71       97.09         98.52       95.98
7                 94.94       95.24         95.06       95.16
8                 97.75       96.80         97.94       93.22
9                 Unclear     Unclear       Unclear     Unclear
10                98.71       97.71         98.14       95.46
11                99.10       97.47         98.52       96.72
12                97.43       91.81         97.50       91.96
13                95.19       90.40         93.97       88.54
14                98.26       96.42         98.39       94.27
15                Unclear     Unclear       Unclear     Unclear
16                98.46       92.21         Unclear     Unclear
17                98.14       94.71         98.26       93.37
18                98.39       95.38         97.43       93.97

Table 3 Accuracy of the different band combinations (number of hidden units = 20)
Here, the overall accuracy is considered, defined as

$$\text{Accuracy\%} = \frac{\text{Number of samples classified correctly}}{\text{Total number of samples}} \times 100\%$$

and used in both the training and testing stages. According to the experimental results, for some combinations, such as bands 2 and 3 and bands 3, 4, 5, and 7, the new NRBF model classifies very well while the old NRBF model based on K-Means fails. This suggests that the new model relaxes the requirements on the input features. For some other combinations, such as bands 4, 5, and 7 and bands 2, 5, and 7, both the new model and the old model fail, with some classes misclassified as others. In general, for most cases, the classification accuracy of the new model is better than that of the old model.

Next, we try to find the optimal number of hidden units for the NRBF network based on the spectral method. Figure 2 shows the accuracy obtained with different numbers of hidden neurons in the NRBF neural network based on spectral clustering. In this experiment, we use bands 2, 3, and 4.
Figure 2: Accuracy (%) vs. number of hidden units ('o' training, '*' testing)
From figure 2, it can be seen that the training accuracy increases with the number of hidden units, while the testing accuracy essentially peaks when the number of hidden units is 20. This means that testing accuracy is not proportional to the number of hidden units, since over-fitting occurs during training. We also found that if the number of hidden units is less than 8, the NRBF neural network fails to classify the regions, because the number of hidden units is too small. In this experiment, we tried different numbers of hidden units to find the optimal one, which is time consuming; in future work, we will investigate a more systematic approach to this issue.

In order to compare the classification ability of the new model with that of the old one more intuitively, we show the classified result as an image. Typically, with the help of the MultiSpecW32 software, TM bands 4, 3, and 2 can be combined to make false-color composite images in which band 4 is displayed as red, band 3 as green, and band 2 as blue. With this band combination, vegetation appears as shades of red, with brighter reds indicating more vigorously growing vegetation; sparse vegetation appears in shades of green or brown depending on moisture and organic matter content; urban areas appear blue-gray; and deep, clear water is dark blue to black, while sediment-laden or shallow waters appear lighter. Using this method, we generate the satellite image of New England. Since the whole data set is too large, we extract only part of the image. Here, we use a gray scale to indicate the different colors. Since not every possible class is listed, some areas are assigned to the most likely of the available classes. For clearer comparison, the gray level chosen in figure 4(c) for each class is close to that in the generated image. Figure 3 is the original generated image, figure 4(a) is the classified result of the new model, and figure 4(b) is the classified result of the old model.
Comparing these figures, it is easy to see that figure 4(a) resembles figure 3 more closely than figure 4(b) does, and that in figure 4(b) some samples of the vegetation and urban land classes are misclassified as deep water.
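The 4-3-2 false-color composite described above can be produced by stacking the stretched bands into an RGB array. This is a sketch only; it assumes the three band rasters are already loaded as 2-D NumPy arrays, and uses a simple min-max stretch rather than MultiSpecW32's display enhancement.

import numpy as np

def false_color_432(band4, band3, band2):
    """Build a 4-3-2 false-color composite: band 4 -> red,
    band 3 -> green, band 2 -> blue. Each band is a 2-D array."""
    def stretch(b):
        b = b.astype(float)
        return (b - b.min()) / (b.max() - b.min())
    return np.dstack([stretch(band4), stretch(band3), stretch(band2)])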
Figure 3: Part of the New England image
Figure 4: (a) Classification result of the NRBF neural network using the spectral method; (b) classification result of the NRBF neural network using the K-Means method; (c) gray level for each cluster in the classification result (vegetation, sparse land, urban land, shallow water, deep water)
5. CONCLUSION AND FUTURE WORK

We have demonstrated that an NRBF neural network optimized with the spectral clustering method is an appropriate model for remote sensing data, since it not only keeps the good properties of RBF neural networks but also avoids the local minima problem of traditional RBF training. The results on the New England satellite image data show that it improves classification accuracy. There remain, however, several issues to be improved and discussed further, for example, how to choose the number of hidden units. Also, in this paper we only considered the accuracy of different band combinations; in future work, we will analyze the various band combinations to test the relationship between each class and each band and check it against the band descriptions. Finally, since the spectral method involves building and decomposing a matrix over all pairs of input points, which is computationally expensive when the number of input points is large, efficient approaches to this problem need to be explored.
ACKNOWLEDGEMENTS

This work was supported in part by the National Oceanic and Atmospheric Administration under grant NA16EC2374.
REFERENCES

1. Haykin, S. Neural Networks: A Comprehensive Foundation. IEEE Computer Society Press, 1994.
2. Powell, M. J. D. "Radial basis functions for multivariable interpolation: A review," IMA Conference on Algorithms for the Approximation of Functions and Data, pp. 143-167, RMCS, Shrivenham, England, 1985.
3. Tou, J. T. and Gonzalez, R. C. Pattern Recognition Principles. Reading, MA: Addison-Wesley, 1974.
4. Matej, S. and Lewitt, R. M. "Practical considerations for 3-D image reconstruction using spherically symmetric volume elements," IEEE Trans. on Medical Imaging, vol. 15, no. 1, pp. 68-78, 1996.
5. Casdagli, M. "Nonlinear prediction of chaotic time series," Physica D, vol. 35, pp. 335-356, 1989.
6. Moody, J. and Darken, C. "Fast learning in networks of locally-tuned processing units," Neural Computation, vol. 1, pp. 281-294, 1989.
7. Moody, J. and Darken, C. "Learning with localized receptive fields," Proc. 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski (Eds.), Carnegie Mellon University, Morgan Kaufmann Publishers, 1988.
8. Cha, I. and Kassam, S. A. "RBFN restoration of nonlinearly degraded images," IEEE Trans. on Image Processing, vol. 5, no. 6, pp. 964-975, 1996.
9. Bishop, C. M. Neural Networks for Pattern Recognition. Oxford University Press, New York, 1995.
10. Howlett, R. J. and Jain, L. C. (Eds.) Radial Basis Function Networks 1/2. Physica-Verlag, Heidelberg, 2001.
11. Ding, C. H. Q., He, X., and Zha, H. "A spectral method to separate disconnected and nearly-disconnected Web graph components," KDD'01, San Francisco, California, USA, 2001.
12. Brew, C. and Schulte im Walde, S. "Spectral clustering for German verbs," Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA, July 2002.
13. Ng, A. Y., Jordan, M. I., and Weiss, Y. "On spectral clustering: Analysis and an algorithm," in T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14, Cambridge, MA: MIT Press, 2002.
14. Redner, R. A. and Walker, H. F. "Mixture densities, maximum likelihood and the EM algorithm," SIAM Review 26(2), 195-239, 1984.
15. Golub, G. and Kahan, W. "Calculating the singular values and pseudo-inverse of a matrix," SIAM Journal on Numerical Analysis, Series B, 2(2), 205-224, 1965.
16. Rao, C. R. and Mitra, S. K. Generalized Inverse of Matrices and Its Applications. New York: John Wiley, 1971.
17. Mayoraz, E. and Alpaydin, E. "Support vector machines for multiclass classification," Proceedings of the International Workshop on Artificial Neural Networks (IWANN99), IDIAP Technical Report 98-06, 1999.
18. Schowengerdt, R. A. Remote Sensing: Models and Methods for Image Processing. Academic Press, 1997.
Appendix

Here, we use the MLE (Maximum Likelihood Estimation) approach to show how the parameters (mean and covariance matrix) of a multivariate normal function are estimated from the given data. Assume the data $x_1, x_2, \ldots, x_N \in \mathbb{R}^M$ are i.i.d. (independently and identically distributed) and modeled by an isotropic M-dimensional Gaussian distribution $N(\mu, \sigma^2 I_M)$, where $\mu$ and $\sigma^2 I_M$ are the mean and covariance matrix. The joint probability can be expressed as

$$p(x_1, x_2, \ldots, x_N) = \prod_{n=1}^{N} p(x_n) = \prod_{n=1}^{N} \frac{1}{(\sqrt{2\pi}\,\sigma)^M} \exp\!\left(-\frac{\|x_n - \mu\|^2}{2\sigma^2}\right)$$

To simplify the calculation, let

$$E(\mu, \sigma) = -\log p(x_1, x_2, \ldots, x_N) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \|x_n - \mu\|^2 + MN \log(\sqrt{2\pi}\,\sigma)$$

Maximizing the joint probability is equivalent to minimizing $E(\mu, \sigma)$. Setting the derivatives of $E(\mu, \sigma)$ with respect to $\mu$ and $\sigma$ to zero, we obtain the closed-form solutions:

$$\frac{\partial E}{\partial \mu} = 0 \;\Rightarrow\; \frac{1}{2\sigma^2} \sum_{n=1}^{N} 2(\mu - x_n) = 0 \;\Rightarrow\; \mu = \sum_{n=1}^{N} x_n / N$$

$$\frac{\partial E}{\partial \sigma} = 0 \;\Rightarrow\; -\sum_{n=1}^{N} \|x_n - \mu\|^2 / \sigma^3 + \frac{MN}{\sigma} = 0 \;\Rightarrow\; \sigma^2 = \frac{1}{MN} \sum_{n=1}^{N} \|x_n - \mu\|^2$$
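These closed-form solutions can be checked numerically. The following sketch draws synthetic isotropic Gaussian data (the parameter values are arbitrary) and recovers the generating mean and width:

import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 6
x = rng.normal(loc=2.0, scale=0.5, size=(N, M))   # synthetic i.i.d. samples

mu = x.sum(axis=0) / N                             # mu = sum_n x_n / N
sigma2 = ((x - mu) ** 2).sum() / (M * N)           # sigma^2 = sum_n ||x_n - mu||^2 / (MN)
print(mu, np.sqrt(sigma2))                         # approx. 2.0 in each dimension, and 0.5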