Neural Networks 15 (2002) 1059–1068 www.elsevier.com/locate/neunet
2002 Special Issue
Generalized relevance learning vector quantization

Barbara Hammer a,*, Thomas Villmann b

a Department of Mathematics and Computer Science, University of Osnabrück, Albrechtstraße 28, 49069 Osnabrück, Germany
b Clinic for Psychotherapy and Psychosomatic Medicine, University of Leipzig, Karl-Tauchnitz-Straße 25, 04107 Leipzig, Germany

* Corresponding author. Tel.: +49-541-969-2488; fax: +49-541-9692770. E-mail address: [email protected] (B. Hammer).
Abstract

We propose a new scheme for enlarging generalized learning vector quantization (GLVQ) with weighting factors for the input dimensions. The factors allow an appropriate scaling of the input dimensions according to their relevance. They are adapted automatically during training according to the specific classification task, whereby training can be interpreted as stochastic gradient descent on an appropriate error function. This method leads to a more powerful classifier and to an adaptive metric with little extra cost compared to standard GLVQ. Moreover, the size of the weighting factors indicates the relevance of the input dimensions. This suggests a scheme for automatically pruning irrelevant input dimensions. The algorithm is verified on artificial data sets and the Iris data from the UCI repository. Afterwards, the method is compared to several well known algorithms which determine the intrinsic data dimension on real world satellite image data. © 2002 Elsevier Science Ltd. All rights reserved.

Keywords: Clustering; Learning vector quantization; Adaptive metric; Relevance determination
1. Introduction

Self-organizing methods such as the self-organizing map (SOM) or vector quantization (VQ) as introduced by Kohonen provide a successful and intuitive method of processing data for easy access (Kohonen, 1995). Assuming data are labeled, an automatic clustering can be learned via attaching maps to the SOM or enlarging VQ with a supervised component to so-called learning vector quantization (LVQ) (Kohonen, 1997; Meyering & Ritter, 1992). Various modifications of LVQ exist which ensure faster convergence, a better adaptation of the receptive fields to the optimum Bayesian decision, or an adaptation to complex data structures, to name just a few (Kohonen, 1997; Sato & Yamada, 1995; Somervuo & Kohonen, 1999). A common feature of unsupervised algorithms and LVQ is the fact that information is provided by the distance structure between the data points, which is determined by the chosen metric. Learning heavily relies on the commonly used Euclidean metric and hence crucially depends on the fact that the Euclidean metric is appropriate for the respective learning task. Therefore, data have to be preprocessed and scaled appropriately such that the input dimensions have approximately the same importance for the classification. In particular, the important features for the respective problem have to be identified, which is usually done by experts or with rules of thumb. Of course, this may be time consuming and requires prior knowledge which is often not available. Hence, methods have been proposed which adapt the metric during training. Distinction sensitive LVQ (DSLVQ), as an example, automatically determines weighting factors for the input dimensions of the training data (Pregenzer, Pfurtscheller, & Flotzinger, 1996). The algorithm adapts LVQ3 for the weighting factors according to plausible heuristics. The approaches in Kaski, Sinkkonen, and Peltonen (2001) and Sinkkonen and Kaski (2002) enhance unsupervised clustering algorithms by the possibility of integrating auxiliary information such as a labeling into the metric structure. Alternatively, one could use information geometric methods in order to adapt the metric, as in Hofmann (2000). Concerning the SOM, another major problem consists in finding an appropriate topology of the initial lattice of prototypes such that the prior topology of the neural architecture mirrors the intrinsic topology of the data. Hence various heuristics exist to measure the degree of topology preservation, to adapt the topology to the data, to define the lattice a posteriori, or to evolve structures which are appropriate for real world data (Bauer & Villmann, 1997; Fritzke, 1995; Martinetz & Schulten, 1993; Ritter, 1999; Villmann, Der, Herrmann, & Martinetz, 1997). In all tasks, the intrinsic dimensionality of the data plays a crucial role since it determines an important aspect of the optimum
neural network: the topological structure, i.e. the lattice in the case of the SOM. Moreover, superfluous data dimensions slow down training for LVQ as well. They may even cause a decrease in accuracy since they add possibly noisy or misleading terms to the Euclidean metric on which LVQ is based. Hence a data dimension as small as possible is desirable in general for the above mentioned methods, for the sake of efficiency, accuracy, and simplicity of neural network processing. Therefore various algorithms exist which allow one to estimate the intrinsic dimension of the data: PCA and ICA constitute well-established methods which are often used for adequate preprocessing of data and which can be implemented with neural methods (Hyvärinen & Oja, 1997; Oja, 1995). A Grassberger–Procaccia analysis estimates the dimensionality of attractors in a dynamical system (Grassberger & Procaccia, 1983). SOMs which adapt the dimensionality of the lattice during training, like the growing SOM (GSOM), automatically determine the approximate dimensionality of the data (Bauer & Villmann, 1997). Naturally, all adaptation schemes which determine weighting factors or relevance terms for the input dimensions constitute an alternative method for determining the dimensionality: the dimensions which are ranked as least important, i.e. those possessing the smallest relevance terms, can be dropped. The intrinsic dimensionality is reached when an appropriate quality measure, such as an error term, changes significantly. There exists a wide variety of input relevance determination methods in statistics and the field of supervised neural networks, e.g. pruning algorithms for feedforward networks as proposed in Grandvalet (2000), the application of automatic relevance determination for the support vector machine or Gaussian processes (van Gestel, Suykens, de Moor, & Vandewalle, 2001; Neal, 1996; Tipping, 2000), or adaptive ridge regression and the incorporation of penalizing functions as proposed in Grandvalet (1998), Roth (2001), and Tibshirani (1996). However, note that our focus lies on improving metric based algorithms via an adaptive metric which allows dimensionality reduction as a byproduct. The above mentioned methods do not yield a metric which could be used in self-organizing algorithms but primarily pursue the goal of sparsity and dimensionality reduction in neural network architectures or alternative classifiers. In the following, we will focus on LVQ since it combines the elegance of simple and intuitive updates in unsupervised algorithms with the accuracy of supervised methods. We will propose a possibility of automatically scaling the input dimensions and hence adapting the Euclidean metric to the specific training problem. As a byproduct, this leads to a pruning algorithm for irrelevant data dimensions and the possibility of computing the intrinsic data dimension. Approaches like Kaski (1998) clearly indicate that often a considerable reduction of the data dimension is possible without loss of information. The main idea of our approach is to introduce weighting factors for the data dimensions which are adapted automatically such that the classification
error becomes minimal. As in LVQ, the formulas are intuitive and can be interpreted as Hebbian learning. From a mathematical point of view, the dynamics constitute a stochastic gradient descent on an appropriate error surface. Small factors in the result indicate that the respective data dimension is irrelevant and can be pruned. This idea can be applied to any generalized LVQ (GLVQ) scheme as introduced in Sato and Yamada (1995) or to other plausible error measures such as the Kullback–Leibler divergence. With the error measure of GLVQ, a robust and efficient method results which can push the classification borders close to the optimum Bayesian decision. This method, generalized relevance LVQ (GRLVQ), generalizes relevance LVQ (RLVQ) (Bojer, Hammer, Schunk, & Tluk von Toschanowitz, 2001), which is based on simple Hebbian learning and leads to worse and unstable results in the case of noisy real life data. However, like RLVQ, GRLVQ has the advantage of an intuitive update rule and allows efficient input pruning compared to other approaches which adapt the metric to the data by involving additional transformations, as proposed in Gath and Geva (1989), Gustafson and Kessel (1979), and Tsay, Shyu, and Chang (1999), or which depend on less intuitive differentiable approximations of the original dynamics (Matecki, 1999). Moreover, it is based on a gradient dynamics, in contrast to heuristic methods like DSLVQ (Pregenzer et al., 1996). We will verify our method on various small data sets. Moreover, we will apply GRLVQ to classify a real life satellite image with approximately 3 million data points. As already mentioned, weighting factors allow us to approximately determine the intrinsic data dimensionality. An alternative method is the GSOM, which automatically adapts the lattice of neurons to the data and hence gives hints about the intrinsic dimensionality as well. We compare our GRLVQ experiments to the results provided by GSOM. In addition, we relate them to a Grassberger–Procaccia analysis. We obtain comparable results concerning the intrinsic dimensionality of our data. In the following, we will first introduce our method GRLVQ, present applications to simple artificial and real life data, and finally discuss the results for the satellite data.
2. The GRLVQ algorithm

Assume a finite training set $X = \{(x^i, y^i) \subset \mathbb{R}^n \times \{1,\dots,C\} \mid i = 1,\dots,m\}$ of training data is given and the clustering of the data into $C$ classes is to be learned. We denote the components of a vector $x \in \mathbb{R}^n$ by $(x_1,\dots,x_n)$ in the following. GLVQ chooses a fixed number of vectors in $\mathbb{R}^n$ for each class, the so-called prototypes. Denote the set of prototypes by $\{w^1,\dots,w^M\}$ and assign the label $c^i = c$ to $w^i$ iff $w^i$ belongs to the $c$th class, $c \in \{1,\dots,C\}$. The receptive field of $w^i$ is defined by $R^i = \{x \in X \mid \forall w^j : |x - w^i| \le |x - w^j|\}$.
The training algorithm adapts the prototypes $w^i$ such that for each class $c \in \{1,\dots,C\}$ the corresponding prototypes represent the class as accurately as possible. That means, the difference of the points belonging to the $c$th class, $\{x^i \in X \mid y^i = c\}$, and the union of the receptive fields of the corresponding prototypes, $\bigcup_{c^i = c} R^i$, should be as small as possible for each class. For a given data point $(x, y) \in X$ denote by $\mu(x)$ some function which is negative if $x$ is classified correctly, i.e. it belongs to a receptive field $R^i$ with $c^i = y$, and which is positive if $x$ is classified incorrectly, i.e. it belongs to a receptive field $R^i$ with $c^i \ne y$. Denote by $f : \mathbb{R} \to \mathbb{R}$ some monotonically increasing function. The general scheme of GLVQ consists in minimizing the error term

$$S = \sum_{i=1}^{m} f(\mu(x^i)) \qquad (1)$$
via a stochastic gradient descent. Given an example $(x^i, y^i)$, the update rule of LVQ2.1 is

$$w^J := w^J + \epsilon (x^i - w^J), \qquad w^K := w^K - \epsilon (x^i - w^K),$$

where $\epsilon \in (0,1)$ is the so-called learning rate, $w^J$ is the nearest correct prototype, and $w^K$ the nearest incorrect prototype. Usually, this update is only performed if the prototypes fall within a certain window around the decision border. This update can be obtained as a stochastic gradient descent on the error function (1) if we choose $\mu$ as $d_J - d_K$, $d_J$ and $d_K$ being the squared Euclidean distances of $x^i$ to the nearest correct or wrong prototype, respectively; $f$ is the identity restricted to the window of interest and 0 outside. The concrete choice of $f$ as the identity and $\mu(x^i) = \eta\, d_J$, with $d_J$ being the squared Euclidean distance of $x^i$ to the nearest prototype, say $w^J$, and $\eta = 1$ if $x$ is classified correctly, $\eta = -1$ if $x$ is classified incorrectly, would yield the standard LVQ update

$$w^J := \begin{cases} w^J + \epsilon (x^i - w^J) & \text{if } y^i = c^J, \\ w^J - \epsilon (x^i - w^J) & \text{otherwise,} \end{cases} \qquad (2)$$

where $\epsilon \in (0,1)$. Note that the condition on $\mu(x^i)$ of being negative iff $x^i$ is classified correctly is violated here. Consequently, the resulting error function is highly discontinuous. Hence the usefulness of this error function can be doubted, and the corresponding gradient descent method will likely show unstable behavior. The choice of $f$ as the sigmoidal function $\mathrm{sgd}(x) = (1 + \exp(-x))^{-1}$ and

$$\mu(x^i) = \frac{d_J - d_K}{d_J + d_K},$$

where $d_J$ is the squared Euclidean distance to the closest prototype labeled with $y^i$, say $w^J$, and $d_K$ is the squared Euclidean distance to the closest prototype labeled with a label different from $y^i$, say $w^K$, yields a particularly powerful and noise tolerant behavior, since it combines adaptation near the optimum Bayesian borders as in LVQ2.1 while prohibiting the possible divergence of LVQ2.1 as reported in Sato
and Yamada (1995). We refer to the update as GLVQ:

$$\Delta w^J := \epsilon\, \frac{\mathrm{sgd}'(\mu(x^i))\, d_K}{(d_J + d_K)^2} (x^i - w^J), \qquad \Delta w^K := -\epsilon\, \frac{\mathrm{sgd}'(\mu(x^i))\, d_J}{(d_J + d_K)^2} (x^i - w^K). \qquad (3)$$

Obviously, the success of GLVQ crucially depends on the fact that the Euclidean metric is appropriate for the data and that the input dimensions are approximately equally scaled and equally important. Here, we introduce input weights $\lambda = (\lambda_1,\dots,\lambda_n)$, $\lambda_i \ge 0$, in order to allow a different scaling of the input dimensions, hence making possibly time consuming preprocessing of the data superfluous. Substituting the Euclidean metric $\|x - y\|$ by its scaled variant

$$\|x - y\|_\lambda^2 = \sum_{i=1}^{n} \lambda_i (x_i - y_i)^2, \qquad (4)$$
the receptive field of prototype $w^i$ becomes $R^i_\lambda = \{x \in X \mid \forall w^j : \|x - w^i\|_\lambda \le \|x - w^j\|_\lambda\}$. Replacing $R^i$ by $R^i_\lambda$ in the error function $S$ in Eq. (1) yields a different weighting of the input dimensions and hence an adaptive metric. Appropriate weighting factors $\lambda$ can be determined automatically via a stochastic gradient descent as well. Hence the rule (2), into which the relevance factors $\lambda_i$ of the metric are integrated, is accompanied by the update

$$\lambda_m := \begin{cases} \lambda_m - \epsilon_1 (x^i_m - w^J_m)^2 & \text{if } y^i = c^J, \\ \lambda_m + \epsilon_1 (x^i_m - w^J_m)^2 & \text{otherwise,} \end{cases} \qquad (5)$$

for each $m$, where $\epsilon_1 \in (0,1)$. We add a normalization to obtain $\|\lambda\| = 1$ such that we avoid numerical instabilities of the weighting factors. This update constitutes RLVQ as proposed in Bojer et al. (2001). We remark that this update can be interpreted in a Hebbian way: Assuming the nearest prototype $w^J$ is correct, those weighting factors are decreased only slightly for which the term $(x^i_m - w^J_m)^2$ is small. Taking the normalization of the weighting factors into account, the weighting factors are increased in this situation iff they contribute to the correct classification. Conversely, those factors are increased most for which the term $(x^i_m - w^J_m)^2$ is large if the classification is wrong. Hence, if the classification is wrong, precisely those weighting factors are increased which do not contribute to the wrong classification. Since the error function is not continuous in this case, this yields merely a plausible explanation of the update rule. However, it is not surprising that the method shows instabilities for large data sets which are subject to noise, as we will see later. We can apply the same idea to GLVQ. Then the modification of Eq. (3) which involves the relevance factors
$\lambda_i$ of the metric is accompanied by

$$\lambda_m := \lambda_m - \epsilon_1\, \mathrm{sgd}'(\mu(x^i)) \left( \frac{d_K}{(d_J + d_K)^2} (x^i_m - w^J_m)^2 - \frac{d_J}{(d_J + d_K)^2} (x^i_m - w^K_m)^2 \right) \qquad (6)$$

for each $m$, $w^J$ and $w^K$ being the closest correct or wrong prototype, respectively, and $d_J$ and $d_K$ the respective squared distances in the weighted Euclidean metric. Again, this is followed by normalization. We term this generalization of RLVQ and GLVQ generalized relevance LVQ, or GRLVQ for short. Note that the update can be motivated intuitively by the Hebb paradigm, taking the normalization into account: it comprises the same terms as in Eq. (5). Hence those weighting factors are reinforced most whose corresponding coefficients are closest to the respective data point $x^i$ if this point is classified correctly; otherwise, if $x^i$ is classified incorrectly, those factors are reinforced most whose corresponding coefficients are far away. The difference of Eq. (6) compared to Eq. (5) consists in appropriate situation dependent weightings of the two terms and in the simultaneous update according to the closest correct and closest wrong prototype. Besides, the update rule obeys a gradient dynamics on the corresponding error function (1), as shown in Appendix A. Obviously, the same idea could be applied to any gradient dynamics. We could, for example, minimize a different error function such as the Kullback–Leibler divergence of the distribution which is to be learned and the distribution which is implemented by the vector quantizer. Moreover, this approach is not limited to supervised tasks: we could enlarge unsupervised methods which obey a gradient dynamics, like the neural gas algorithm (Martinetz & Schulten, 1993), with weighting factors in order to obtain an adaptive metric.
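To make the update equations concrete, the following minimal NumPy sketch implements one stochastic GRLVQ step, i.e. the prototype update (3) and the relevance update (6) under the weighted metric (4), followed by a normalization of λ. The names (grlvq_step, weighted_sqdist, prototypes, proto_labels, eps, eps1), the non-negativity clip, and the sum-to-one normalization are our own illustrative choices, not part of the original formulation; learning-rate schedules and initialization are omitted.

```python
import numpy as np

def weighted_sqdist(x, w, lam):
    """Squared weighted Euclidean distance, Eq. (4)."""
    return float(np.sum(lam * (x - w) ** 2))

def grlvq_step(x, y, prototypes, proto_labels, lam, eps=0.1, eps1=0.01):
    """One stochastic GRLVQ update for a labeled sample (x, y).

    prototypes: float array of shape (M, n), modified in place;
    proto_labels: integer array of shape (M,);
    lam: relevance vector of shape (n,), non-negative and normalized.
    """
    d = np.array([weighted_sqdist(x, w, lam) for w in prototypes])
    correct = proto_labels == y
    J = np.flatnonzero(correct)[np.argmin(d[correct])]    # closest correct prototype
    K = np.flatnonzero(~correct)[np.argmin(d[~correct])]  # closest wrong prototype
    dJ, dK = d[J], d[K]
    denom = (dJ + dK) ** 2

    mu = (dJ - dK) / (dJ + dK)
    s = 1.0 / (1.0 + np.exp(-mu))
    sgd_prime = s * (1.0 - s)            # derivative of the sigmoid at mu

    diff_J = x - prototypes[J]
    diff_K = x - prototypes[K]

    # Prototype updates, Eq. (3): attract the closest correct prototype,
    # repel the closest wrong one.
    prototypes[J] += eps * sgd_prime * dK / denom * diff_J
    prototypes[K] -= eps * sgd_prime * dJ / denom * diff_K

    # Relevance update, Eq. (6), followed by a non-negativity clip
    # (a practical safeguard, not spelled out in the text) and normalization.
    lam = lam - eps1 * sgd_prime * (dK / denom * diff_J ** 2
                                    - dJ / denom * diff_K ** 2)
    lam = np.clip(lam, 0.0, None)
    lam = lam / np.sum(lam)
    return prototypes, lam
```

In practice this step is iterated over randomly shuffled training samples for several epochs; Section 4.1 recommends pre-training the prototypes with plain GLVQ (i.e. keeping λ fixed and uniform) before the relevance factors are adapted.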
3. Relation to previous research

The main characteristics of GRLVQ as proposed in Section 2 are as follows: The method allows an adaptive metric via scaling the input dimensions. The metric is restricted to a diagonal matrix. The advantages are the efficiency of the method, the interpretability of the matrix elements as relevance factors, and the associated possibility of pruning. The update proposed in GRLVQ is intuitive and efficient; at the same time a thorough mathematical foundation can be given due to the gradient dynamics. As we will see in Section 4, GRLVQ provides a robust classification system which is appropriate for real-life data. Naturally, various approaches in the literature consider the questions of an adaptive metric, input pruning, and dimensionality determination, too. The most similar approach we are aware of is DSLVQ (Pregenzer et al., 1996). The method introduces weighting factors, too,
and is based on LVQ3. The main advantages of our iterative update scheme compared to the DSLVQ update are threefold: our update is very intuitive and can be explained with Hebbian learning; our method is more efficient since in DSLVQ each update step requires normalization twice; and, which we believe is the most important difference, our update constitutes a gradient descent on an error function, hence the dynamics can be mathematically analyzed and a clear objective can be identified. Recently, Kaski et al. proposed two different approaches which allow an adaptive metric for unsupervised clustering if additional information in an auxiliary space is available (Kaski et al., 2001; Sinkkonen & Kaski, 2002). Their focus lies on unsupervised clustering, and they use a Bayesian framework in order to derive appropriate algorithms. The approach in Kaski et al. (2001) explicitly adapts the metric; however, it needs a model for explaining the auxiliary data. Hence, we cannot apply the method for our purpose, explicit clustering, i.e. developing the model itself. In Sinkkonen and Kaski (2002) an explicit model is no longer necessary. However, the method relies on several statistical assumptions and is derived for soft clustering instead of exact LVQ. One could borrow ideas from Sinkkonen and Kaski (2002); alternatively to the statistical scenario, GRLVQ proposes another direct, efficient, and intuitive approach. Methods as proposed in Gustafson and Kessel (1979) and variations thereof allow an adaptive metric for other clustering algorithms like fuzzy clustering. The algorithm in Gustafson and Kessel (1979) even allows a more flexible metric with non-vanishing entries outside the diagonal; however, such algorithms are naturally less efficient and require a matrix inversion, for example. In addition, well known methods like RBF networks can be put in the same line, since they can provide a clustering with an adaptive metric as well. Commonly, their training is less intuitive and efficient than GRLVQ. Moreover, a more flexible metric which is not restricted to a diagonal matrix no longer suggests a natural pruning scheme. Apart from the flexibility due to an adaptive metric, GRLVQ provides a simple way of determining which data dimensions are relevant: we can just drop those dimensions with the lowest weighting factors until a considerable increase in the classification error is observed. This is a common feature of all methods which determine weighting factors describing the metric. Alternatively, one can use general methods for determining the dimensionality of the data which are not fitted to the classifier LVQ. The most popular approaches are probably ICA and PCA, as already mentioned (Hyvärinen & Oja, 1997; Oja, 1995). Alternatively, one could use the above mentioned GSOM algorithm (Bauer & Villmann, 1997). However, because of its remaining hypercubical structure the results may be inaccurate. Another method is to apply a Grassberger–Procaccia analysis to determine the intrinsic dimension. This method is unfortunately sensitive to noise (Grassberger & Procaccia, 1983; Wienholt, 1996). A wide variety of relevance
determination methods exists in statistics and in the supervised neural network literature, e.g. van Gestel et al. (2001), Grandvalet (1998, 2000), Neal (1996), Roth (2001), Tibshirani (1996), and Tipping (2000). These methods mostly focus on the task of obtaining sparse classifications and do not yield an adaptive metric which could be used in self-organizing metric-based algorithms like LVQ and SOM. Hence a comparison with our method, which primarily focuses on an adaptive metric for self-organizing algorithms, would be interesting, but is beyond the scope of this article.
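The pruning scheme sketched in this section can be made concrete as follows: rank the dimensions by their trained relevance factors and discard the smallest ones as long as the classification error does not increase considerably. This is a minimal sketch; the helper evaluate_error and the tolerance of one percentage point are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def prune_by_relevance(lam, evaluate_error, max_increase=0.01):
    """Greedily drop the dimensions with the smallest relevance factors.

    lam: trained relevance vector; evaluate_error(dims) is assumed to return
    the classification error when only the dimensions in `dims` are used
    (e.g. by zeroing the remaining relevance factors or by retraining).
    """
    order = np.argsort(lam)                  # least relevant dimensions first
    kept = list(range(len(lam)))
    base_error = evaluate_error(kept)
    for dim in order:
        candidate = [d for d in kept if d != dim]
        if not candidate:
            break
        if evaluate_error(candidate) <= base_error + max_increase:
            kept = candidate                 # dimension can be dropped safely
        else:
            break                            # considerable increase: stop pruning
    return kept                              # len(kept) estimates the intrinsic dimension
```

The number of surviving dimensions then serves as the estimate of the intrinsic data dimension discussed above.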
4. Experiments

4.1. Artificial data

We first tested GRLVQ on two artificial data sets from Bojer et al. (2001) in order to compare it with RLVQ. We refer to the sets as data 1 and data 2, respectively. The data comprise clusters with small or large overlap, respectively, in two dimensions, as shown in Fig. 1. We embed the points in $\mathbb{R}^{10}$ as follows: Assume $(x_1, x_2)$ is one data point. Then we add eight dimensions, obtaining a point $(x_1,\dots,x_{10})$. We choose $x_3 = x_1 + \eta_1, \dots, x_6 = x_1 + \eta_4$, where $\eta_i$ comprises Gaussian noise with variances 0.05, 0.1, 0.2, and 0.5, respectively. $x_7,\dots,x_{10}$ contain pure noise which is uniformly distributed in $[-0.5, 0.5]$ and $[-0.2, 0.2]$ or distributed according to Gaussian noise with variances 0.5 and 0.2, respectively (see the sketch following Fig. 1). We refer to the noisy data as data 3 and data 4, respectively. In each run, data are randomly separated into a training and test set of the same size. $\epsilon$ is chosen as constant 0.1, $\epsilon_1$ as 0.01. Since the weighting factors are updated in each step, in contrast to the prototypes, the learning rate for the weighting terms should be smaller than the learning rate for the prototypes. Pre-training with simple LVQ until the prototypes nearly converge is mandatory for RLVQ; otherwise, the classification error is usually large and the results are not stable. It is advisable to train the prototypes with GLVQ for a few hundred epochs before using GRLVQ as well, in order to avoid instabilities. We use two prototypes for each class according to the priorly known distribution. The results on training and test set are comparable in all runs, i.e. the test set accuracy is not worse or only slightly worse than the accuracy on the training set. GRLVQ obtains about the same accuracy as RLVQ on all data sets (see Table 1) and clearly indicates which dimensions are less important via assigning small weighting factors to the less important dimensions, which are known in these examples. Typical weighting factors are the vectors

$$\lambda_{\mathrm{RLVQ}} = (0.5, 0.49, 0.005, 0.005, 0, 0, 0, 0, 0, 0),$$
$$\lambda_{\mathrm{GRLVQ}} = (0.49, 0.4, 0.07, 0.02, 0, 0.02, 0, 0, 0, 0)$$
for data 3, or the vectors

$$\lambda_{\mathrm{RLVQ}} = (0.13, 0.12, 0.12, 0.11, 0.1, 0.09, 0.1, 0.08, 0.07, 0.06),$$
$$\lambda_{\mathrm{GRLVQ}} = (0.28, 0.36, 0.3, 0.05, 0, 0, 0, 0, 0, 0)$$
for data 4, hence clearly separating the important first two data dimensions from the remaining eight dimensions, of which the first four contain some information. This is reflected by the comparably large third weighting term for the second data set. The remaining four dimensions contain no information at all. However, GRLVQ shows faster convergence and larger stability compared to RLVQ, in particular when used for noisy data sets with large overlap of the classes, as for data 4. There the separation of the important dimensions is clearer with GRLVQ than with RLVQ. Concerning RLVQ, pre-training with LVQ and small learning rates were mandatory in order to ensure good results; the same situations turn out to be less critical for GRLVQ, although it is advisable to choose the learning rate for the weighting terms an order of magnitude smaller than the learning rate for the prototype update. These results indicate that GRLVQ is particularly well suited for noisy real life data sets. Based on the above weighting factors one can obtain a ranking of the input dimensions and drop all but the first two dimensions without increasing the classification error.

4.2. Iris data

In a second test we applied GRLVQ to the well known Iris data set provided in the UCI repository of machine learning (Blake & Merz, 1998). The task is to predict three classes of plants based on four numerical attributes in 150 instances, i.e. we deal with data points in $\mathbb{R}^4$ with labels in $\{1, 2, 3\}$. Both LVQ and RLVQ obtain an accuracy of about 0.95 on a training and test set if trained with two prototypes for each class. RLVQ shows a slightly cyclic behavior in the limit, the accuracy changing between 0.94 and 0.96. The computed weighting factors for RLVQ are
$$\lambda_{\mathrm{RLVQ}} = (0.02, 0.01, 0.02, 0.89),$$

indicating that a very good classification would be possible based on the last dimension alone. If more dimensions were taken into account, a better accuracy of about 1.0 would be possible, as reported in the literature. We could not produce such a solution with LVQ or RLVQ. Moreover, a perfect recognition rate of 1.0 would correspond to overfitting, since the data comprise a small amount of noise, as reported in the literature. GRLVQ yields the better accuracy of at least 0.96 on the training as well as the test set and obtains weighting factors of the form
$$\lambda_{\mathrm{GRLVQ}} = (0, 0, 0.4, 0.6),$$

hence indicating that the last dimension is the most important, as already found by RLVQ, and that dimension 3 contributes to a better accuracy, which had not been pointed out by RLVQ.
Fig. 1. Artificial data sets consisting of three classes with two clusters each and small or large overlap, respectively; only the first two dimensions are depicted.
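The 10-dimensional embedding of the two-dimensional sets described in Section 4.1 can be reproduced along the following lines. This is a sketch only: the two-dimensional base points x_base are assumed to be given (e.g. sampled from the clusters of Fig. 1), while the noisy dimensions follow the construction in the text; the function name and seed are illustrative.

```python
import numpy as np

def embed_with_noise(x_base, seed=0):
    """Embed 2D points (shape (m, 2)) into R^10 as described in Section 4.1."""
    rng = np.random.default_rng(seed)
    m = x_base.shape[0]
    x1 = x_base[:, :1]
    # Dimensions 3-6: copies of x_1 plus Gaussian noise of increasing variance.
    copies = np.hstack([x1 + rng.normal(0.0, np.sqrt(v), size=(m, 1))
                        for v in (0.05, 0.1, 0.2, 0.5)])
    # Dimensions 7-10: pure noise, uniform and Gaussian.
    pure = np.hstack([
        rng.uniform(-0.5, 0.5, size=(m, 1)),
        rng.uniform(-0.2, 0.2, size=(m, 1)),
        rng.normal(0.0, np.sqrt(0.5), size=(m, 1)),
        rng.normal(0.0, np.sqrt(0.2), size=(m, 1)),
    ])
    return np.hstack([x_base, copies, pure])
```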
Note that the result obtained by GRLVQ coincides with the results obtained, e.g., with rule extraction from feedforward networks (Duch, Adamczak, & Grabczewski, 2001).

4.3. Satellite data

Finally, we applied the algorithm to a large real world data set: a multi-spectral LANDSAT TM satellite image of the Colorado area (thanks to M. Augusteijn, University of Colorado, for providing this image). Satellites of LANDSAT-TM type produce pictures of the earth in seven different spectral bands. The ground resolution is 30 m × 30 m for bands 1–5 and band 7. Band 6 (thermal band) has a resolution of only 60 m × 60 m and is therefore often dropped. The spectral bands represent useful domains of the whole spectrum for detecting and discriminating vegetation, water, rock formations, and cultural features (Campbell, 1996; Merenyi, 1999). Hence, the spectral information, i.e. the intensity of the bands associated with each pixel of a LANDSAT scene, is represented by a vector in $\mathbb{R}^n$ with $n = 6$. Generally, the bands are highly correlated (Augusteijn, Shaw, & Watson, 1993; Villmann & Merenyi, 2001). Additionally, the Colorado image is completely labeled by experts. There are 14 labels describing different vegetation types and geological formations, and the label probability varies over a wide range (Villmann, 1999). The size of the image is 1907 × 1784 pixels. We trained RLVQ and GRLVQ with 42 prototypes (three for each class) on 5% of the data set till convergence. The algorithm converged in less than 10 cycles if $\epsilon$ and $\epsilon_1$ were chosen as 0.1 and 0.01, respectively, as before. RLVQ yields an accuracy of about 86% on the training data as well as on the entire data set; however, it does not provide a ranking of the input dimensions, i.e. all weighting terms remain close to their initial value 0.16. GRLVQ leads to the better accuracy of 91% on the training set as well as the entire data set and provides a clear ranking of the several data dimensions. See Table 2 for a comparison of the results obtained by the various algorithms. In all experiments, dimension 6 is ranked as least important, with a weighting factor close to 0. The weighting factors approximate

$$\lambda_{\mathrm{GRLVQ}} = (0.1, 0.17, 0.27, 0.21, 0.26, 0)$$

in several runs. This weighting clearly separates the first two dimensions via small weighting factors. If we prune dimensions 6, 1, and 2, an accuracy of 84% can still be achieved. Hence this indicates that the intrinsic data dimension is at most 3. Pruning one additional data dimension, dimension 4, still allows an accuracy of more than 50%, indicating that the intrinsic dimension may be even lower and that the relevant directions are not parallel to the axes or are even curved. These results are shown in Fig. 2, where the misclassified pixels in the respective cases are colored black and the other pixels are colored corresponding to their respective class. For comparison we applied a Grassberger–Procaccia analysis and the GSOM approach. The former estimates the intrinsic dimension as $d_{GP} \approx 3.1414$, whereas GSOM generates a lattice of shape 12 × 7 × 3, hence indicating an intrinsic dimension between 1 and 3. These methods show good agreement with the drastic loss of information observed if more than three dimensions are pruned with GRLVQ.

Fig. 2. Colorado satellite image: the pixels are colored according to the labels; above-left: original labeling; above-right: GRLVQ without pruning; below-left: GRLVQ with pruning of dimensions 1, 2, 6; below-right: GRLVQ with pruning of four dimensions. Misclassified pixels in the GRLVQ-generated images are colored black. (A colored version of the image can be obtained from the authors on request.)

Table 1
Percentage of correctly classified patterns (maximum 100) for the artificial training data without (data 1, data 2) and with (data 3, data 4) additional noisy dimensions, for LVQ, RLVQ, and GRLVQ, respectively
         Data 1   Data 3   Data 2   Data 4
LVQ      91–96    81–89    79–86    56–70
RLVQ     91–96    90–96    80–86    79–86
GRLVQ    94–97    93–97    83–87    83–86
Table 2
Percentage of correctly classified patterns (maximum 100) and variance of the runs on the satellite data, obtained in a 10-fold cross-validation

                    LVQ      RLVQ     GLVQ     GRLVQ
Mean (train)        85.21    86.1     87.32    91.08
Variance (train)    0.59     0.18     0.17     0.11
Mean (test)         85.2     86.36    87.28    91.04
Variance (test)     0.46     0.16     0.1      0.13
5. Conclusions

The presented clustering algorithm GRLVQ provides a new robust method for automatically adapting the Euclidean metric used for clustering to the data, determining the relevance of the several input dimensions for the overall classifier, and estimating the intrinsic dimension of the data. It reduces the input dimensions to the essential parameters, which is required to obtain optimal network structures. This is an important feature if the network is used to reduce the amount of data passed to subsequent systems in complex data analysis tasks, as found in medical applications (image analysis) or satellite remote sensing systems, for example. Here, reducing the amount of data to be transferred while preserving the essential information in the data is one of the most important requirements. The GRLVQ algorithm was successfully tested on artificial as well as real world data, a large and noisy multi-spectral satellite image. A comparison with other
approaches validates the results even in real life applications. It should be noted that the GRLVQ algorithm can easily be adapted to other types of neural vector quantizers such as neural gas or SOM, to mention just a few. Furthermore, it is clear that, if we assume an unknown probability distribution of the labels for a given data set, the variant of GRLVQ discussed here tries to maximize the Kullback–Leibler divergence. Hence, with respect to this feature, we can state some similarities between our approach and the work of Kaski (Kaski et al., 2001; Sinkkonen & Kaski, 2002). Further considerations of GRLVQ should incorporate information theoretic approaches such as entropy maximization to improve the capabilities of the network.
Appendix A

The general error function (1) has here the special form

$$S = \sum_{i=1}^{m} \mathrm{sgd}\left(\frac{d_J - d_K}{d_J + d_K}\right),$$

$d_J$ and $d_K$ being the squared weighted distance to the closest correct or wrong prototype, $w^J$ and $w^K$, respectively. For convenience we denote $d_i = \|x - w^i\|_\lambda^2$ and $D_i = (\lambda_j (x_j - w^i_j))_{j=1,\dots,n}$. Assume that the data come from a distribution $P$ on the input space $\mathbb{R}^n$ and a labeling function $y : \mathbb{R}^n \to \{1,\dots,C\}$. Then the continuous version of the error function reads as

$$\int_{\mathbb{R}^n} \mathrm{sgd}\left(\frac{d_J - d_K}{d_J + d_K}\right) dP(x).$$

We assume that the sets $X_c := \{x \mid y(x) = c\}$ are measurable. Then we can write the error term in the following way:

$$\sum_{c=1}^{C} \int_{X_c} \sum_{j \sqsubset c,\; k \not\sqsubset c} \mathrm{sgd}\left(\frac{d_j - d_k}{d_j + d_k}\right) n(j,c)\, n(k,\neg c)\, dP(x), \tag{A1}$$

where $j \sqsubset c$ denotes the indices of prototypes labeled with $c$, $k \not\sqsubset c$ denotes the indices of prototypes not labeled with $c$, $n(j,c)$ is an indicator function for $w^j$ being the closest prototype to $x$ among those labeled with $c$, and $n(k,\neg c)$ is an indicator function for $w^k$ being the closest prototype to $x$ among those not labeled with $c$. Denote by $H$ the Heaviside function, and denote by $|l \sqsubset c|$ or $|l \not\sqsubset c|$ the number of prototypes labeled with $c$ or not labeled with $c$, respectively. Then we find

$$n(j,c) = H\left(\sum_{l \sqsubset c} H(d_l - d_j) - |l \sqsubset c|\right)$$

and

$$n(k,\neg c) = H\left(\sum_{l \not\sqsubset c} H(d_l - d_k) - |l \not\sqsubset c|\right).$$

The derivative of the Heaviside function is the delta function $\delta$, which is a symmetric function with $\delta(x) = 0$ for $x \ne 0$ and $\int_{\mathbb{R}} \delta(x)\, dx = 1$. We are interested in the derivative of Eq. (A1) with respect to every $w^i$ and every $\lambda_i$, respectively. Assume $c^i$ is the label of $w^i$. Then the derivative of Eq. (A1) with respect to $w^i$ yields

$$\int_{X_{c^i}} \sum_{k \not\sqsubset c^i} \mathrm{sgd}'\left(\frac{d_i - d_k}{d_i + d_k}\right) \frac{4 d_k}{(d_i + d_k)^2}\, D_i\, n(i,c^i)\, n(k,\neg c^i)\, dP(x) \tag{A2}$$

$$+ \sum_{c_l \ne c^i} \int_{X_{c_l}} \sum_{k \sqsubset c_l} \mathrm{sgd}'\left(\frac{d_k - d_i}{d_k + d_i}\right) \frac{-4 d_k}{(d_k + d_i)^2}\, D_i\, n(k,c_l)\, n(i,\neg c_l)\, dP(x) \tag{A3}$$

$$+ \int_{X_{c^i}} \sum_{j \sqsubset c^i,\; k \not\sqsubset c^i} \mathrm{sgd}\left(\frac{d_j - d_k}{d_j + d_k}\right) \frac{\partial n(j,c^i)}{\partial w^i}\, n(k,\neg c^i)\, dP(x) \tag{A4}$$

$$+ \sum_{c_l \ne c^i} \int_{X_{c_l}} \sum_{j \sqsubset c_l,\; k \not\sqsubset c_l} \mathrm{sgd}\left(\frac{d_j - d_k}{d_j + d_k}\right) n(j,c_l)\, \frac{\partial n(k,\neg c_l)}{\partial w^i}\, dP(x). \tag{A5}$$

Eqs. (A2) and (A3) correspond up to a constant factor to the update Eq. (3). Eqs. (A4) and (A5) vanish due to the following reason: Denote by $T(j,k)$ the term $\mathrm{sgd}\big((d_j - d_k)/(d_j + d_k)\big)\, n(j,c^i)\, n(k,\neg c^i)$. The integrand in Eq. (A4) yields

$$\sum_{j \sqsubset c^i,\; k \not\sqsubset c^i} \mathrm{sgd}\left(\frac{d_j - d_k}{d_j + d_k}\right) n(k,\neg c^i)\, \delta\left(\sum_{l \sqsubset c^i} H(d_l - d_j) - |l \sqsubset c^i|\right) \sum_{l \sqsubset c^i} \delta(d_l - d_j) \left(\frac{\partial d_j}{\partial w^i} - \frac{\partial d_l}{\partial w^i}\right)$$

$$= \sum_{j \sqsubset c^i,\, j \ne i,\; k \not\sqsubset c^i} T(j,k)\, \delta(d_i - d_j)\, 2 D_i + \sum_{k \not\sqsubset c^i} T(i,k) \sum_{l \sqsubset c^i,\, l \ne i} \delta(d_l - d_i)\, (-2) D_i.$$

This term vanishes since $\delta$ is symmetric and non-vanishing only for $d_i = d_j$ and $d_l = d_i$, respectively. In the same way, it can be seen that each integrand of Eq. (A5) vanishes. The derivative of Eq. (A1) with respect to $\lambda_i$ can be computed as

$$\sum_{c=1}^{C} \int_{X_c} \sum_{j \sqsubset c,\; k \not\sqsubset c} \mathrm{sgd}'\left(\frac{d_j - d_k}{d_j + d_k}\right) n(j,c)\, n(k,\neg c) \tag{A6}$$

$$\times \left(\frac{2 d_k}{(d_j + d_k)^2} \big(x_i - w^j_i\big)^2 - \frac{2 d_j}{(d_j + d_k)^2} \big(x_i - w^k_i\big)^2\right) dP(x) \tag{A7}$$

$$+ \sum_{c=1}^{C} \int_{X_c} \sum_{j \sqsubset c,\; k \not\sqsubset c} \mathrm{sgd}\left(\frac{d_j - d_k}{d_j + d_k}\right) \tag{A8}$$

$$\times \left(\frac{\partial n(j,c)}{\partial \lambda_i}\, n(k,\neg c) + n(j,c)\, \frac{\partial n(k,\neg c)}{\partial \lambda_i}\right) dP(x). \tag{A9}$$

Eqs. (A6) and (A7) correspond, up to a constant factor, to the update for $\lambda$ in Eq. (6). Eqs. (A8) and (A9) vanish since we obtain for the integrand the following equation:

$$\sum_{j \sqsubset c,\; k \not\sqsubset c} T(j,k) \left[\sum_{l \sqsubset c} \delta(d_l - d_j) \left(\frac{\partial d_l}{\partial \lambda_i} - \frac{\partial d_j}{\partial \lambda_i}\right) + \sum_{l \not\sqsubset c} \delta(d_l - d_k) \left(\frac{\partial d_l}{\partial \lambda_i} - \frac{\partial d_k}{\partial \lambda_i}\right)\right]$$

$$= \sum_{k \not\sqsubset c} \left(\sum_{j \sqsubset c} \sum_{l \sqsubset c} T(j,k)\, \delta(d_l - d_j)\, \frac{\partial d_l}{\partial \lambda_i} - \sum_{j \sqsubset c} \sum_{l \sqsubset c} T(j,k)\, \delta(d_l - d_j)\, \frac{\partial d_j}{\partial \lambda_i}\right)$$

$$+ \sum_{j \sqsubset c} \left(\sum_{k \not\sqsubset c} \sum_{l \not\sqsubset c} T(j,k)\, \delta(d_l - d_k)\, \frac{\partial d_l}{\partial \lambda_i} - \sum_{k \not\sqsubset c} \sum_{l \not\sqsubset c} T(j,k)\, \delta(d_l - d_k)\, \frac{\partial d_k}{\partial \lambda_i}\right).$$

Again, this is zero because of the symmetry of $\delta$ and the fact that $\delta$ is non-vanishing only for $d_l = d_j$ and $d_l = d_k$, respectively. Hence, the update of GRLVQ constitutes a stochastic gradient descent method with appropriate choices of the learning rates.
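As a practical sanity check of the gradient property derived above, the analytic updates can be compared with finite differences of the cost (1) on a small random data set. The following sketch is purely illustrative: it reuses the weighted metric of Eq. (4), and glvq_cost as well as numeric_grad_lambda are our own helper names. Away from the (measure-zero) points where the closest prototypes change, the numeric gradient with respect to λ should agree with the accumulated analytic updates of Eq. (6), up to sign, the learning rate, and the constant factor discussed above.

```python
import numpy as np

def glvq_cost(X, Y, prototypes, proto_labels, lam):
    """Empirical cost (1) with f = sgd and the weighted metric (4)."""
    total = 0.0
    for x, y in zip(X, Y):
        d = np.array([np.sum(lam * (x - w) ** 2) for w in prototypes])
        correct = proto_labels == y
        dJ = np.min(d[correct])      # closest correct prototype
        dK = np.min(d[~correct])     # closest wrong prototype
        mu = (dJ - dK) / (dJ + dK)
        total += 1.0 / (1.0 + np.exp(-mu))
    return total

def numeric_grad_lambda(X, Y, prototypes, proto_labels, lam, h=1e-6):
    """Central finite differences of the cost with respect to lambda."""
    grad = np.zeros_like(lam)
    for i in range(len(lam)):
        lam_plus, lam_minus = lam.copy(), lam.copy()
        lam_plus[i] += h
        lam_minus[i] -= h
        grad[i] = (glvq_cost(X, Y, prototypes, proto_labels, lam_plus)
                   - glvq_cost(X, Y, prototypes, proto_labels, lam_minus)) / (2 * h)
    return grad
```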
References

Augusteijn, M. F., Shaw, K. A., & Watson, R. J. (1993). A study of neural network input data for ground cover identification in satellite images. In S. Gielen, & B. Kappen (Eds.), Proceedings of the ICANN'93, International Conference on Artificial Neural Networks (pp. 1010–1013). London, UK: Springer.
Bauer, H.-U., & Villmann, T. (1997). Growing a hypercubical output space in a self-organizing feature map. IEEE Transactions on Neural Networks, 8(2), 218–226.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine, CA: Department of Information and Computer Science, University of California.
Bojer, T., Hammer, B., Schunk, D., & Tluk von Toschanowitz, K. (2001). Relevance determination in learning vector quantization. Proceedings of European Symposium on Artificial Neural Networks (ESANN'01) (pp. 271–276). Brussels, Belgium: D facto publications.
Campbell, J. (1996). Introduction to remote sensing. USA: The Guilford Press.
Duch, W., Adamczak, R., & Grabczewski, K. (2001). A new method of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks, 12, 277–306.
Fritzke, B. (1995). Growing grid: A self-organizing network with constant neighborhood range and adaptation strength. Neural Processing Letters, 2(5), 9–13.
Gath, I., & Geva, A. (1989). Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11, 773–791.
van Gestel, T., Suykens, J. A. K., de Moor, B., & Vandewalle, J. (2001). Automatic relevance determination for least squares support vector machine classifiers. In M. Verleysen (Ed.), European Symposium on Artificial Neural Networks (pp. 13–18).
Grandvalet, Y. (1998). Least absolute shrinkage is equivalent to quadratic penalization. In L. Niklasson, M. Boden, & T. Ziemke (Eds.), ICANN'98 (Vol. 1) (pp. 201–206). Perspectives in neural computing, Berlin: Springer.
Grandvalet, Y. (2000). Anisotropic noise injection for input variables relevance determination. IEEE Transactions on Neural Networks, 11(6), 1201–1212.
Grassberger, P., & Procaccia, I. (1983). Measuring the strangeness of strange attractors. Physica, 9D, 189–208.
Gustafson, D., & Kessel, W. (1979). Fuzzy clustering with a fuzzy covariance matrix. Proceedings of IEEE CDC'79 (pp. 761–766).
Hofmann, T. (2000). Learning the similarity of documents: An information geometric approach to document retrieval and categorization. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems (Vol. 12) (pp. 914–920). Cambridge, MA: MIT Press.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.
Kaski, S. (1998). Dimensionality reduction by random mapping: Fast similarity computation for clustering. Proceedings of IJCNN'98 (pp. 413–418).
Kaski, S., Sinkkonen, J., & Peltonen, J. (2001). Bankruptcy analysis with self-organizing maps in learning metrics. IEEE Transactions on Neural Networks, 12, 936–947.
Kohonen, T. (1995). Learning vector quantization. In M. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 537–540). Cambridge, MA: MIT Press.
Kohonen, T. (1997). Self-organizing maps. Berlin: Springer.
Martinetz, T., & Schulten, K. (1993). Topology representing networks. Neural Networks, 7(3), 507–522.
Matecki, U. (1999). Automatische Merkmalsauswahl für Neuronale Netze mit Anwendung in der pixelbezogenen Klassifikation von Bildern. Shaker, 1999.
Merenyi, E. (1999). The challenges in spectral image analysis: An introduction and review of ANN approaches. Proceedings of European Symposium on Artificial Neural Networks (ESANN'99) (pp. 93–98). Brussels, Belgium: D facto publications.
Meyering, A., & Ritter, H. (1992). Learning 3D-shape-perception with local linear maps. Proceedings of IJCNN'92 (pp. 432–436).
Neal, R. (1996). Bayesian learning for neural networks. Berlin: Springer.
Oja, E. (1995). Principal component analysis. In M. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 753–756). Cambridge, MA: MIT Press.
Pregenzer, M., Pfurtscheller, G., & Flotzinger, D. (1996). Automated feature selection with distinction sensitive learning vector quantization. Neurocomputing, 11, 19–29.
Ritter, H. (1999). Self-organizing maps in non-euclidean spaces. In E. Oja, & S. Kaski (Eds.), Kohonen maps (pp. 97–108). Berlin: Springer.
Roth, V. (2001). Sparse kernel regressors. In G. Dorffner, H. Bischof, & K. Hornik (Eds.), Artificial neural networks—ICANN 2001 (pp. 339–346). Berlin: Springer.
Sato, A. S., & Yamada, K. (1995). Generalized learning vector quantization. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems (Vol. 7) (pp. 423–429). Cambridge, MA: MIT Press.
Sinkkonen, J., & Kaski, S. (2002). Clustering based on conditional distribution in an auxiliary space. Neural Computation, 14, 217–239.
Somervuo, P., & Kohonen, T. (1999). Self-organizing maps and learning vector quantization for feature sequences. Neural Processing Letters, 10(2), 151–159.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tipping, M. (2000). The relevance vector machine. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems (Vol. 12) (pp. 652–658). Cambridge, MA: MIT Press.
Tsay, M.-K., Shyu, K.-H., & Chang, P.-C. (1999). Feature transformation with generalized learning vector quantization for hand-written Chinese character recognition. IEICE Transactions on Information and Systems, E82-D(3), 687–692.
Villmann, T. (1999). Benefits and limits of the self-organizing map and its variants in the area of satellite remote sensoring processing. Proceedings of European Symposium on Artificial Neural Networks (ESANN'99) (pp. 111–116). Brussels, Belgium: D facto publications.
Villmann, T., & Merenyi, E. (2001). Extensions and modifications of the Kohonen-SOM and applications in remote sensing image analysis. In U. Seiffert, & L. C. Jain (Eds.), Self-organizing maps. Recent advances and applications (pp. 121–145). Berlin: Springer.
Villmann, T., Der, R., Herrmann, M., & Martinetz, T. (1997). Topology preservation in self-organizing feature maps: Exact definition and measurement. IEEE Transactions on Neural Networks, 8(2), 256–266.
Wienholt, W. (1996). Entwurf Neuronaler Netze. Frankfurt/M., Germany: Verlag Harri Deutsch.