Determining Relevant Input Dimensions for the Self Organizing Map

Thorsten Bojer¹, Barbara Hammer¹, Marc Strickert¹, and Thomas Villmann²

¹ Department of Mathematics/Computer Science, University of Osnabrück, D-49069 Osnabrück, Germany, [email protected]
² Clinic for Psychotherapy and Psychosomatic Medicine, University of Leipzig, Karl-Tauchnitz-Straße 25, D-04107 Leipzig, Germany, [email protected]

Abstract. We propose a method to determine the relevance of the different input dimensions for a self organizing map (SOM). First, a growing self organizing map is adapted to the data. Afterwards, the effect of the input dimensions on the clustering or the topology of the SOM, respectively, is computed, and the data dimensions which are ranked low are pruned. The algorithm is applied to real life satellite image data. The results are verified by visualizing the data as RGB images as well as by explicitly computing the classification error.

1 Introduction

Kohonen's self organizing map provides a powerful and biologically motivated tool which computes a topologically adequate clustering of training data [8]. For this purpose, a lattice of neurons is adapted to the data such that the lattice, which is chosen a priori, fits the unknown topology of the data as well as possible. There exist various possibilities of further processing the SOM's output: a classification can be approximated by means of attaching labels or local linear maps to the single neurons; substituting the data by their nearest neurons provides a compact representation of the data; the lattice of the SOM allows data visualization as well as structured data mining [6-8, 11, 13]. Hence SOMs are widely used for both supervised and unsupervised learning tasks in robotics, image processing, data mining, and other areas of application. One crucial property of SOMs is the fact that they compute a topologically adequate representation of the data provided the lattice is chosen appropriately. Hence various adaptations and modifications of the standard SOM exist in order to measure the degree of topology preservation or to adapt the lattice structure to the data [1-3, 9, 14].

Often, data are high dimensional. Satellite remote sensing image data comprise anything from the few spectral bands provided by the Landsat Thematic Mapper up to several hundred dimensions for hyperspectral imagers. The larger the dimensionality, the more storage capacity and processing time are required for data processing. Furthermore, adequate visualization and data mining become difficult for high dimensional data. Hence input pruning, i.e. determining which data dimensions are relevant and which dimensions can be dropped for further processing, is a highly interesting topic [15]. This task is particularly challenging if unsupervised training methods like SOMs are considered: they do not provide a clear objective like the prediction error, so it is necessary to find intrinsic characteristics of the unsupervised network instead. We will deal with pruning methods for SOMs and their application to Landsat TM data in the following. We use a growing SOM (GSOM) for training in order to guarantee an adequate topology. We measure the influence of the input dimensions on the clustering or the topological structure, respectively, and afterwards prune the irrelevant dimensions. The methods are evaluated on labeled satellite data for which the classification error can be quantified.

2 The Self Organizing Map

Assume a finite set of training data $X = \{x^1, \ldots, x^m\} \subset \mathbb{R}^n$ is given. We denote the components of a vector $x$ by $x_i$. A SOM as proposed by Kohonen [8] consists of a set of neurons or codebooks $w^1, \ldots, w^N \in \mathbb{R}^n$ together with a neighborhood structure of the neurons. We write $i \sim j$ iff codebooks $i$ and $j$ are neighbored. Denote by $d_G(i,j)$ the minimum length of a path from $i$ to $j$ in this neighborhood graph. Often, the neighborhood graph has a regular low dimensional grid structure which can be specified by a tuple $(n_1, \ldots, n_d)$, $d$ denoting the grid dimensionality and $n_l$ denoting the number of neurons in the respective grid dimension. Denote by $R_j = \{x \in X \mid |x - w^j| \le |x - w^k| \text{ for all } k\}$ the receptive field of the $j$th codebook. After determining a neighborhood graph, the codebooks are trained with the recursive update
$$
w^j \mapsto w^j + \eta \, \mathrm{e}^{-d_G(i_0, j)/\sigma^2}\,(x - w^j) \qquad \text{for all } j,
$$
where $x$ is chosen at random from $X$, $i_0$ denotes the winner neuron, i.e. $|x - w^{i_0}| \le |x - w^j|$ for all $j$, and $\eta$ and $\sigma$ are positive learning rates which are often decreased during training in order to ensure convergence. This algorithm spreads the codebooks onto the data such that the topology of the data matches the topology of the lattice of neurons. Roughly speaking, topology preservation means that neurons $i$ and $j$ are neighbored in the lattice of neurons, i.e. $i \sim j$, iff the codebooks $w^i$ and $w^j$ are neighbored in the data manifold, i.e. $R_i \cap R_j \neq \emptyset$. There exist various approaches which state an exact definition or compute the degree of neighborhood preservation in an efficient way [1-3, 9, 14]. Obviously, a faithful representation is possible if and only if the lattice of neurons which is chosen a priori fits the unknown topology of the training data. The GSOM generalizes the above learning algorithm such that the lattice is determined during training [2]: starting from a minimal lattice, codebooks are added to an already existing lattice dimension, or a new lattice dimension is attached, depending on the observed deviation of the training data from the codebooks. GSOM has the advantage of preserving a regular grid structure while guaranteeing a maximum degree of topology preservation.
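To make the training procedure concrete, the following is a minimal sketch of the recursive update in Python/NumPy, assuming a rectangular lattice whose graph distance is the Manhattan distance on the grid; the function and parameter names (`som_step`, `eta`, `sigma`) are illustrative and not taken from the cited implementations. A GSOM would additionally monitor the deviation of the data from the codebooks and enlarge the grid shape accordingly, which is omitted here.

```python
import numpy as np

def grid_distance(shape, i, j):
    """Path length between neurons i and j in the neighborhood graph of a
    regular grid: the Manhattan distance between their grid coordinates."""
    ci = np.unravel_index(i, shape)
    cj = np.unravel_index(j, shape)
    return sum(abs(a - b) for a, b in zip(ci, cj))

def som_step(codebooks, shape, x, eta=0.1, sigma=1.0):
    """One recursive update: every codebook w^j moves towards the randomly
    drawn sample x, weighted by exp(-d_G(i0, j) / sigma^2), i0 the winner."""
    winner = int(np.argmin(np.linalg.norm(codebooks - x, axis=1)))
    for j in range(len(codebooks)):
        h = np.exp(-grid_distance(shape, winner, j) / sigma**2)
        codebooks[j] += eta * h * (x - codebooks[j])

# toy usage: adapt a 5x5 lattice to random 2D data
rng = np.random.default_rng(0)
data = rng.random((1000, 2))
codebooks = rng.random((25, 2))
for t in range(5000):
    som_step(codebooks, (5, 5), data[rng.integers(len(data))],
             eta=0.5 * 0.99 ** (t // 100), sigma=1.0)
```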

3 Relevance Measures

The experiments in [5, 7, 10] indicate that a considerable reduction of the input dimensionality is possible without losing much information in several applications. At the same time, theoretical results as proved in [12] restrict the possibility of efficiently finding reduced representations for general data. Hence we have to rely on heuristics adapted to the specific situation we are dealing with. We propose pruning algorithms for the SOM which are related to methods proposed for the neural gas algorithm and successfully applied to artificial data in [5]. We define a relevance function $r: \{1, \ldots, n\} \to \mathbb{R}$ such that a high value $r(i)$ indicates that dimension $i$ is relevant, whereas a low value $r(i)$ indicates that dimension $i$ could be pruned. The dimensions with the lowest values $r(i)$ are pruned. There exist various possibilities of defining $r$ appropriately:

Dispersion: We measure to what extent the variation around the codebook vectors is reduced in the respective dimension. For dimension $i$ we compare the variation within the receptive fields, $\sum_j \sum_{x \in R_j} (x_i - w^j_i)^2$, to the overall variation of the data in this dimension; if the variation is considerably reduced, then the respective input dimension is important. We denote this measure by $r_1$.
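As an illustration, the following sketch computes one possible formalization of this dispersion measure, namely one minus the ratio of the variation around the winning codebooks to the total variation per dimension; the exact normalization used in the experiments may differ, and all names are illustrative.

```python
import numpy as np

def dispersion_relevance(data, codebooks):
    """One possible dispersion measure per input dimension: one minus the
    ratio of the variation around the winning codebooks to the total
    variation. Large values indicate relevant dimensions."""
    # assign every data point to the receptive field of its closest codebook
    winners = np.argmin(
        np.linalg.norm(data[:, None, :] - codebooks[None, :, :], axis=2), axis=1)
    within = np.zeros(data.shape[1])
    for j in range(len(codebooks)):
        members = data[winners == j]                      # receptive field R_j
        if len(members):
            within += ((members - codebooks[j]) ** 2).sum(axis=0)
    total = ((data - data.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - within / np.maximum(total, 1e-12)
```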

Weight function: The SOM is described by the mapping of an input vector to the winning codebook vector; this mapping can be approximated using softmin via
$$
x \mapsto \sum_j w^j \, \frac{\mathrm{e}^{-|x - w^j|^2/\sigma^2}}{\sum_k \mathrm{e}^{-|x - w^k|^2/\sigma^2}},
$$
where $\sigma > 0$. The effect of dimension $i$ on this mapping can be measured via the derivative with respect to the $i$th input dimension; the larger this value, the more important is the respective dimension. We denote this measure by $r_2$.

Topology preservation: The fact that the receptive fields of $w^i$ and $w^j$ intersect is – for reasonably well behaved data – equivalent to the fact that the point in between, $(w^i + w^j)/2$, is closest to $w^i$ and $w^j$. This test can be approximated by the sum
$$
\sum_{k \neq i,j} \mathrm{sgd}\!\left(\left|\frac{w^i + w^j}{2} - w^k\right|^2 - \left|\frac{w^i + w^j}{2} - w^i\right|^2\right),
$$
sgd being the standard sigmoidal function. Those dimensions are not important which change the above value only slightly: in this case, only those receptive fields intersect whose neurons are neighbored in the lattice of neurons. Hence the $i$th component is important if the derivative of the above term with respect to the $i$th input dimension is large. We denote this measure by $r_3$. Note that it only depends on the lattice of the SOM and not on the training data, hence it can be computed very efficiently. Moreover, we can further reduce the above sum to only those neurons $k$ which are close to the neurons $i$ or $j$, since a change in the topology will most likely manifest itself in local changes of the neighborhood structure. We refer to this restricted measure as $r_3^k$, where $k$ denotes the maximum distance of the considered neurons from $i$ and $j$ in the neighborhood graph.
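The following sketch illustrates the weight function measure: it implements the softmin approximation of the winner-takes-all mapping and estimates the relevance of each dimension by a numerical derivative averaged over the data. The choice of `sigma`, the central difference, and the averaging are illustrative assumptions rather than prescriptions from the text.

```python
import numpy as np

def softmin_map(x, codebooks, sigma=1.0):
    """Soft approximation of the mapping x -> winning codebook: a convex
    combination of the codebooks with softmin weights on |x - w^j|^2."""
    d2 = ((codebooks - x) ** 2).sum(axis=1)
    a = np.exp(-d2 / sigma**2)
    return (a[:, None] * codebooks).sum(axis=0) / a.sum()

def weight_relevance(data, codebooks, sigma=1.0, h=1e-4):
    """Relevance of each input dimension as the mean magnitude of the numerical
    derivative of the softmin mapping with respect to that dimension
    (a subsample of the data can be used to keep this cheap)."""
    n = data.shape[1]
    rel = np.zeros(n)
    for x in data:
        for i in range(n):
            e = np.zeros(n)
            e[i] = h
            diff = (softmin_map(x + e, codebooks, sigma)
                    - softmin_map(x - e, codebooks, sigma))
            rel[i] += np.linalg.norm(diff) / (2 * h)
    return rel / len(data)
```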


4 Application to Satellite Data

Note that the above measures for the significance of the different input dimensions do not refer to a supervised learning task but rely on intrinsic characteristics of the SOM. We test these methods on a LANDSAT-TM satellite image from the Colorado area, U.S.A., for which a complete labeling of the inputs into different classes of vegetation is available¹. Hence we can compare the results obtained via unsupervised methods explicitly with the classification error on the data. The data comprise six input dimensions and on the order of a million data points. Since the above methods crucially depend on the fact that the topology of the SOM fits the data topology, we use a GSOM approach which leads to a three-dimensional lattice of neurons.

¹ Thanks to M. Augusteijn (University of Colorado) for providing the data.

Table 1. Ranking of the input dimensions induced by the various significance measures.

dimension | acc  | $r_1$ | $r_2$ | $r_3$ | $r_3^1$ | $r_3^2$
1         | 0.81 | 0.09  | 0.49  | 1.12  | 1.04    | 1.12
2         | 0.82 | 0.15  | 0.46  | 0.69  | 0.64    | 0.69
3         | 0.82 | 0.15  | 0.48  | 0.53  | 0.49    | 0.53
4         | 0.66 | 0.62  | 1.0   | 3.0   | 3.0     | 3.0
5         | 0.81 | 0.12  | 0.54  | 1.79  | 1.62    | 1.79
6         | 0.84 | 0.11  | 0.5   | 1.25  | 1.12    | 1.25
ranking   | 244126 | 622145 | 465123 | 456123 | 456123 | 456123

The topographic product, which quantifies the degree of topology preservation (a value of 0 indicates perfect agreement), yields a value close to 0 and hence a nearly perfect fit [1]. This finding agrees with the results of a Grassberger-Procaccia analysis [4] which yields a low intrinsic dimension of the data. A standard PCA shows that one eigenvalue is dominant and only a few further dimensions contain relevant information, where these intrinsic dimensions do not necessarily coincide with the Euclidean coordinates.

Fig. 1 displays the labeled image and an RGB-representation based on the unsupervised SOM only: each neuron in the SOM represents one color with RGB values according to its position in the three-dimensional lattice, and all data points in the receptive field of a neuron are displayed in the same color as the neuron. Although no label information has been used for the second representation, one can observe a good agreement of the two images. We compute a posterior labeling of the codebooks depending on the data: $w^j$ is equipped with a vector $(p^j_1, \ldots, p^j_L)$, $L$ denoting the number of different labels, where $p^j_l$ denotes the percentage of points in the receptive field of $w^j$ labeled with $l$. The label with maximum $p^j_l$ is attached to the codebook $w^j$. Hence the SOM induces a mapping which maps an input to the label of the closest codebook vector. This function misclassifies a certain percentage of the data; we refer to the corresponding accuracy as acc. Note that a large number of misclassifications is due to the fact that the class boundaries lie within the receptive fields of the codebooks, since no label information is used for training.

In addition to the above relevance functions $r_1$, $r_2$, $r_3$, $r_3^1$, and $r_3^2$, we measure the effect of pruning the $i$th input dimension on the labeling function, i.e. on the percentage of misclassifications according to acc. The results, together with the induced ranking of the input dimensions, are collected in Tab. 1. All methods indicate that dimension 4 is most important. An RGB image based on this dimension only is shown in Fig. 1. Moreover, the induced ranking does not differ much between the various relevance measures. In particular, the very efficient measure $r_3^1$, which depends on neighbored neurons and their immediate neighborhood only, provides an accurate estimation of $r_3$, which depends on the whole neighborhood, as well as of $r_2$, which additionally depends on the data. Iteratively pruning the dimensions ranked low allows us to drop all but two dimensions while retaining most of the classification accuracy, as depicted in Fig. 1. The RGB images obtained after pruning some of the lowest ranked input dimensions are also displayed in Fig. 1.

Since an explicit labeling is usually not available for unsupervised methods, we need an adequate stopping criterion for input pruning. Note that the relevance factors according to $r_3$ form well separated clusters of dimensions; hence stopping after pruning all dimensions of a specific cluster, together with some prior estimation of the intrinsic dimensionality of the data, could be a reasonable and efficient strategy for a pruning method.
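To illustrate how the evaluation described in this section can be carried out, the following sketch computes the posterior majority labeling of the codebooks, the induced classification accuracy, and a simple pruning step that keeps the highest ranked dimensions. Retraining or re-evaluating the map after each pruning step, as well as the cluster-based stopping heuristic, are left out; all names are illustrative.

```python
import numpy as np

def label_codebooks(data, labels, codebooks):
    """Posterior labeling: each codebook receives the majority label of the data
    points in its receptive field (labels are assumed to be coded as 0..L-1)."""
    winners = np.argmin(
        np.linalg.norm(data[:, None, :] - codebooks[None, :, :], axis=2), axis=1)
    cb_labels = np.zeros(len(codebooks), dtype=int)
    for j in range(len(codebooks)):
        member_labels = labels[winners == j]
        if len(member_labels):
            cb_labels[j] = np.bincount(member_labels).argmax()
    return winners, cb_labels

def accuracy(data, labels, codebooks):
    """Fraction of data points whose closest codebook carries the correct label."""
    winners, cb_labels = label_codebooks(data, labels, codebooks)
    return float((cb_labels[winners] == labels).mean())

def prune_to(data, codebooks, relevance, keep):
    """Restrict data and codebooks to the `keep` most relevant input dimensions."""
    order = np.argsort(relevance)[::-1][:keep]
    return data[:, order], codebooks[:, order], order
```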

5 Conclusions

The presented pruning methods provide a robust tool for efficiently determining relevant input dimensions for an unsupervised SOM. They rely on intrinsic characteristics of the SOM and hence do not require explicit label information for the data. Therefore they are particularly suitable for data mining applications or the visualization of data with an unknown structure. An application to real life satellite data, for which an additional labeling is available, showed promising results. In particular, pruning according to topology preservation is a very effective method which, additionally, suggests a natural stopping point in combination with a prior estimation of the intrinsic dimensionality. Further work has to be done on adequate stopping criteria for the other relevance measures.

References

1. H.-U. Bauer and K. R. Pawelzik. Quantifying the neighborhood preservation of Self-Organizing Feature Maps. IEEE Transactions on Neural Networks, 3(4):570-579, 1992.
2. H.-U. Bauer and T. Villmann. Growing a Hypercubical Output Space in a Self-Organizing Feature Map. IEEE Transactions on Neural Networks, 8(2):218-226, 1997.
3. B. Fritzke. Growing grid: a self-organizing network with constant neighborhood range and adaptation strength. Neural Processing Letters, 2(5):9-13, 1995.
4. P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors. Physica D, 9:189-208, 1983.
5. B. Hammer and T. Villmann. Input pruning for neural gas architectures. To appear at ESANN'01.
6. T. Honkela, S. Kaski, T. Kohonen, and K. Lagus. Self-organizing maps of very large document collections: Justification for the WEBSOM method. In I. Balderjahn, R. Mathar, and M. Schader, editors, Classification, Data Analysis, and Data Highways, pages 245-252. Springer, Berlin, 1998.
7. S. Kaski. Dimensionality reduction by random mapping: fast similarity computation for clustering. In Proceedings of IJCNN'98, pages 413-418, 1998.
8. T. Kohonen. Self-Organizing Maps. Springer, 1997.
9. T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507-522, 1994.
10. U. Matecki. Automatische Merkmalsauswahl für Neuronale Netze mit Anwendung in der pixelbezogenen Klassifikation von Bildern. Shaker, 1999.
11. A. Meyering and H. Ritter. Learning 3D-shape-perception with local linear maps. In Proceedings of IJCNN'92, pages 432-436, 1992.
12. R. Nock and M. Sebban. Sharper bounds for the hardness of prototype and feature selection. In H. Arimura, S. Jain, and A. Sharma, editors, Algorithmic Learning Theory, pages 224-237. Springer, 2000.
13. A. Ultsch. Self-organizing neural networks for visualization and classification. In O. Opitz, B. Lausen, and R. Klar, editors, Information and Classification, pages 307-313, London, UK, 1993. Springer.
14. T. Villmann, R. Der, M. Herrmann, and T. Martinetz. Topology Preservation in Self-Organizing Feature Maps: Exact Definition and Measurement. IEEE Transactions on Neural Networks, 8(2):256-266, 1997.
15. T. Villmann and E. Merényi. Extensions and modifications of SOM and its application in satellite remote sensing processing. In H. Bothe and R. Rojas, editors, Neural Computation 2000, pages 765-771, Zürich, 2000. ICSC Academic Press.

"accuracy" "dispersion" "weight_function" "topology"

1 0.8 0.6 0.4 0.2 0 0

1

2

3

4

5

Fig. 1. First row, left: classification according to the labeling, colors result from a specific color map; first row, right: RGB-visualization according to the SOM; second row, left: RGB-visualization with two dimensions pruned; second row, right: RGB-visualization with four dimensions pruned; third row, left: RGB-visualization based on dimension 4 only; third row, right: decrease of the accuracy if the input dimensions are pruned according to the various measures.
