Dimension Reduction and Data Visualization Using Neural Networks

Gintautas DZEMYDA 1, Olga KURASOVA, Viktor MEDVEDEV
Institute of Mathematics and Informatics, Vilnius, Lithuania

1 Corresponding Author: Institute of Mathematics and Informatics, Akademijos Str. 4, LT-08663 Vilnius, Lithuania; E-mail: [email protected].

Abstract. The problem of the visual presentation of multidimensional data is discussed. Projection methods for dimension reduction are reviewed. The chapter also deals with artificial neural networks that may be used for dimension reduction and data visualization. The stress is put on combining the self-organizing map (SOM) with Sammon mapping, and on SAMANN, the neural network for Sammon's mapping. Large-scale applications are discussed: environmental data analysis, statistical analysis of curricula, comparison of schools, analysis of the economic and social conditions of countries, analysis of data on the fundus of eyes, and analysis of physiological data on men's health.

Keywords. Visualization, dimension reduction, neural networks, SOM, SAMANN, data mining.

Introduction

For effective data analysis, it is important to include a human in the data exploration process, combining the flexibility, creativity, and general knowledge of the human with the enormous storage capacity and computational power of today's computers. Visual data mining aims at integrating the human into the data analysis process, applying human perceptual abilities to the analysis of large data sets available in today's computer systems [1].

Objects from the real world are frequently described by an array of parameters (variables, features) $x_1, x_2, \ldots, x_n$. The term "object" can cover various things: people, equipment, products of manufacturing, etc. Each parameter takes some numerical value. A combination of the values of all parameters characterizes a particular object $X_j = (x_{j1}, x_{j2}, \ldots, x_{jn})$ from the whole set $X_1, X_2, \ldots, X_m$, where $n$ is the number of parameters and $m$ is the number of analysed objects. $X_1, X_2, \ldots, X_m$ are $n$-dimensional vectors (data). Often they are interpreted as points in the $n$-dimensional space $R^n$, where $n$ defines the dimensionality of the space. In fact, we have a table of numerical data for the analysis: $\{x_{ji},\ j = 1, \ldots, m,\ i = 1, \ldots, n\}$.

A natural idea arises to present the multidimensional data stored in such a table in some visual form. This is a complicated problem, pursued by extensive research, but its solution allows the human to gain a deeper insight into the data, draw conclusions, and directly interact with the data. In Figure 1, we present an example of the visual presentation of a data table ($n = 6$, $m = 20$) using the multidimensional scaling method discussed below. The dimensionality of the data is reduced from 6 to 2.


Here the vectors $X_4$, $X_6$, $X_8$, and $X_{19}$ form a separate cluster that can be clearly observed visually on the plane but cannot be recognized directly from the table without a special analysis.

This chapter is organized as follows. In Section 1, an overview of methods for multidimensional data visualization via dimension reduction is presented; the methods based on neural networks are discussed as well. Section 2 deals with the problem of combining the SOM and Sammon mapping. Investigations of SAMANN, the neural network for Sammon's mapping, are discussed in Section 3. Applications of dimension reduction and data visualization using neural networks are presented in Section 4. They cover the visual analysis of correlations (environmental data analysis and statistical analysis of curricula), multidimensional data visualization applications (comparison of schools and analysis of the economic and social conditions of Central European countries), and the visual analysis of medical data (analysis of data on the fundus of eyes and analysis of physiological data on men's health). Conclusions generalize the results.

Figure 1. Visualization power
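As a hedged illustration of the kind of projection behind Figure 1, the sketch below reduces a synthetic 6-dimensional table of $m = 20$ objects to a plane with metric MDS. The synthetic data, the scikit-learn library, and all names in the code are assumptions for demonstration, not the chapter's original data or implementation.

```python
# Illustrative sketch only: synthetic data and scikit-learn are
# assumptions, not the data or software used in the chapter.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
m, n = 20, 6                      # 20 objects, 6 parameters, as in Figure 1
X = rng.normal(size=(m, n))       # the data table {x_ji}
X[[3, 5, 7, 18]] += 4.0           # shift X_4, X_6, X_8, X_19 to form a cluster

# Metric MDS: find 2-D points whose pairwise distances approximate
# the original 6-D distances.
mds = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
Y = mds.fit_transform(X)          # Y[j] is the 2-D image of X_{j+1}

plt.scatter(Y[:, 0], Y[:, 1])
for j in range(m):
    plt.annotate(f"X{j + 1}", Y[j])
plt.show()
```

In the resulting scatterplot, the four shifted points appear as a visibly separate group, which is exactly the effect the text describes for Figure 1.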

1. Overview of the Dimension Reduction Methods

We discuss below possible approaches to visualizing the vectors $X_1, \ldots, X_m \in R^n$.

A large class of methods has been developed for direct data visualization: graphical presentations of the data set that provide a qualitative understanding of its information content in a natural and direct way, such as parallel coordinates, scatterplots, Chernoff faces, dimensional stacking, etc. (see [2], [3]).
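As a small sketch of one of the direct techniques just listed, the code below draws a parallel-coordinates plot; the pandas helper and the synthetic data frame are illustrative assumptions, not tools referenced by the chapter.

```python
# Sketch of parallel coordinates (one direct visualization technique).
# pandas and the synthetic data are illustrative assumptions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(20, 6)),
                  columns=[f"x{i}" for i in range(1, 7)])
df["group"] = ["A"] * 10 + ["B"] * 10   # class column required by the helper

# Each object X_j becomes a polyline across the six parallel axes.
parallel_coordinates(df, "group")
plt.show()
```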

Another way is to reduce the dimensionality of the data. There exist many so-called projection methods that can be used for reducing the dimensionality and, in particular, for visualizing the $n$-dimensional vectors $X_1, \ldots, X_m \in R^n$. Deep reviews of such methods are given, e.g., by Kaski [4], Kohonen [5], and Kurasova [6]. The discussion below is based mostly on these reviews; it shows the place of neural networks in the general context of methods for reducing the dimensionality of data.

The goal of a projection method is to represent the input data items in a lower-dimensional space so that certain properties of the structure of the data set are preserved as faithfully as possible. The projection can be used to visualize the data set if a sufficiently small output dimensionality is chosen.

One of these methods is principal component analysis (PCA). The well-known PCA [7] can be used to display the data as a linear projection onto a subspace of the original data space that best preserves the variance in the data. Effective algorithms, including neural ones, exist for computing the projection (see, e.g., [8], [9]). However, PCA cannot embrace nonlinear structures consisting of arbitrarily shaped clusters or curved manifolds, since it describes the data in terms of a linear subspace. Projection pursuit [10] tries to express some nonlinearities, but if the data set is high-dimensional and highly nonlinear, it may be difficult to visualize it with linear projections onto a low-dimensional display even if the "projection angle" is chosen carefully.

Several approaches have been proposed for reproducing nonlinear higher-dimensional structures on a lower-dimensional display. The most common methods allocate a representation for each data point in the lower-dimensional space and try to optimize these representations so that the distances between them are as similar as possible to the distances between the corresponding original data items. The methods differ in how the different distances are weighted and how the representations are optimized.

Multidimensional scaling (MDS) refers to a widely used group of such methods [11]. The starting point of MDS is a matrix consisting of pairwise dissimilarities of the data vectors. In general, the dissimilarities need not be distances in the mathematically strict sense. There exists a multitude of variants of MDS with slightly different cost functions and optimization algorithms. The first MDS methods, for metric data, were developed in the 1930s (historical treatments and introductions to MDS are given, e.g., in [12], [13]) and were later generalized for analysing nonmetric data. MDS algorithms can be roughly divided into two basic types: metric and nonmetric MDS.

The goal of projection in the metric MDS is to optimize the representations so that the distances between the items in the lower-dimensional space are as close to the original distances as possible. Denote the distance between the vectors $X_i$ and $X_j$ in the feature space $R^n$ by $d_{ij}^*$, and the distance between the same vectors in the projected space $R^d$ by $d_{ij}$. In our case, the initial dimensionality is $n$, and the resulting one (denote it by $d$) is 2. The metric MDS chooses the projected distances $d_{ij}$ so that they approximate the original distances $d_{ij}^*$. If a square-error cost is used, the objective function to be minimized can be written as

$$E_{MDS} = \sum_{\substack{i,j=1 \\ i<j}}^{m} w_{ij} \left( d_{ij}^{*} - d_{ij} \right)^{2}. \qquad (1)$$
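As a concrete reading of Eq. (1), the sketch below evaluates $E_{MDS}$ for a given 2-D configuration; the function name and the default unit weights $w_{ij} = 1$ are illustrative assumptions.

```python
# Evaluate the metric MDS stress of Eq. (1) for a candidate embedding.
# The function name and the default unit weights are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import pdist

def mds_stress(X, Y, weights=None):
    """E_MDS = sum_{i<j} w_ij * (d*_ij - d_ij)^2.

    X : (m, n) original data;  Y : (m, d) projected points.
    """
    d_star = pdist(X)            # condensed vector of d*_ij over all i < j
    d = pdist(Y)                 # corresponding d_ij in the projected space
    w = np.ones_like(d_star) if weights is None else weights
    return np.sum(w * (d_star - d) ** 2)
```

A random 2-D configuration typically yields a large stress value; an MDS optimizer lowers it iteratively by moving the projected points.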

Frequently, weights of the form

$$w_{ij} = \frac{1}{d_{ij}^{*} \sum\limits_{\substack{k,l=1 \\ k<l}}^{m} d_{kl}^{*}}$$

are used; with this choice of $w_{ij}$, the function (1) becomes the stress of Sammon's mapping, considered in the subsequent sections.
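Continuing the sketch after Eq. (1), the snippet below evaluates the stress with these Sammon-style weights on synthetic data; all data and names are illustrative assumptions, not the chapter's code.

```python
# Sammon's stress: Eq. (1) with w_ij = 1 / (d*_ij * sum_{k<l} d*_kl).
# Synthetic data and names are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))   # original 6-D objects
Y = rng.normal(size=(20, 2))   # some 2-D configuration to be improved

d_star = pdist(X)                    # condensed d*_ij, i < j (all > 0 here)
d = pdist(Y)                         # d_ij in the projected space
w = 1.0 / (d_star * d_star.sum())    # the weights given above
E_sammon = np.sum(w * (d_star - d) ** 2)
print(E_sammon)
```

This weighting emphasizes the preservation of small original distances, which is what makes Sammon's mapping sensitive to local structure.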