Large Datasets Visualization with Neural Network Using Clustered Training Data
Sergėjus Ivanikovas1,2, Gintautas Dzemyda1,2, Viktor Medvedev1,2
1 Institute of Mathematics and Informatics, Akademijos 4, LT-08663 Vilnius, Lithuania
2 Vilnius Pedagogical University, Studentų 39, LT-08106 Vilnius, Lithuania
Abstract. This paper presents the visualization of large datasets with the SAMANN algorithm, using clustering methods to reduce the initial dataset for network training. The visualization of multidimensional data is highly important in data mining because recent applications produce large amounts of data that require specific means for knowledge discovery. One way to visualize a multidimensional dataset is to project it onto a plane. This paper analyzes the visualization of multidimensional data using a feed-forward neural network. We investigate an unsupervised backpropagation algorithm to train a multilayer feed-forward neural network (SAMANN) to perform Sammon's nonlinear projection. The SAMANN network offers the generalization ability of projecting new data. Previous investigations showed that it is possible to train SAMANN using only a part of the analyzed dataset without loss of accuracy. It is very important to select a proper vector subset for the neural network training. One way to construct a relevant training subset is to use clustering. This makes it possible to speed up the visualization of large datasets.
1 Introduction
The research area of this work is the analysis of large multidimensional datasets. Visualization techniques are especially relevant to multidimensional data, the analysis of which is limited by human perception abilities. Visualization is a useful tool for data analysis, especially when the data is unknown. However, when the dimensionality is high, producing a robust visualization is difficult. Therefore, a dimensionality reduction technique is needed. The goal of a projection method is to represent the input data items in a lower-dimensional space so that certain properties of the structure of the dataset are preserved as faithfully as possible. The projection can be used to visualize the dataset if a sufficiently small output dimensionality is chosen. A thorough review of dimensionality reduction methods is given in [4]. This paper focuses on dimensionality reduction methods as a tool for the visualization of large multidimensional datasets.
Several approaches have been proposed for reproducing nonlinear higher-dimensional structures on a lower-dimensional display. The most common methods allocate a representation of each data point in a lower-dimensional space and try to optimize these representations so that the distances between them are as similar as possible to the original distances between the corresponding data items. The methods differ in how the different distances are weighted and how the representations are optimized. Multidimensional scaling (MDS) refers to a group of widely used methods [3]. Nowadays, multidimensional scaling means any method that searches for a low-dimensional (in particular, two-dimensional) representation of multidimensional datasets. The starting point of MDS is a matrix of pairwise dissimilarities between the data vectors. The goal of projection in metric MDS is to optimize the representations so that the distances between the items in the lower-dimensional space are as close to the original distances as possible. Denote the distance between the vectors $X_i$ and $X_j$ in the feature space $R^n$ by $d^*_{ij}$, and the distance between the same vectors in the projected space $R^d$ by $d_{ij}$. In our case, the initial dimensionality is $n$, and the resulting one (denoted by $d$) is 2. Metric MDS tries to approximate $d^*_{ij}$ by $d_{ij}$. Usually, Euclidean distances are used for $d_{ij}$ and $d^*_{ij}$. A particular case of metric MDS is Sammon's mapping [14]. It tries to optimize a cost function that describes how well the pairwise distances in a dataset are preserved:

$$E_S = \frac{1}{\sum_{\substack{i,j=1 \\ i<j}}^{m} d^*_{ij}} \sum_{\substack{i,j=1 \\ i<j}}^{m} \frac{(d^*_{ij} - d_{ij})^2}{d^*_{ij}}$$
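As an illustration of how such a distance-preservation cost can be evaluated, the following minimal sketch computes Sammon's stress in its standard form: the sum over all pairs i < j of (d*_ij - d_ij)^2 / d*_ij, divided by the sum of all d*_ij. It is not taken from the paper. The function names `pairwise_distances` and `sammon_stress` are hypothetical, NumPy is assumed to be available, Euclidean distances are used, and the input vectors are assumed to be pairwise distinct so that no original distance is zero.

```python
import numpy as np


def pairwise_distances(points):
    """Return the full matrix of Euclidean distances between rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def sammon_stress(X, Y):
    """Sammon's stress between original vectors X (m x n) and projections Y (m x d).

    Assumption: all rows of X are distinct, so every original distance
    d*_ij is positive and the division below is well defined.
    """
    # Keep each pair once (i < j) by taking the strict upper triangle.
    upper = np.triu_indices(len(X), k=1)
    d_star = pairwise_distances(X)[upper]  # distances in the feature space
    d = pairwise_distances(Y)[upper]       # distances in the projected space
    return ((d_star - d) ** 2 / d_star).sum() / d_star.sum()


# Illustrative data only: 50 random 5-dimensional vectors, "projected"
# by simply keeping the first two coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Y = X[:, :2]
stress = sammon_stress(X, Y)
```

A projection that reproduces every pairwise distance exactly gives a stress of zero, and larger values indicate a poorer preservation of the original distances. Dividing each squared error by d*_ij gives relatively more weight to small original distances.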