Scalable Data Clustering: A Sammon’s Projection Based Technique for Merging GSOMs

Hiran Ganegedara and Damminda Alahakoon

Cognitive and Connectionist Systems Laboratory, Faculty of Information Technology, Monash University, Australia 3800
{[email protected], [email protected]}
http://infotech.monash.edu/research/groups/ccsl/

Abstract. The Self-Organizing Map and the Growing Self-Organizing Map are widely used techniques for exploratory data analysis. The key desirable features of these techniques are applicability to real world datasets and the ability to visualize high dimensional data in a low dimensional output space. One of the core problems of using SOM/GSOM based techniques on large datasets is the high processing time requirement. A possible solution is the generation of multiple maps for subsets of data, where the subsets together cover the entire dataset. However, the advantage of topographic organization of a single map is lost in this process. This paper proposes a new technique where Sammon’s projection is used to merge an array of GSOMs generated on subsets of a large dataset. We demonstrate that the accuracy of clustering is preserved after the merging process. This technique utilizes the advantages of parallel computing resources.

Key words: Sammon’s projection, growing self organizing map, scalable data mining, parallel computing
1 Introduction
Exploratory data analysis is used to extract meaningful relationships in data when there is little or no a priori knowledge about its semantics. As the volume of data increases, analysis becomes increasingly difficult due to the high computational power requirement. In this paper we propose an algorithm for exploratory data analysis of high volume datasets. The Self-Organizing Map (SOM)[1] is an unsupervised learning technique to visualize high dimensional data in a low dimensional output space. The key issue with increasing data volume is the high computational time requirement, since the time complexity of the SOM is in the order of O(n^2) in terms of the number of input vectors n[2]. Another challenge is the determination of the shape and size of the map. Due to the high volume of the input, identifying a suitable map size by trial and error may become impractical. A number of algorithms have been developed to improve the performance of SOM on large datasets. The Growing Self-Organizing Map (GSOM)[3] is an extension to the SOM algorithm
Using Sammon’s Projection to Merge GSOMs for Scalable Data Clustering
where the map is trained by starting with only four nodes, and new nodes are grown to accommodate the dataset as required. A number of serial algorithms have been proposed for large scale data analysis using SOM[4][5]; however, such algorithms tend to perform less efficiently as the input data volume increases. Thus several parallel algorithms for SOM and GSOM have been proposed in [2], [6] and [7]. The algorithms in [2] and [6] are developed to operate on sparse datasets, with the principal application area being textual classification. In addition, [6] needs access to shared memory during the SOM training phase. Both [2] and [7] rely on an expensive initial clustering phase to distribute data to parallel computing nodes. In [7], no merging technique is suggested for the maps generated in parallel.

Here, we develop a generic scalable GSOM data clustering algorithm which can be trained in parallel and merged using Sammon’s projection[8]. Sammon’s projection is a nonlinear mapping technique from a high dimensional space to a low dimensional space. The GSOM training phase can be made parallel by partitioning the dataset and training a GSOM on each partition. Sammon’s projection is used to merge the separately generated maps. The algorithm can be scaled to work on several computing resources in parallel and can therefore utilize the processing power of parallel computing platforms. The resulting merged map is refined to remove redundant nodes that occur due to the data partitioning method.

This paper is organized as follows. Section 2 describes the GSOM and Sammon’s projection algorithms, which form the background literature for the work presented in this paper. Section 3 describes the proposed algorithm in detail and Section 4 describes the results and comparisons. The paper is concluded with Section 5, stating the implications of this work and possible future enhancements.
2 Background

2.1 Growing Self-Organizing Map
A key decision in SOM is the determination of the size and the shape of the map. In order to determine these parameters, some knowledge about the structure of the input is required. Otherwise, trial and error based parameter selection can be applied. SOM parameter determination could become a challenge in exploratory data analysis, since the structure and nature of the input data may not be known. The GSOM algorithm is an extension to SOM which addresses this limitation. The GSOM starts with four nodes and has two phases, a growing phase and a smoothing phase. In the growing phase, each node accumulates an error value determined by the distance between the best matching unit (BMU) and the input vector. When the accumulated error is greater than the growth threshold, nodes are grown if the BMU is a boundary node. The growth threshold GT is determined by the spread factor SF and the number of dimensions D, and is calculated using

$$GT = -D \times \ln(SF). \tag{1}$$

For every input vector, the BMU is found and the neighborhood is adapted.
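As a rough illustration of Eq. (1) and the growth condition described above, the following minimal Python sketch computes the growth threshold and checks whether a node should grow. The function names, the restriction of SF to the interval (0, 1], and the example values D = 10 and SF = 0.1 are assumptions made for illustration only; they are not taken from the GSOM implementation used in this paper.

```python
import math


def growth_threshold(dimensions: int, spread_factor: float) -> float:
    """Growth threshold of Eq. (1): GT = -D * ln(SF).

    Assumption: the spread factor lies in (0, 1], so GT is non-negative.
    """
    if not 0.0 < spread_factor <= 1.0:
        raise ValueError("spread factor is assumed to lie in (0, 1]")
    return -dimensions * math.log(spread_factor)


def should_grow(accumulated_error: float, gt: float, is_boundary: bool) -> bool:
    """Nodes are grown only when the BMU is a boundary node and its
    accumulated error is greater than the growth threshold."""
    return is_boundary and accumulated_error > gt


# Hypothetical example: 10 dimensions, spread factor 0.1.
gt = growth_threshold(dimensions=10, spread_factor=0.1)
print(round(gt, 3))                  # 23.026
print(should_grow(25.0, gt, True))   # True
print(should_grow(25.0, gt, False))  # False
```

A lower spread factor gives a higher threshold and therefore a smaller, coarser map, which is why the same spread factor should be used when maps are to be compared.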
2.2 Sammon’s Projection
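Sammon’s projection is a nonlinear mapping algorithm from a high dimensional space onto a low dimensional space such that the topology of the data is preserved. Over a number of iterations, the algorithm attempts to minimize Sammon’s stress E, given by

$$E = \frac{1}{\sum_{\mu=1}^{n-1} \sum_{v=\mu+1}^{n} d^*(\mu, v)} \times \sum_{\mu=1}^{n-1} \sum_{v=\mu+1}^{n} \frac{[d^*(\mu, v) - d(\mu, v)]^2}{d^*(\mu, v)}, \tag{2}$$

where d*(µ, v) is the distance between points µ and v in the high dimensional input space and d(µ, v) is the distance between their projections in the low dimensional space.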
Sammon’s projection cannot be used on high volume input datasets due to its time complexity being O(n^2). Therefore, as the number of input vectors n increases, the computational requirement grows quadratically. This limitation has been addressed by integrating Sammon’s projection with neural networks[9].
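To make the stress measure of Eq. (2) and its quadratic cost concrete, the sketch below evaluates E for a given set of points and their projections. It is only an evaluation of the stress, not the iterative minimization itself; the use of Euclidean distance, the function names, and the choice to skip coincident input points are assumptions for illustration.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sammon_stress(high, low):
    """Sammon's stress E of Eq. (2).

    high[i] is point i in the input space, low[i] its projection.
    The loop visits all n(n-1)/2 pairs, hence the O(n^2) cost.
    """
    if len(high) != len(low):
        raise ValueError("each input point needs exactly one projection")
    sum_d_star = 0.0  # normalizing term: sum of d*(mu, v)
    weighted = 0.0    # sum of [d* - d]^2 / d*
    for mu, v in combinations(range(len(high)), 2):
        d_star = euclidean(high[mu], high[v])
        if d_star == 0.0:
            continue  # assumption: coincident input points are skipped
        d = euclidean(low[mu], low[v])
        sum_d_star += d_star
        weighted += (d_star - d) ** 2 / d_star
    if sum_d_star == 0.0:
        raise ValueError("all input points coincide")
    return weighted / sum_d_star


# Points that already lie in a plane project without distortion.
high = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
low = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(sammon_stress(high, low))  # 0.0
```

Collapsing every projected point onto a single location gives a stress of 1, so useful projections have stress values between these two extremes.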
3 The Parallel GSOM Algorithm
In this paper we propose an algorithm which can be scaled to suit the number of parallel computing resources. The computational load on the GSOM primarily depends on the size of the input dataset, the number of dimensions and the spread factor. However, the number of dimensions is fixed and the spread factor depends on the required granularity of the resulting map. Therefore the only parameter that can be controlled is the size of the input, which is the most significant contributor to the time complexity of the GSOM algorithm. The algorithm consists of four stages: data partitioning, parallel GSOM training, merging and refining. Fig. 1 shows the high level view of the algorithm.
Fig. 1. High level view of the parallel GSOM algorithm
3.1 Data Partitioning
The input dataset has to be partitioned according to the number of parallel computing resources available. Two possible partitioning techniques are considered in this paper. The first is random partitioning, where the dataset is partitioned randomly without considering any property of the dataset. Random splitting could be used if the dataset needs to be distributed evenly across the GSOMs.
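Random partitioning as described above could be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the shuffle-then-stride scheme are hypothetical and are not claimed to be the partitioning code used in the experiments. The scheme yields partitions whose sizes differ by at most one record, but it makes no attempt to balance the underlying clusters across partitions.

```python
import random


def random_partition(records, n_parts, seed=None):
    """Shuffle the records and deal them out to n_parts partitions.

    Partition sizes differ by at most one. No property of the data is
    inspected, so cluster membership may be unevenly spread.
    """
    if n_parts < 1:
        raise ValueError("at least one partition is required")
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Striding over the shuffled list gives near-equal partition sizes.
    return [shuffled[i::n_parts] for i in range(n_parts)]


# Example: 683 record indices split for two computing nodes.
parts = random_partition(range(683), n_parts=2, seed=0)
print(sorted(len(p) for p in parts))  # [341, 342]
```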
Random partitioning has the advantage of lower computational load, although an even spread is not always guaranteed. The second technique is splitting based on very high level clustering[10][7]. Using this technique, possible clusters in the data can be identified and SOMs or GSOMs are trained on each cluster. These techniques help in decreasing the number of redundant neurons in the merged map. However, the initial clustering process requires considerable computational time for very large datasets.

3.2 Parallel GSOM Training
After the data partitioning process, a GSOM is trained on each partition in a parallel computing environment. The spread factor and the number of growing phase and smoothing phase iterations should be consistent across all the GSOMs. If random splitting is used, partitions could be of equal size if each computing unit in the parallel environment has the same processing power.

3.3 Merging Process
Once the training phase is complete, the output GSOMs are merged to create a single map representing the entire dataset. Sammon’s projection is used as the merging technique for the following reasons.

a. Sammon’s projection does not include learning. Therefore the merged map will preserve the accumulated knowledge in the neurons of the already trained maps. In contrast, using SOM or GSOM to merge would result in a map that is biased towards the clustering of the separate maps instead of the input dataset.
b. Sammon’s projection will better preserve the topology of the map compared to GSOM, as shown in the results.
c. Sammon’s projection performs faster than techniques with learning.

Neurons generated in the maps resulting from the GSOMs trained in parallel are used as input for the Sammon’s projection algorithm, which is run over a number of iterations to organize the neurons in topological order. This creates a topology preserved merged map of the entire input dataset.

3.4 Refining Process
After merging, the resulting map is refined to remove any redundant neurons. In the refining process, a nearest neighbor based distance measure is used to merge redundant neurons. The refining algorithm is similar to [11]: for each node in the merged map, the distance to the nearest neighbor coming from the same source map, d_1, and the distance to the nearest neighbor from the other maps, d_2, are compared as described in Eq. (3). Neurons are merged if

$$d_1 \geq \beta e^{SF} d_2, \tag{3}$$
where β is the scaling factor and SF is the spread factor used for the GSOMs.
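The test in Eq. (3) could be sketched as below. This is an illustrative reading only: it assumes that distances are Euclidean distances between neuron weight vectors, that each neuron is tagged with the identifier of its source map, and that the threshold takes the form β·e^SF. The function only flags the neurons that satisfy Eq. (3); how flagged neurons are then combined is not specified here and is left out. All names and the example values of β and SF are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def redundant_candidates(nodes, beta, spread_factor):
    """Return indices of neurons satisfying d1 >= beta * e^SF * d2.

    nodes: list of (source_map_id, weight_vector) pairs.
    d1: distance to the nearest neighbor from the same source map.
    d2: distance to the nearest neighbor from any other source map.
    """
    factor = beta * math.exp(spread_factor)
    flagged = []
    for i, (src_i, w_i) in enumerate(nodes):
        same = [euclidean(w_i, w) for j, (src, w) in enumerate(nodes)
                if j != i and src == src_i]
        other = [euclidean(w_i, w) for src, w in nodes if src != src_i]
        if not same or not other:
            continue  # assumption: a neuron without both neighbors is kept
        d1, d2 = min(same), min(other)
        if d1 >= factor * d2:
            flagged.append(i)
    return flagged


# Two hypothetical source maps; neurons 0 and 2 nearly coincide.
nodes = [
    (0, (0.0, 0.0)), (0, (1.0, 0.0)),
    (1, (0.05, 0.0)), (1, (5.0, 0.0)), (1, (5.5, 0.0)),
]
print(redundant_candidates(nodes, beta=1.0, spread_factor=0.1))  # [0, 2]
```

In this example, neurons 0 and 2 come from different maps but sit much closer to each other than to any neuron of their own map, which is the situation the refining step is meant to detect.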
4 Results
We used the proposed algorithm on several datasets and compared the results with a single GSOM trained on the same datasets as a whole. A multi core computer was used as the parallel computing environment, where each core is considered a computing node. Topology of the input data is better preserved in Sammon’s projection than in GSOM. Therefore, in order to investigate the effect of Sammon’s projection, Sammon’s projection was also performed on the map generated by the GSOM trained on the whole dataset, and the result was included in the comparison.

4.1 Accuracy
Accuracy of the proposed algorithm was evaluated using the breast cancer Wisconsin dataset from [12]. Although this dataset may not be considered large, it provides a good basis for cluster evaluation[13]. The parallel run was done on two computing nodes. Records in the dataset are classified as 65.5% benign and 34.5% malignant. The dataset was randomly partitioned into two segments containing 341 and 342 records. Two GSOMs were trained in parallel using the proposed algorithm and another GSOM was trained on the whole dataset. All the GSOMs were trained using a spread factor of 0.1, 50 growing iterations and 100 smoothing iterations. Results were evaluated using three measures of accuracy: DB index, cross cluster analysis and topology preservation.

DB Index. DB Index[14] was used to evaluate the clustering of the map for different numbers of clusters. The K-means[15] algorithm was used to cluster the map for k values from 2 to √n, n being the number of nodes in the map. For exploratory data analysis, DB Index is calculated for each k, and the value of k for which DB Index is minimum is the optimum number of clusters.

Table 1. DB index comparison

k    GSOM     GSOM with Sammon’s Projection    Parallel GSOM
2    0.400    0.285                            0.279
3    0.448    0.495                            0.530
4    0.422    0.374                            0.404
5    0.532    0.381                            0.450
6    0.545    0.336                            0.366
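The selection rule described above (compute the DB Index for each k and take the minimum) can be sketched as follows. The sketch uses the standard Davies-Bouldin definition with Euclidean distances and centroid-based scatter, and rounds √n down; both are assumptions, and the exact variant used for Table 1 may differ. The function names are hypothetical, and the clustering step itself (K-means) is omitted: the function takes cluster labels as given.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a labelled point set (lower is better)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    centers = {c: tuple(sum(col) / len(pts) for col in zip(*pts))
               for c, pts in clusters.items()}
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = {c: sum(euclidean(p, centers[c]) for p in pts) / len(pts)
               for c, pts in clusters.items()}
    total = 0.0
    for i in clusters:
        total += max((scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
                     for j in clusters if j != i)
    return total / len(clusters)


def candidate_ks(n_nodes):
    """k values from 2 to floor(sqrt(n)) for a map with n nodes."""
    return list(range(2, math.isqrt(n_nodes) + 1))


# Picking the optimum k from the Parallel GSOM column of Table 1.
parallel_gsom = {2: 0.279, 3: 0.530, 4: 0.404, 5: 0.450, 6: 0.366}
print(min(parallel_gsom, key=parallel_gsom.get))  # 2
```

For all three columns of Table 1 the minimum falls at k = 2.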
Table 1 shows that the DB Index values are similar for different k values across the three maps. This indicates similar weight distributions across the maps.

Cross Cluster Analysis. Cross cluster analysis was performed between two sets of maps. Table 2 shows how the input vectors are mapped to clusters of the GSOM and the parallel GSOM. It can be seen that 97.49% of the data items mapped to cluster 1 of the GSOM are mapped to cluster 1 of the parallel GSOM. Similarly, 90.64% of the data items in cluster 2 of the GSOM are mapped to the corresponding cluster in the parallel GSOM. The table also displays the cross cluster comparison between the parallel GSOM and the GSOM with Sammon’s projection on the whole dataset.

Table 2. Cross cluster comparison

                           GSOM                      GSOM with Sammon’s Projection
                           Cluster 1    Cluster 2    Cluster 1    Cluster 2
Parallel GSOM Cluster 1    97.49%       9.36%        98.09%       8.1%
Parallel GSOM Cluster 2    2.51%        90.64%       1.91%        91.9%
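A cross cluster table of this kind can be computed from two cluster assignments of the same input vectors. The sketch below is a minimal illustration with hypothetical names and toy labels; it reports, for each cluster of a reference map, the percentage of its data items that fall into each cluster of the other map, so the percentages for each reference cluster sum to 100, as each column of Table 2 does.

```python
from collections import Counter


def cross_cluster_percentages(reference_labels, other_labels):
    """Return {reference_cluster: {other_cluster: percentage}}.

    Both label sequences must describe the same input vectors in the
    same order. Percentages are relative to the reference cluster size.
    """
    if len(reference_labels) != len(other_labels):
        raise ValueError("label sequences must have the same length")
    pair_counts = Counter(zip(reference_labels, other_labels))
    totals = Counter(reference_labels)
    table = {}
    for (ref, other), count in pair_counts.items():
        table.setdefault(ref, {})[other] = 100.0 * count / totals[ref]
    return table


# Toy example: six input vectors clustered by two different maps.
gsom = [1, 1, 1, 1, 2, 2]
parallel = [1, 1, 1, 2, 2, 2]
print(cross_cluster_percentages(gsom, parallel))
# {1: {1: 75.0, 2: 25.0}, 2: {2: 100.0}}
```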
Topology Preservation. A comparison of the degree of topology preservation of the three maps is shown in Table 3. Topographic product[16] is used as the measure of topology preservation, where values closer to zero indicate better preservation. It is evident that maps generated using Sammon’s projection have better topology preservation, leading to better results in terms of accuracy. However, the topographic product scales nonlinearly with the number of neurons. Although this may lead to inconsistencies, it provides a reasonable measure to compare topology preservation in the maps.

Table 3. Topographic product

GSOM        GSOM with Sammon’s Projection    Parallel GSOM
-0.01529    0.00050                          0.00022
Similar results were obtained for other datasets; these are not shown due to space constraints. Fig. 2 shows the clustering of the GSOM, the GSOM with Sammon’s projection and the parallel GSOM. It is clear that the map generated by the proposed algorithm is similar in topology to the GSOM and the GSOM with Sammon’s projection.
Fig. 2. Clustering of maps for breast cancer dataset
4.2 Performance
The key advantage of a parallel algorithm over a serial algorithm is faster performance. We used a dual core computer as the parallel computing environment, where two threads can execute simultaneously on the two cores. The execution time decreases faster than linearly with the number of computing nodes available, since each node trains on a smaller partition. Execution time of the algorithm was compared using three datasets: the breast cancer dataset used for accuracy analysis, the mushroom dataset from [12] and the muscle regeneration dataset (9GDS234) from [17]. The mushroom dataset has 8124 records and 22 categorical attributes, which resulted in 123 attributes when converted to binary. The muscle regeneration dataset contains 12488 records with 54 attributes. The mushroom and muscle regeneration datasets provided a better view of the algorithm’s performance for large datasets. Table 4 summarizes the results for performance in terms of execution time. Fig. 3 shows the results in a graph.

Table 4. Execution time

                 Breast cancer    Mushroom    Microarray
GSOM             4.69             1141        1824
Parallel GSOM    2.89             328         424
Fig. 3. Execution time graph
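The speedup implied by Table 4 can be worked out directly as the ratio of the single GSOM execution time to the parallel GSOM execution time. The short sketch below does this arithmetic using only the values in Table 4; the variable and function names are illustrative.

```python
# Execution times from Table 4 as (GSOM, Parallel GSOM) pairs.
execution_time = {
    "Breast cancer": (4.69, 2.89),
    "Mushroom": (1141, 328),
    "Microarray": (1824, 424),
}


def speedup(serial_time, parallel_time):
    """Ratio of serial to parallel execution time."""
    return serial_time / parallel_time


speedups = {name: round(speedup(serial, parallel), 2)
            for name, (serial, parallel) in execution_time.items()}
print(speedups)
# {'Breast cancer': 1.62, 'Mushroom': 3.48, 'Microarray': 4.3}
```

On the two larger datasets the speedup on two computing nodes exceeds 2. One plausible reading, given the O(n^2) complexity quoted for the SOM in Section 1, is that halving the input reduces the training work of each map by more than half.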
5 Discussion
We propose a scalable algorithm for exploratory data analysis using GSOM. The proposed algorithm can make use of the high computing power provided by parallel computing technologies. This algorithm can be used on any real-life dataset without any knowledge about the structure of the data. When using SOM to cluster large datasets, the width and height of the map should be specified. These parameters may or may not suit the dataset for optimum clustering. This is especially the case with the proposed technique, since the user would have to specify a suitable SOM size and shape for each selected data subset. In the case of large scale datasets, trial and error based width and height selection may not be possible. GSOM has the ability to grow the map according to the structure of the data. Since the same spread factor is used across all subsets, comparable GSOMs will be self generated with data driven size and shape. As a result, although it is possible to use this technique on SOM, it is more appropriate for GSOM.

It can be seen that the proposed algorithm is several times more efficient than the GSOM and gives similar results in terms of accuracy. The efficiency of the algorithm is expected to grow faster than linearly with the number of parallel computing nodes available. As a future development, the refining method will be fine tuned and the algorithm will be tested on a distributed grid computing environment.
6 References
1. Kohonen, T.: The self-organizing map. Proceedings of the IEEE 78(9) (1990) 1464–1480
2. Roussinov, D., Chen, H.: A scalable self-organizing map algorithm for textual classification: A neural network approach to thesaurus generation. Communication Cognition and Artificial Intelligence 15(1-2) (1998) 81–111
3. Alahakoon, D., Halgamuge, S., Srinivasan, B.: Dynamic self-organizing maps with controlled growth for knowledge discovery. Neural Networks, IEEE Transactions on 11(3) (2000) 601–614
4. Ontrup, J., Ritter, H.: Large-scale data exploration with the hierarchically growing hyperbolic SOM. Neural Networks 19(6-7) (2006) 751–761
5. Feng, Z., Bao, J., Shen, J.: Dynamic and adaptive self organizing maps applied to high dimensional large scale text clustering, IEEE (2010) 348–351
6. Lawrence, R., Almasi, G., Rushmeier, H.: A scalable parallel algorithm for self-organizing maps with applications to sparse data mining problems. Data Mining and Knowledge Discovery 3(2) (1999) 171–195
7. Zhai, Y., Hsu, A., Halgamuge, S.: Scalable dynamic self-organising maps for mining massive textual data, Springer 260–267
8. Sammon Jr, J.: A nonlinear mapping for data structure analysis. Computers, IEEE Transactions on 100(5) (1969) 401–409
9. Lerner, B., Guterman, H., Aladjem, M., Dinstein, I., Romem, Y.: On pattern classification with Sammon’s nonlinear mapping: an experimental study. Pattern Recognition 31(4) (1998) 371–381
10. Yang, M., Ahuja, N.: A data partition method for parallel self-organizing map. Volume 3., IEEE 1929–1933 vol. 3
11. Chang, C.: Finding prototypes for nearest neighbor classifiers. Computers, IEEE Transactions on 100(11) (1974) 1179–1184
12. Frank, A., Asuncion, A.: UCI machine learning repository (2010)
13. Bennett, K., Mangasarian, O.: Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software 1(1) (1992) 23–34
14. Ahmad, N., Alahakoon, D., Chau, R.: Cluster identification and separation in the growing self-organizing map: application in protein sequence classification. Neural Computing & Applications 19(4) (2010) 531–542
15. Hartigan, J.: Clustering algorithms. John Wiley & Sons, Inc. (1975)
16. Bauer, H., Pawelzik, K.: Quantifying the neighborhood preservation of self-organizing feature maps. Neural Networks, IEEE Transactions on 3(4) (1992)
17. Edgar, R., Domrachev, M., Lash, A.: Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 30(1) (2002)