A Hybrid Parallel SOM Algorithm for Large Maps in Data-Mining

Bruno Silva and Nuno Marques

CENTRIA - Centro de Inteligência Artificial, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Portugal
[email protected] [email protected]
http://centria.fct.unl.pt/

Abstract. We propose a method for a parallel implementation of the Self-Organizing Map (SOM) algorithm, widely used in data-mining. We call this method Hybrid in the sense that it combines the advantages of the common network-partition and data-partition approaches, and it is particularly effective when dealing with large maps. Based on the fact that a global topological ordering of the map is achieved in a short period of time, the proposed method obtains this ordering during the initial epochs using the Batch Data-Partition algorithm. We then calculate the input data histogram over the map, based on which the map is segmented and the respective input vectors redistributed equally. From then on, until the next segmentation, each node processes only its subset of samples in its region of the map. Our experimental results show an average speed-up of 1.27 compared to the classical Batch data-partition method, while maintaining the topological information of the maps.

Key words: Self-Organizing Map, Parallel learning algorithms, Data-Mining

1 Introduction

Data mining involves sorting through large amounts of data and picking out relevant information. It is usually used by businesses, intelligence organizations, and financial analysts, but is increasingly used by scientists to extract information from the enormous data sets generated by modern experimental and observational methods. The Self-Organizing Map (SOM) is a neural network model proposed by Kohonen [1] in 1982. It is used to project and visualize high-dimensional data in lower dimensions, typically 2D, while preserving the relationships among the input data, which makes it a powerful data-mining tool [2–5]. The current work focuses on the application of the SOM algorithm to large maps and its parallelization, devising a more time-efficient method for the training process while maintaining the topological properties of the maps.

The SOM has proved to be one of the most robust and useful algorithms for classification and dimensionality reduction. However, the computation involved
in the algorithm is extremely high, and its application may become infeasible in terms of computational time, especially when dealing with large volumes of data and/or with large maps. Both of these factors must be taken into account since, nowadays, data-mining techniques have to deal with an increasing volume of information, and large maps offer several advantages. These advantages range from the greater detail that can be obtained with a large number of neurons, in terms of information extraction and inspection, to the use of large maps to observe emergent phenomena (Emergent SOM). In the latter case, the map space is regarded as a tool for the characterization of the otherwise inaccessible high-dimensional data space. A characteristic of this SOM usage is the large number of neurons: thousands or tens of thousands of neurons are used. Such SOMs allow the emergence of intrinsic structural features of the data space [6–8]. The obvious drawback is the computational cost of training these very large maps. Therefore, the parallelization of the SOM algorithm is of great interest.

In general, most parallel implementations are either network-partition or data-partition based. In the first, parallelization is achieved by segmenting the original map into smaller sub-maps. In the second, data-partitioning techniques break the set of input data down into smaller disjoint sets. This paper introduces the SOM algorithm and surveys the common techniques used for parallelizing the SOM. Then, a proposal for a hybrid method that segments both the map and the input data is made, along with the preliminary results that were obtained. These results show that an average speed-up of 1.27 is obtained, compared to the original Batch data-partition method.

2 The Self-Organizing Map

The Self-Organizing Map is a very popular artificial neural network (ANN) algorithm based on unsupervised learning. It is primarily used for the visualization of nonlinear relations in multidimensional data. It has been researched extensively, with applications ranging from full-text and financial data analysis to pattern recognition, image analysis and process monitoring [9]. The SOM is able to project high-dimensional data onto a lower dimension, typically 2D. This nonlinear projection produces a 2D pattern map that can be useful in analyzing and discovering patterns in the input space. Consequently, the SOM can be used to identify clusters of similar inputs. New inputs can then be analyzed and projected onto the trained map, and assigned to an existing cluster. The artificial model of the neurons and the network is presented in Figure 1.

Fig. 1. The self-organizing map is a single-layer feedforward network where the output neurons are arranged in a low-dimensional (usually 2D or 3D) grid. Each input x is connected to all output neurons. Attached to every neuron there is a weight vector W_k with the same dimensionality as the input vectors.

The network consists of a regular 2D grid of neurons, where the position of each neuron is fixed, and each neuron is identified by its location and its weight vector. These weights represent the neural connection strengths found in biological systems, and each neuron is laterally connected to a subset of its neighbors. The SOM algorithm is based on the competitive-learning concept, where the neurons gradually become sensitive to different inputs in an n-dimensional input space. When an input pattern from the training set is presented to the network, a metric distance is computed for all weight vectors. The neuron with the most similar weight vector to the input pattern is called the Best Matching Unit (BMU). The weights of the BMU and its neighbors are then adjusted towards the input pattern. The magnitude of the changes decreases with time and is smaller for neurons far away from the BMU. Maps produced by the SOM algorithm are characterized by the fact that weight vectors which are neighbors in the input space are mapped onto neighboring neurons. If the dimensionalities of the input space and the network differ, it is impossible to preserve all similarity relationships among weight vectors in the input space; only the most important similarity relationships are preserved and mapped onto neighborhood relationships on the network of neurons, while the less important ones are not retained in the mapping. If the input space and the network are of the same dimensionality, the SOM algorithm can preserve all the similarity relationships and generates a distorted, but topographic, map of the input space, where more important regions of the input space are represented with higher resolution [10].

2.1 The Classical SOM Algorithm (Sequential or On-Line)

We assume a set of input vectors x ∈ R^n, in an n-dimensional space, and a weight vector W_k ∈ R^n for each neuron k in a regular 2D grid of K neurons. A discrete temporal index t exists, such that a vector x(t) is presented to the network at time t, and W_k(t) is the weight vector calculated for that instant. Input patterns are recycled during training; a single pass over the training set is called an epoch. The initial values for the weight vectors can be set randomly or using K different input patterns. In this method the reference vectors are updated immediately upon the presentation of an input pattern. The distance d from this input to the weight vector of the k-th neuron is generally calculated using the Euclidean metric:

    d_k(t) = ||x(t) - W_k(t)||    (1)


Then, the BMU, identified by the subscript c, is elected:

    d_c(t) = min_k d_k(t)    (2)

If multiple minima occur, then one can be selected randomly as the winner. Next, the weight vectors are updated using the following equation:

    W_k(t+1) = W_k(t) + α(t) h_ck(t) [x(t) - W_k(t)]    (3)

where 0 < α(t) < 1 is the learning-rate factor, which decreases monotonically with time. The scalar multiplier h_ck(t) is called the neighborhood function and acts like a smoothing kernel over the grid. The Gaussian is often taken for the neighborhood function:

    h_ck(t) = exp(-||r_k - r_c||^2 / σ(t)^2)    (4)

where r_k and r_c designate, respectively, the spatial coordinates of neurons k and c. The width σ(t) of the neighborhood function also decreases monotonically over time, from a value no less than half the largest diagonal of the map to a value equivalent to the width of a single cell. For computational reasons, h_ck(t) is truncated when ||r_k - r_c|| exceeds a certain limit. This process produces the SOM topology, and upon convergence the weight vectors approximate a probability distribution function of the input space. Thus, the sequential SOM algorithm can be formalized as follows:

1. Find the BMU for an input vector x(t), using equations (1) and (2)
2. Update the weight vectors of the neurons with equations (3) and (4)
3. Repeat from step 1 until some convergence criterion is met, with a decreasing neighborhood kernel.
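For concreteness, the following NumPy sketch implements one epoch of the sequential algorithm as defined by equations (1)-(4). It is only an illustration of the equations above, not the authors' implementation; the array shapes and the truncation radius are our own assumptions.

import numpy as np

def sequential_som_epoch(X, W, coords, alpha, sigma, trunc=3.0):
    """One on-line epoch. X: (N, n) input vectors; W: (K, n) weight vectors;
    coords: (K, 2) fixed grid positions of the K neurons."""
    for x in X:
        d = np.linalg.norm(x - W, axis=1)              # eq. (1): distance to every neuron
        c = np.argmin(d)                               # eq. (2): the BMU
        sq = np.sum((coords - coords[c]) ** 2, axis=1)
        h = np.exp(-sq / sigma ** 2)                   # eq. (4): Gaussian neighborhood
        h[sq > (trunc * sigma) ** 2] = 0.0             # truncate h_ck far from the BMU
        W += alpha * h[:, None] * (x - W)              # eq. (3): immediate update
    return W

Note how W changes inside the loop: each input sees the weights left by the previous one, which is precisely the order dependency that the Batch variant below removes.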

2.2 The Batch SOM Algorithm (Off-Line)

In the Batch SOM algorithm [11] the updates are deferred to the end of a learning epoch (i.e. the presentation of the whole training set) and the new weights are computed using:

    W_k(t_f) = Σ_{t'=t_0}^{t_f} h_ck(t') x(t') / Σ_{t'=t_0}^{t_f} h_ck(t')    (5)

where t_0 and t_f stand, respectively, for the beginning and end of the current epoch. The so-called "winner" is found with the following equations:

    d_k(t) = ||x(t) - W_k(t_0)||    (6)

    d_c(t) = min_k d_k(t)    (7)


where W_k(t_0) is the neuron weight vector calculated at the end of the previous epoch. The neighborhood function remains as presented in (4). The learning-rate factor is not present in the Batch method, thus eliminating a potential source of poor convergence. The Batch SOM algorithm can be stated as follows:

1. For each epoch:
   (a) Find the BMU for each input vector x(t), using (6) and (7)
   (b) Accumulate the numerator and denominator of equation (5) for all neurons
2. Update the neuron weights with equation (5)
3. Repeat from step 1 until some convergence criterion is met, with a decreasing neighborhood kernel.

The Batch method can only be applied when the whole set of input data is available, and it offers several advantages over the Sequential method. Since the weights are not updated immediately, there is no dependency on the order in which input vectors are presented to the network. This also eliminates the concern that the last input vectors presented unduly influence the final result. Similarities between the Batch SOM algorithm, K-Means clustering and LBG have been discussed in [12–16], and comparisons between the Sequential and Batch algorithms have been made in [17, 18].
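The accumulate-then-update structure of these steps can be sketched in NumPy as follows. This is a single-node illustration of equations (5)-(7) under our own naming conventions, not the authors' code.

import numpy as np

def batch_accumulate(X, W, coords, sigma):
    """Pass over the inputs X with the weights W frozen (eqs. (6)-(7)),
    accumulating the numerator and denominator of eq. (5)."""
    num, den = np.zeros_like(W), np.zeros(len(W))
    for x in X:
        c = np.argmin(np.linalg.norm(x - W, axis=1))   # BMU for x, eqs. (6)-(7)
        h = np.exp(-np.sum((coords - coords[c]) ** 2, axis=1) / sigma ** 2)
        num += h[:, None] * x
        den += h
    return num, den

def batch_update(W, num, den):
    """Eq. (5): one deferred weight update at the end of the epoch."""
    hit = den > 0                                      # neurons with any support
    W[hit] = num[hit] / den[hit][:, None]
    return W

Splitting the accumulation from the update is deliberate: it is exactly this separation that makes the data-partition parallelization discussed in Section 3 possible.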

3 Parallelizing the SOM

We must evaluate which approaches are best suited for parallelizing the SOM algorithm. Both the Sequential and the Batch algorithm have been parallelized, and each approach has its advantages and disadvantages.

3.1 Overview of Common Methods

The Sequential algorithm is commonly used with the Network-Partition approach. With it, each processing node has to deal with every input pattern, using only the part of the map that has been assigned to it. Since the sub-maps are smaller, the local search is done much faster. The main advantage of this approach is that it preserves the original weight-update method (3), and is thus able to reproduce exactly the results of the Sequential SOM algorithm. However, it requires all nodes to communicate with each other at every iteration when finding the BMU for the current input pattern. This frequent communication limits scalability and introduces a constant latency in the processing of each input. Several implementations have been devised using this approach on different architectures [19–23]. Using the Data-Partition approach with the Sequential algorithm offers no clear advantage, because the processing nodes still need to communicate at every iteration, not to elect the BMU for an input pattern, but to update the state of the map. Furthermore, since the resulting map depends on the order in which the input patterns are presented to the network, this approach would yield
different results with different data segmentations. The Batch algorithm, on the other hand, is particularly well suited to the data-partitioning approach, allowing greater scalability, since parallel granularity is determined by the volume of data, which is potentially large. The method partitions the set of input patterns equally among the nodes, and each node trains a complete copy of the map, needing to communicate only at the end of each epoch to update the map [24, 25, 17] (a sketch of this combination step closes this subsection). The Batch Data-Partition method offers a considerable speed-up of the learning process over the Sequential Network-Partition method, due to less inter-node communication. Some authors point out that the quality of the clustering produced by the Batch algorithm is inferior to that of the Sequential algorithm [18], but with a k-Batch modification, which forces the updates to be made k times during each epoch, the results can be very similar [17]. Our own results, presented below, show that there are differences between the two resulting maps (Batch data-partition vs. Hybrid method), but they are minimal and the map topology is preserved; the information that can be extracted from the Hybrid map is the same as from the Batch map. We consider these minor differences acceptable, given the computing time saved in the training process, which makes the Hybrid method the best solution for time-critical analysis.
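Because the numerator and denominator of equation (5) are plain sums over the input vectors, per-node partial sums over disjoint data subsets can simply be added by a coordinator. The sketch below simulates P nodes sequentially, reusing batch_accumulate and batch_update from Section 2.2 (X, W, coords, sigma and P are placeholder names); in a real deployment each call would run on a different machine and the sums would be performed by a reduction step.

import numpy as np

# One simulated Batch data-partition epoch over P disjoint data subsets.
partials = [batch_accumulate(Xp, W, coords, sigma)
            for Xp in np.array_split(X, P)]    # one slice per "node"
num = sum(p[0] for p in partials)              # coordinator adds the partial sums
den = sum(p[1] for p in partials)
W = batch_update(W, num, den)                  # identical to the single-node result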

3.2 Proposed Method and Architecture

Based on the fact that a global topological ordering of the map is achieved in a short period of time [26, 27], the proposed method obtains this ordering during the initial epochs using the Batch Data-Partition algorithm. After this stage of global ordering, the BMUs for the input patterns are well distributed over the map. At this point we proceed to the map segmentation, by calculating the input data histogram over the map. This histogram contains information about which input patterns each neuron represents best. Based on this histogram, the map is segmented into P regions over P processing nodes, and the respective input patterns are redistributed equally. From then on, until the next segmentation, each node processes only its subset of samples in its region of the map, although it maintains a full copy of the map for neighborhood updating. Consequently, the learning process is accelerated until the end of training. This method is based on the assumption that, after the initial global ordering of the map, the location of the BMU for each input pattern is contained in a small area of the map, and generally in the region assigned to the corresponding node. Any error made in calculating the winner neuron for a specific input pattern (its true BMU may lie in a contiguous region) is attenuated in the next segmentation stage, which repeats the described process. Since the neighborhood function shrinks at each epoch, updates to contiguous regions become less and less frequent. As a result, the trained map cannot be exactly equal to the one obtained with the Data-Partition method, but results show that they are extremely similar and that the information that can be extracted from them is preserved, justifying the method's speed-up for large-scale domain applications. The implementation uses a master/slave architecture where the coordinator node is responsible for partitioning the map and the input vectors, combining information received from the processing nodes
and maintaining the state of the trained map. The implementation runs on any personal-computer network and on most operating systems, since it was developed in the Java language, and it is able to discover and use any processing nodes available in the network. A small visualization package was also developed for the system. The pseudo-code of the algorithm is presented in Figure 2; an illustrative sketch of the segmentation step follows the figure.

Coordinator: initializes weight vectors;
             partitions set of input vectors;
             sends to each node p the full set of input vectors and the sample indexes of its subset
t = 0
for epoch = 1 to N_epochs do
    Coordinator calculates new σ
    if segmentation epoch then
        each node calculates its sample histogram and sends it to the Coordinator
        Coordinator segments map and input vectors
        Coordinator sends updated map, new sample indexes and map Region to each node p
    else
        Coordinator sends updated map and σ to each node p
    end if
    each node p resets the numerator and denominator of eq. (5)
    for input = 1 to N_inputs ∈ p do
        t = t + 1
        for k = 1 to K ∈ Region_p do
            calculate d_k using eq. (6)
        end for
        calculate the BMU for input using eq. (7)
        for k = 1 to K do
            accumulate numerator and denominator of eq. (5)
        end for
    end for
    each node p sends its partial contributions to the Coordinator
    for k = 1 to K ∈ Coordinator do
        update W_k using eq. (5)
    end for
end for

Fig. 2. Pseudo-code for the proposed parallel hybrid method.
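The paper does not spell out the segmentation heuristic, so the sketch below is only one plausible reading of the histogram step in Figure 2: it finds each sample's BMU, builds the input-data histogram over the map, and cuts the map into P vertical strips carrying roughly equal numbers of samples. The strip-based split and all names are our assumptions.

import numpy as np

def segment_map(X, W, coords, P, grid_w):
    """Histogram-based segmentation: assign map columns to P regions so that
    each region's neurons are the BMU for a roughly equal share of the samples."""
    bmu = np.array([np.argmin(np.linalg.norm(x - W, axis=1)) for x in X])
    hist = np.bincount(bmu, minlength=len(W))        # input-data histogram over the map
    col_load = hist.reshape(-1, grid_w).sum(axis=0)  # samples mapped to each grid column
    targets = col_load.sum() * np.arange(1, P) / P   # balanced cumulative-load targets
    cuts = np.searchsorted(np.cumsum(col_load), targets)
    region = np.searchsorted(cuts, coords[:, 0], side="right")  # region id per neuron
    samples = [np.where(region[bmu] == p)[0] for p in range(P)]
    return region, samples    # map region of every neuron, per-node sample indexes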

4 Experimental Results

The results presented in this section were obtained using the Adult Database dataset [28], consisting of a total of 30162 records, each with 15 attributes. The dataset was pre-processed: binary expansion was applied to the 9 categorical attributes, resulting in a total of 106 attributes, and the numeric attributes were normalized to the range [0, 1]. In these preliminary results, only 6054 random records (20%) were used. The tests were run on a cluster of 3 machines with AMD Athlon 64 X2 processors and 2 GB of memory, linked by a 100 Mbps interface and running the GNU/Linux operating system.
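As an illustration of this preprocessing step, the following pandas sketch performs the binary (one-hot) expansion of the categorical attributes and rescales the numeric ones to [0, 1]. The file path and the assumption that categorical columns load as strings are hypothetical, not taken from the paper.

import numpy as np
import pandas as pd

df = pd.read_csv("adult.data", header=None, skipinitialspace=True)  # hypothetical path
cat = pd.get_dummies(df.select_dtypes(include="object"))   # binary expansion of categoricals
num = df.select_dtypes(include="number")
num = (num - num.min()) / (num.max() - num.min())          # rescale numerics to [0, 1]
X = np.hstack([num.to_numpy(float), cat.to_numpy(float)])  # records x expanded attributes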

4.1 Parameters Used

The parameters that affect the convergence of the Batch map are the σ value, which determines the width of a neuron's neighborhood, and the training duration, i.e. the number of epochs. In all cases, the maps were trained for 25 epochs. We used a descending exponential function for σ, described in Equation (8), where the initial (σ_0) and final (σ_f) values are chosen such that the initial neighborhood covers the whole map and is reduced to a single neuron by the end of training:

    σ(n_e) = σ_0 (σ_f / σ_0)^(n_e / N_e)    (8)

with

    σ_0 = √K,    σ_f = 0.2

where n_e is the current epoch and N_e the total number of epochs. In the tests made with the proposed method, the global ordering phase lasted 5 epochs, after which segmentation occurred every 5 epochs.
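Equation (8) translates directly into code; a minimal sketch (the function name is ours):

import numpy as np

def sigma_schedule(n_e, N_e, K, sigma_f=0.2):
    """Eq. (8): exponential decay of the neighborhood width from
    sigma_0 = sqrt(K), covering the whole map, down to sigma_f."""
    sigma_0 = np.sqrt(K)
    return sigma_0 * (sigma_f / sigma_0) ** (n_e / N_e)

For the 16 x 16 maps of Section 4.2 (K = 256), σ thus decays from 16 towards 0.2 over the 25 epochs.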

4.2 Local Batch vs. Hybrid Method

Here we show that the trained maps obtained with the Batch algorithm and with the proposed method are very similar, preserving the topological information. This can be verified by visually inspecting the Component Planes of both maps. The Component Planes allow the visualization of individual attributes of the weight vectors on a color scale: a darker color corresponds to a higher value of the attribute, while a lighter color means a lower value. By comparing several component planes simultaneously, we can extract relationships between attributes, and an overall interpretation of the trained map can be made. Figure 3 shows that the proposed hybrid method preserves the topological information when compared to the original Batch algorithm running on a single machine, exhibiting only minor differences. These differences are due to the fact that some input vectors may have their BMU in a contiguous region of the segmented map during training. This can easily be solved by overlapping the regions' frontiers while segmenting the map, since these cases occur mainly in those parts of the map. As a result, the presented method can be used safely when a large map needs to be trained promptly and exact reproduction of the Batch map is not critical. Also, the speed-up obtained by our method on a map with 6400 neurons, compared to the Batch algorithm running locally, was 3.34.
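Component planes are straightforward to render from a trained weight matrix. The helper below is our own matplotlib sketch, not part of the paper's visualization package; it reshapes each weight component onto the map grid with a reversed gray scale so that darker cells mean higher values, as in Figure 3.

import numpy as np
import matplotlib.pyplot as plt

def plot_component_planes(W, grid_h, grid_w, attrs):
    """attrs: list of (attribute name, weight-vector column index) pairs."""
    fig, axes = plt.subplots(1, len(attrs), figsize=(3 * len(attrs), 3))
    for ax, (name, j) in zip(np.atleast_1d(axes), attrs):
        ax.imshow(W[:, j].reshape(grid_h, grid_w), cmap="gray_r")  # darker = higher
        ax.set_title(name)
        ax.set_axis_off()
    plt.show()

# e.g. plot_component_planes(W, 16, 16, [("age", 0), ("hours-per-week", 5)])
# (the column indexes here are hypothetical)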

4.3 Data-Partition Method vs. Proposed Hybrid Method

We ran experiments to compare the speed-up of the proposed method against the original Batch data-partition method. The results are shown in Figure 4; the average speed-up is ≈ 1.27, meaning that this method is ≈ 27% faster than the data-partition method.


Fig. 3. Comparison of Component Planes obtained with the local Batch algorithm (left) and the proposed method with 4 slave nodes (right). It can be seen that the topological information is preserved. Both maps, with a dimension of 16 × 16, had the same initialization, using the first K samples from the set of input vectors.

Fig. 4. Execution times for the Data-Partition method and the Hybrid method. The average speed-up is ≈ 1.27.

5 Discussion and Conclusion

We have presented a method for parallelizing the SOM that is particularly effective when dealing with large maps. This method is named Hybrid in the sense that it combines the advantages of both the network-partition and the data-partition methods. The preliminary results are promising: we achieved an average speed-up of 1.27 compared to the original Batch data-partition method, while maintaining the topological information of the maps. This makes the proposed method well suited for training large maps in time-critical situations. The current implementation has the advantage of running on a common computer network and on top of most operating systems. In the immediate future, we intend to further optimize the implementation, in an attempt to reproduce exactly the results of the Batch algorithm. This can possibly be done by overlapping the frontiers of the contiguous map regions that each node processes. By doing so, we minimize the probability that the BMU for an input vector processed by one node lies in the map region of another node. The amount of overlap will be proportional to the width of the neighborhood kernel at the time of segmentation.

References

1. Kohonen, T.: Self-Organizing Maps. Springer-Verlag New York, Inc., Secaucus, NJ, USA (2001)
2. Vesanto, J.: Using SOM in data mining (2000)
3. Vesanto, J.: SOM-based data visualization methods. Intelligent Data Analysis 3 (1999) 111–126
4. Flexer, A.: On the use of self-organizing maps for clustering and visualization. In: Principles of Data Mining and Knowledge Discovery (1999) 80–88
5. Himberg, J., Ahola, J., Alhoniemi, E., Vesanto, J., Simula, O.: The self-organizing map as a tool in knowledge engineering
6. Ultsch, A.: Data mining and knowledge discovery with emergent self-organizing feature maps for multivariate time series (1999)
7. Ultsch, A., Mörchen, F.: ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM (2005)
8. Ultsch, A., Herrmann, L.: The architecture of emergent self-organizing maps to reduce projection errors. In: Verleysen, M. (ed.): Proceedings of the European Symposium on Artificial Neural Networks (ESANN 2005) (2005) 1–6
9. Oja, M., Kaski, S., Kohonen, T.: Bibliography of self-organizing map (SOM) papers: 1998–2001 addendum
10. Erwin, E., Obermayer, K., Schulten, K.: Self-organizing maps: ordering, convergence properties and energy functions. Biological Cybernetics 67(1) (1992) 47–55
11. Kohonen, T.: Things you haven't heard about the Self-Organizing Map. In: Proc. ICNN'93, International Conference on Neural Networks. IEEE Service Center, Piscataway, NJ (1993) 1147–1156
12. Goddard, J., Martinez, A.E., Martinez, F.M., Aljama, T.: A comparison of different clustering algorithms for speech recognition. In: Proceedings of the 43rd IEEE Midwest Symposium on Circuits and Systems, Vol. 3. IEEE, Piscataway, NJ, USA (2000) 1222–1225
13. Kiang, M.Y., Kumar, A.: An evaluation of self-organizing map networks as a robust alternative to factor analysis in data mining applications. Information Systems Research 12 (2001) 177–194
14. Lee, H.S., Younan, N.H.: Investigation into unsupervised clustering techniques. In: Conference Proceedings, IEEE SOUTHEASTCON. IEEE, Piscataway, NJ (2000) 124–130
15. Li, J., Manikopoulos, C.N.: Multi-stage vector quantization based on the self-organization feature maps. Visual Communications and Image Processing IV 1199 (1989) 1046–1055
16. McAuliffe, J.D., Atlas, L.E., Rivera, C.: A comparison of the LBG algorithm and Kohonen neural network paradigm for image vector quantization. In: Proc. ICASSP-90, International Conference on Acoustics, Speech and Signal Processing, Vol. IV. IEEE Service Center, Piscataway, NJ (1990) 2293–2296
17. Nöcker, M., Mörchen, F., Ultsch, A.: An algorithm for fast and reliable ESOM learning. In: ESANN (2006) 131–136
18. Fort, J.C., Letrémy, P., Cottrell, M.: Advantages and drawbacks of the batch Kohonen algorithm. In: 10th European Symposium on Artificial Neural Networks (ESANN 2002), Proceedings (2002) 223–230
19. Togneri, R., Attikiouzel, Y.: Parallel implementation of the Kohonen algorithm on transputer. In: Proc. IJCNN-91, International Joint Conference on Neural Networks, Singapore, Vol. II. IEEE Computer Society Press, Los Alamitos, CA (1991) 1717–1722
20. Mann, R., Haykin, S.: A parallel implementation of Kohonen's feature maps on the Warp systolic computer. In: Proc. IJCNN-90, International Joint Conference on Neural Networks, Washington, DC, Vol. II. Lawrence Erlbaum, Hillsdale, NJ (1990) 84–87
21. Manohar, M., Tilton, J.C.: Progressive vector quantization on a massively parallel SIMD machine with application to multispectral image data. IEEE Transactions on Image Processing 5(1) (1996) 142–147
22. Guan, H., Li, C.K., Cheung, T.Y., Yu, S.: Parallel design and implementation of SOM neural computing model in PVM environment of a distributed system. In: Proceedings of Advances in Parallel and Distributed Computing. IEEE Computer Society Press, Los Alamitos, CA, USA (1997) 26–31
23. Bandeira, N., Lobo, V.J., Moura-Pires, F.: Training a self-organizing map distributed on a PVM network. In: 1998 IEEE International Joint Conference on Neural Networks Proceedings, IEEE World Congress on Computational Intelligence, Vol. 1. IEEE, New York, NY, USA (1998) 457–461
24. Openshaw, S., Turton, I.: A parallel Kohonen algorithm for the classification of large spatial datasets. Computers & Geosciences 22(9) (1996) 1019–1026
25. Lawrence, R.D., Almasi, G.S., Rushmeier, H.E.: A scalable parallel algorithm for self-organizing maps with applications to sparse data mining problems. Data Mining and Knowledge Discovery 3(2) (1999) 171–195
26. Kohonen, T., et al.: SOM_PAK - the Self-Organizing Map program package (1995)
27. Ming, H.Y., Ahuja, N.: A data partition method for parallel self-organizing map. In: IJCNN'99, International Joint Conference on Neural Networks, Proceedings, Vol. 3. IEEE Service Center, Piscataway, NJ (1999) 1929–1933
28. UCI Machine Learning Repository: Adult Database
