A SOM-Based Classifier with Enhanced Structure Learning

Christos Pateritsas, Minas Pertselakis, Andreas Stafylopatis
School of Electrical and Computer Engineering
National Technical Univ. of Athens, Athens, Greece

Abstract - This paper introduces an innovative synergistic model that aims to improve the efficiency of a neuro-fuzzy classifier, providing the means of on-line adaptation and fast learning. It combines the advantages of a self-organized map (SOM) network with the benefits of a structure allocation fuzzy neural network. The system initializes its parameters using the clustering result on the SOM structure, while a novel approach of evaluating the input features leads to a more efficient way of handling the on-line learning rate of the training process. Experimental results on benchmark classification problems showed that this robust combination can also tackle tasks of great dimensionality in a successful manner.

Keywords: Self-organized map, feature evaluation, neuro-fuzzy, classification, structure-learning.

1 Introduction

Self-organized Maps (SOMs) employ an unsupervised learning algorithm that achieves dimensionality reduction by compressing the data, via training, to a reasonable number of units (neurons) [14]. At the same time, they manage to preserve the topology of the data space and, thus, SOM networks are considered capable of handling problems of large dimensionality. The map consists of a grid of units that contain all the significant information of the data set, while eliminating possible noise, outliers or data faults. Adjacent units on the map structure correspond to similar data patterns, which allows regions of interest to be identified through various clustering techniques. Numerous examples of such techniques have been proposed in the literature, with the most common methods involving either the fuzzy c-means algorithm or hierarchical clustering algorithms [3], [5]. Applications of SOM clustering include, but are not limited to, feature extraction or feature evaluation from the trained map [1], [2] and data mining [15], [8]. However, most of the aforementioned methods deal only with the original SOM scheme and not with the structures that can be extracted from each input feature separately. These structures are known as component planes [6] and carry information that the original map does not. Thus, if processed carefully, they can produce a more detailed description of the problem and enhance the performance of a network.

Integrated neural fuzzy networks, on the other hand, usually present low efficiency and performance when the initialization of their free parameters is inaccurate, especially in highly non-linear problems. In addition, when the dimensionality of the problem is large, standard neuro-fuzzy systems perform poorly and require too many rules, which often leads to an increased computational cost.

Considering the training process of a network, it has been shown that the proper manipulation of the learning rate has a significant effect on network performance, since it is closely related to the rate of convergence. Hence, a number of different methods for its adaptation have been proposed in the literature [11].

Embedding data-driven knowledge has been shown to improve the performance of the network and to lead to faster learning and better generalization, while requiring a relatively small training set size [7]. When such knowledge is extracted from numerical data, a common approach is to use either clustering or partitioning methods to derive the rules. The selection of the number of rules, though, leads to a static structure of the network. Thus, the performance of a fuzzy neural network in non-stationary environments, where on-line training and adaptation are critical issues, is usually based on heuristic techniques.

Resource Allocating Network (RAN) architectures [4] were found to be suitable for on-line modeling of non-stationary processes. Using a sequential learning method, the network initially contains no hidden nodes. During the training process, though, new hidden nodes are added to the network according to the training performance. The idea of implementing a modified RAN structure in a fuzzy neural inference system was introduced in [10]. Following the same concept, we incorporate the SOM properties into the system, improving its learning abilities. The system is first initialized using the results of the clustering on the self-organized map. Then, the network either adjusts its existing free parameters using least mean square gradient descent with a dynamic learning rate, or increases its number of rules, based on two criteria. The first criterion depends on the prediction error, while the second states that the distance between the rule in question and the winning rule should be greater than a threshold. If both criteria are satisfied, then the data is memorized and a new hidden node is added to the network. This robust combination adequately addresses the low efficiency of fixed neural fuzzy networks and facilitates fast learning even in tasks of great dimensionality and complexity.

2 The Proposed Methodology

To address the above issues, the proposed model incorporates data-driven knowledge and offers an accurate initialization by using a SOM clustering method, thus exploiting the SOM's ability to handle high-dimensional problems. Consequently, the initial number of rules is not based on heuristic techniques. On-line adaptation is accomplished by employing a Resource Allocating Network (RAN) architecture [4]. This ensures a non-static structure that proves to be crucial in non-stationary environments. Furthermore, we enhance the effectiveness of the learning rate by extracting a degree of significance of each input feature with respect to each initial cluster. Our approach includes the following sequential steps:

1. Train the SOM with the whole dataset
2. Cluster the trained SOM
3. Feature evaluation
   a. Repeat the clustering for each input feature
   b. Match the results with the original map
4. Initialize the neuro-fuzzy model
5. Supervised learning
   a. Derive the learning rate from the feature significance
   b. On-line resource allocation

2.1 SOM Clustering

The identification of clusters in the data set is done in two phases. First, the SOM network is trained and then an agglomerative single-linkage clustering algorithm is applied to the vectors representing the units of the trained map, in order to find clusters containing more than one map unit. In the beginning of the second phase every cluster consists of only one unit of the trained map. In each step of this phase, the two clusters with the smallest distance d_kl are merged. Distance d_kl is defined as the smallest value of the individual distances between cluster k map units and cluster l map units:

d_{kl} = \min_{i=1..N_k,\; j=1..N_l} \{ \| m_i - m_j \| \}    (1)

where m_i is the vector of the i-th unit of cluster k and m_j the vector of the j-th unit of cluster l, and N_k (N_l) denotes the total number of units in cluster k (l). This distance d_kl must not be significantly larger than the Nearest Neighbor Inner Cluster (NNIC) distance of each of the two clusters to be merged. The NNIC_k distance of cluster k is given by:

NNIC_k = \frac{1}{N_k} \sum_{i=1}^{N_k} \min_{j=1..N_k,\; j \neq i} \{ \| m_i - m_j \| \}    (2)

where m_i and m_j are the vectors of units i, j that belong to the same cluster k. In order to exploit the SOM topology, only clusters containing adjacent map units can be merged. Moreover, units that do not win any data patterns, or win only a very small number of them (interpolating units), are excluded from the second-phase clustering. This allows the formation of borders on the map grid, which separate the areas whose units must be grouped to form the final clusters. The optimal number of clusters is estimated using a clustering validity measure that combines the RMSSTD (Root Mean Square Standard Deviation) and RS (R-Squared) indices [13].
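As an illustration, the two distances and the merge test can be written as in the sketch below. This is only a sketch, assuming clusters are given as arrays of map-unit vectors; the tolerance factor is an assumption, since the paper only requires d_kl not to be significantly larger than the NNIC distances and additionally restricts merges to clusters that are adjacent on the map.

```python
import numpy as np

def inter_cluster_distance(units_k, units_l):
    """Single-linkage distance d_kl (Eq. 1): smallest pairwise distance between
    the unit vectors of cluster k and those of cluster l."""
    dists = np.linalg.norm(units_k[:, None, :] - units_l[None, :, :], axis=-1)
    return dists.min()

def nnic_distance(units_k):
    """NNIC_k (Eq. 2): average, over the units of cluster k, of the distance
    to the nearest other unit of the same cluster."""
    dists = np.linalg.norm(units_k[:, None, :] - units_k[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # exclude the i == j terms
    return dists.min(axis=1).mean()

def may_merge(units_k, units_l, factor=1.5):
    """Illustrative merge test: d_kl must not exceed `factor` times the NNIC
    distance of either cluster (the value of `factor` is an assumption)."""
    d_kl = inter_cluster_distance(units_k, units_l)
    nnics = [nnic_distance(u) for u in (units_k, units_l) if len(u) > 1]
    if not nnics:                            # both clusters are single units
        return True
    return all(d_kl <= factor * n for n in nnics)
```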

2.2 Feature Evaluation and Similarity Factor

The self-organized map and the clustering procedure are not only used for the pre-clustering of the data, but are also exploited so that significant information can be extracted from each one of the input features. This is achieved by repeating the same clustering technique with a feature-weighting scheme, once for every input feature. In each repetition, one of the n input features is favored with respect to the others by a multiplier factor, so that the clustering procedure is driven mostly by that feature. The value of this factor is not fixed, but it should be high enough to make the desired feature stand out. This procedure results in the creation of n additional partitions of the data. The next step involves the matching of each one of these partitions with the primary cluster structure. In this step we calculate the degree of similarity between the clusters created from each one of the additional partitions and the clusters of the primary partition. The purpose is to estimate how important an input feature is for the creation of a cluster in the primary partition. For example, if a cluster from one of the additional partitions matches exactly a cluster from the original partition, then the input feature that was highly weighted in this additional partition is considered to be important for the creation of the corresponding cluster in the primary partition. It should be noted that the number of clusters in each partition may vary, since the agglomerative clustering procedure is applied on a different dataset each time due to the multiplier factor. The matching of the clusters is realized by determining the common units between each cluster of the original SOM and the clusters of each one of the n additional partitions. A similarity factor (SF) is calculated for each pair according to the following expression:

SF_{jk}^{i} = \frac{card(C_j \cap C_k^i)}{card(C_j \cup C_k^i)}    (3)

where C_j is the set of units of cluster j belonging to the partition of the original map, C_k^i is the set of units of cluster k of the i-th additional partition, and card denotes set cardinality.

Figure 1. Example of matching between features and original map.

Finally, we choose the Maximum Similarity Factor (MSF_ij) for each cluster j of the original partition in relation to each feature i. This is used as a degree of significance of each feature with respect to each initial cluster.
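A small sketch of this matching step follows, assuming each cluster is represented as a set of map-unit indices; this representation and the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def similarity_factors(primary_clusters, feature_clusters):
    """Similarity factors SF_jk^i (Eq. 3) between every cluster C_j of the
    primary partition and every cluster C_k^i of one feature-weighted partition.
    Clusters are given as Python sets of map-unit indices."""
    sf = np.zeros((len(primary_clusters), len(feature_clusters)))
    for j, cj in enumerate(primary_clusters):
        for k, cki in enumerate(feature_clusters):
            sf[j, k] = len(cj & cki) / len(cj | cki)   # |intersection| / |union|
    return sf

def max_similarity_factors(primary_clusters, partitions_per_feature):
    """MSF_ij: for every feature i and primary cluster j, the best match of
    cluster j among the clusters of the i-th feature-weighted partition."""
    msf = np.zeros((len(partitions_per_feature), len(primary_clusters)))
    for i, feature_clusters in enumerate(partitions_per_feature):
        msf[i] = similarity_factors(primary_clusters, feature_clusters).max(axis=1)
    return msf
```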

2.3 The Neuro-fuzzy Model Architecture

The architecture and functionality of the neuro-fuzzy model are based on the ARANFIS approach [10], which applies the RAN concept to the SuPFuNIS model [12]. The architecture includes three layers: the input layer, the rule layer and the output layer. We consider that the network consists of n inputs, p outputs and q(t) hidden nodes at iteration t, since its structure changes dynamically. Numeric inputs are fuzzified by input nodes, which act as tunable feature fuzzifiers, while network connections, i.e. antecedent and consequent weights, are represented by Gaussian membership functions specified by a center and a spread. A numeric input is fuzzified by treating it as the center x_i^c of a Gaussian membership function with a tunable spread x_i^σ. Rule-based knowledge is translated directly into the network architecture in the form of fuzzy if-then rules that are embedded as hidden nodes. When a new hidden node is inserted, all connections between the inputs and the newly created node are created and subsequently pruned as the learning process evolves. The activation value z_j of rule node j is computed as:

z_j = \prod_{i=1}^{n} E_{ij}    (4)

where E_{ij} represents the mutual subsethood between fuzzy input i and fuzzy weight w_{ij} [12]. The signal of output node k is given by:

y_k(t) = \frac{\sum_{j=1}^{q(t)} z_j \, v_{jk}^{c} \, v_{jk}^{\sigma}}{\sum_{j=1}^{q(t)} z_j \, v_{jk}^{\sigma}}    (5)

where v_{jk}^{c} and v_{jk}^{\sigma} are the center and the spread of the consequent weights respectively, and q(t) is the number of rule nodes.
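A rough sketch of Eqs. (4)-(5) is given below. The closed-form mutual subsethood expressions of SuPFuNIS [12] are replaced here by a grid-based numerical approximation, and the array shapes are assumptions.

```python
import numpy as np

def mutual_subsethood(c1, s1, c2, s2, resolution=2001):
    """Numerical approximation of the mutual subsethood of two Gaussian fuzzy
    sets (center c, spread s): cardinality of the intersection divided by the
    cardinality of the union. SuPFuNIS [12] uses closed-form expressions;
    this grid-based version is only an illustration."""
    lo = min(c1 - 4 * s1, c2 - 4 * s2)
    hi = max(c1 + 4 * s1, c2 + 4 * s2)
    grid = np.linspace(lo, hi, resolution)
    mu1 = np.exp(-0.5 * ((grid - c1) / s1) ** 2)
    mu2 = np.exp(-0.5 * ((grid - c2) / s2) ** 2)
    return np.trapz(np.minimum(mu1, mu2), grid) / np.trapz(np.maximum(mu1, mu2), grid)

def rule_activations(x_c, x_s, w_c, w_s):
    """z_j (Eq. 4): product over the inputs of the subsethood between fuzzified
    input i and antecedent weight w_ij. w_c and w_s have shape (n_inputs, n_rules)."""
    n, q = w_c.shape
    z = np.ones(q)
    for j in range(q):
        for i in range(n):
            z[j] *= mutual_subsethood(x_c[i], x_s[i], w_c[i, j], w_s[i, j])
    return z

def output_signal(z, v_c, v_s):
    """y_k (Eq. 5): defuzzified output over the q rule nodes.
    v_c and v_s have shape (n_rules, n_outputs)."""
    num = (z[:, None] * v_c * v_s).sum(axis=0)
    den = (z[:, None] * v_s).sum(axis=0)
    return num / den
```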

2.4 Initialization of the System

The initialization of the free parameters of the system is accomplished using the results of the original SOM clustering. The centers of the antecedent weights take the values of the centers of the corresponding clusters, while the centers of the consequent weights derive from a supervised learning technique applied to the SOM, which produces the percentage of each class in every cluster. The spreads of the parameters are taken randomly within a pre-determined range of values, in order to preserve the ability of the system to generalize. Learning is incorporated into the system using the gradient descent method. A squared error criterion is used as the training performance measure, which, at iteration t, is computed in the following way:

e(t) = \frac{1}{2} \sum_{k=1}^{p} (d_k(t) - y_k(t))^2    (6)

where dk(t) is the desired output and yk(t) is the defuzzified output at node k. The error is evaluated over all p outputs for each input pattern x(t). The free parameters of the system, meaning both the centers and spreads of antecedent and consequent connections, as well as the spreads of the input features, are modified on the basis of update equations taking the following general form:

u(t+1) = u(t) - \eta(t) \, \beta_{ij} \, \frac{\partial e(t)}{\partial u(t)}    (7)

where η(t) is the on-line computed learning rate and β_ij is a novel parameter for soft competitive learning. The evaluation and the expressions of the partial derivatives can be found in [10].
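A possible initialization routine, following the above description, might look like the sketch below; the spread range, the parameter-dictionary layout and the random generator seed are arbitrary assumptions.

```python
import numpy as np

def initialize_from_som(cluster_centers, class_fractions,
                        spread_range=(0.2, 0.8), rng=np.random.default_rng(0)):
    """Initialization of the free parameters from the SOM clustering result.
    cluster_centers: (q0, n) array of cluster centers in input space.
    class_fractions: (q0, p) array with the fraction of each class in every cluster."""
    q0, n = cluster_centers.shape
    p = class_fractions.shape[1]
    return {
        "w_c": cluster_centers.T.copy(),                   # antecedent centers, shape (n, q0)
        "w_s": rng.uniform(*spread_range, size=(n, q0)),   # antecedent spreads (random)
        "v_c": class_fractions.copy(),                     # consequent centers, shape (q0, p)
        "v_s": rng.uniform(*spread_range, size=(q0, p)),   # consequent spreads (random)
        "x_s": rng.uniform(*spread_range, size=n),         # input feature spreads (random)
    }
```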

2.5 Enhancing Structure Learning

The value of the learning rate η has a significant effect on network performance, since it is closely related to the rate of convergence. It has been shown that the appropriate manipulation of η during the training process can lead to very good results and, hence, a large number of different methods for its adaptation have been proposed in the literature [9], [11]. Our approach involves the parameter β_ij (Eq. 7), which indicates the degree of importance of the input feature x_i with respect to the j-th cluster (rule node). More specifically, we use the maximum similarity factors (MSF) computed at the previous step, represented as a matrix K, where k_ij = MSF_ij. The element-wise inverse of this matrix produces the β values that multiply the learning rate for each rule:

\beta_{ij} = \frac{1}{k_{ij}}    (8)

Therefore, what we accomplish is that "good" weights are allowed to converge quickly, while "bad" weights are slowed down, widening their selection of values. The latter is applied to all input spreads as well.
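In code, the scaling of Eq. (7) by the factors of Eq. (8) could be sketched as follows; the guard against zero similarity factors is an assumption.

```python
import numpy as np

def beta_from_msf(msf, eps=1e-6):
    """Soft-competitive factors of Eq. (8): beta_ij = 1 / k_ij, with k_ij = MSF_ij.
    eps protects against zero similarity factors (an assumption)."""
    return 1.0 / np.maximum(msf, eps)

def update_parameter(u, grad, eta, beta_ij):
    """Gradient-descent step of Eq. (7): the learning rate is scaled by beta_ij,
    so weights tied to less significant features take larger, less settled steps."""
    return u - eta * beta_ij * grad
```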

2.6 Adding a new rule

Training data are inserted into the model in the form of pairs (x(t), d(t)) of the input and the desired output vectors respectively. If a new input x(t) does not significantly activate any rule and the prediction error is large, a new rule is created. In other words, if

|d_k(t) - y_k(t)| > ε = 0.5  (prediction error)  and
max_j{z_j} < δ = 0.5  (rule activation),

then a new hidden node is allocated and the total number of rules is increased. The weight centers of the antecedent part take the crisp values of x(t), while the corresponding spreads are set proportionally to the distance from the nearest existing hidden node to the new node. In this way, new inputs are more likely to match the newly created hidden node. The consequent weight centers are initialized as:

v_{q(t)k}^{c} = d_k(t) - y_k(t),   k = 1, ..., p    (9)

where p is the number of outputs and q(t) denotes the index of the new rule. The weight spreads v_{q(t)k}^{σ} are set randomly to values lying in the range [min_j v_{jk}^{σ}, max_j v_{jk}^{σ}]. Setting more rigorous conditions implies a higher ε and a lower δ.
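A sketch of the rule-growth step is shown below, reusing the parameter dictionary of the earlier initialization sketch; the interpretation of the error criterion as a maximum over the outputs and the proportionality constant for the new antecedent spreads are assumptions.

```python
import numpy as np

def maybe_add_rule(x, d, y, z, params, eps=0.5, delta=0.5):
    """Rule-growth test of Sec. 2.6: if the prediction error is large and no
    existing rule is significantly activated, allocate a new hidden node."""
    if np.max(np.abs(d - y)) <= eps or np.max(z) >= delta:
        return False                                    # keep the current structure

    # Antecedent part: centers take the crisp input values; spreads are set
    # proportionally to the distance from the nearest existing rule center.
    dists = np.linalg.norm(params["w_c"] - x[:, None], axis=0)
    new_w_s = 0.5 * dists.min() * np.ones_like(x)       # the 0.5 factor is an assumption
    params["w_c"] = np.column_stack([params["w_c"], x])
    params["w_s"] = np.column_stack([params["w_s"], new_w_s])

    # Consequent part: centers take the current error (Eq. 9); spreads are drawn
    # randomly within the range of the existing consequent spreads.
    params["v_c"] = np.vstack([params["v_c"], d - y])
    rng = np.random.default_rng()
    low, high = params["v_s"].min(axis=0), params["v_s"].max(axis=0)
    params["v_s"] = np.vstack([params["v_s"], rng.uniform(low, high)])
    return True
```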

3 Experimental Results

We tested our method on two benchmark problems of real data, which are mainly characterized by overlapping clusters and many attributes. Due to the random initialization of the spreads, the number of the produced rules may vary. For our experiments we used 70% of the patterns for training and 30% for testing.

3.1 Ionosphere Data

This radar data was collected by a system in Goose Bay, Labrador, and consists of 351 data records; usually 200 are taken for training and 151 for testing. We used two methods to measure the accuracy: the first one takes 201 patterns for training and 150 for testing, while the other one takes 250 for training and the whole dataset (all 351 patterns) for resubstitution testing. The SOM map produced 9 initial clusters. The results are the average of 10 experiments with the same starting conditions and are shown in Table 1 and Table 2 respectively. It should be noted that these results were produced after 10 epochs; the improvement of the classification accuracy after those epochs was minor. A comparison with other known classifiers [16] is depicted in Table 3.

Table 1. Results for Ionosphere data (201 train - 150 test).
Final Rules     13     15     16
Accuracy (%)    93     94.4   96.67

Table 2. Results for Ionosphere data (251 train - 351 test).
Final Rules     13     15     16
Accuracy (%)    92.3   92     92.6

Table 3. Comparative results for Ionosphere data.
Method               Accuracy (%)
3-NN + simplex       98.7
3-NN                 96.7
Our Method           96.67
1-NN, Manhattan      96.0
MLP+BP               96.0
C4.5                 94.9
SVM                  93.2
FSM + rotation       92.8
Linear Perceptron    90.7
CART                 88.9

Figure 2. Original SOM map and map after clustering of Ionosphere data.

Figure 3. Original SOM map and map after clustering of Image Segmentation data.

3.2 Image Segmentation

The instances of this data set were drawn randomly from a database of 7 outdoor images. The images were hand-segmented to create a classification for every pixel. Each instance corresponds to a 3x3 region, resulting in 19 continuous attributes and 7 classes. For our experiments we used 1610 patterns for training (230 instances per class) and the remaining 700 (100 instances per class) for testing. The data were scaled down to an input interval of [-2, 2]. The SOM map produced 4 initial clusters. The results after 10 and 100 epochs, respectively, are illustrated in Table 4. A comparison with other known classifiers [16] is depicted in Table 5.

Table 4. Results for Image Segmentation data.
Epochs   Final Rules   Accuracy Train (%)   Accuracy Test (%)
10       6             74.22                72.26
100      9             81.8                 74.91

Table 5. Comparative results for Image Segmentation data.
Method       Error (Train)   Error (Test)
Our Method   0.050           0.091
Itrule       0.445           0.045
Descrim      0.112           0.116
CASTLE       0.108           0.112
RBF          0.047           0.069
Kohonen      0.046           0.067
Backprop     0.028           0.054
C4.5         0.013           0.040

4 Conclusions

This paper presents a robust system that initializes its parameters using the results of a clustering procedure on a self-organized map. This approach proves to be cost-efficient, since the feature evaluation procedure is applied to the self-organized map and not to the whole dataset. Moreover, a novel learning procedure that derives from feature evaluation of the same self-organized map is employed successfully. The ability of the network to adapt itself using on-line structure learning and resource allocation proves to be crucial. The experimental results indicate that good performance and fast learning can be combined efficiently. Future work includes different approaches to extracting additional information from the SOM structure, in order to enhance the performance of the neuro-fuzzy system while reducing its computational cost.

References

[1] A. Rauber, "LabelSOM: On the labeling of self-organizing maps," Proceedings of the International Joint Conference on Neural Networks, Washington, DC, 1999.
[2] A. Ultsch and D. Korus, "Integration of neural networks with knowledge-based systems," IEEE Int. Conf. on Neural Networks, Perth, 1995.
[3] F. Hoeppner, F. Klawonn, R. Kruse, T. A. Runkler, Fuzzy Cluster Analysis: Methods for Image Recognition, Classification, and Data Analysis, John Wiley & Sons, Chichester, 1999.
[4] J. Platt, "A resource-allocating network for function interpolation," Neural Computation, vol. 3, no. 2, pp. 213-225, 1991.
[5] J. Vesanto and E. Alhoniemi, "Clustering of the self-organizing map," IEEE Trans. on Neural Networks, vol. 11, pp. 586-600, 2000.
[6] J. Iivarinen, T. Kohonen, J. Kangas, S. Kaski, "Visualizing the clusters on the self-organizing map," Conference on Artificial Intelligence Research in Finland, Helsinki, Finland, pp. 122-126, 1994.
[7] L. M. Fu, "Learning capacity and sample complexity on expert networks," IEEE Trans. on Neural Networks, vol. 7, pp. 1517-1520, 1996.
[8] M. Drobics, U. Bodenhofer, W. Winiwarter, "Data mining using synergies between self-organizing maps and inductive learning of fuzzy rules," Joint 9th IFSA World Congress and 20th NAFIPS Int. Conf., 2001.
[9] M. Moreira and E. Fiesler, "Neural networks with adaptive learning rate and momentum terms," IDIAP, Martigny, Switzerland, Tech. Rep. 95-04, 1995.
[10] M. Pertselakis, N. Tsapatsoulis, S. Kollias, A. Stafylopatis, "An adaptive resource allocating neural fuzzy inference system," Proceedings of IEEE Intelligent Systems Application to Power Systems (ISAP'03), Limnos, Sep. 2003.
[11] M. V. Solodov and B. F. Svaiter, "A comparison of rates of convergence of two inexact proximal point algorithms," in Nonlinear Optimization and Related Topics, G. D. Pillo and F. Giannesi, Eds., Applied Optimization 36, Kluwer Academic Publishers, pp. 415-427, 2000.
[12] S. Paul, S. Kumar, "Subsethood-product fuzzy neural inference system (SuPFuNIS)," IEEE Trans. on Neural Networks, vol. 13, no. 3, pp. 578-599, 2002.
[13] S. C. Sharma, Applied Multivariate Techniques, John Wiley & Sons, 1996.
[14] T. Kohonen, Self-Organizing Maps, Springer Series in Information Sciences, Springer, 2nd edition, 1997.
[15] T. Kohonen, S. Kaski, K. Lagus, J. Salojarvi, J. Honkela, V. Paatero, and A. Saarela, "Self organization of a massive document collection," IEEE Trans. on Neural Networks, vol. 11, pp. 574-585, 2000.
[16] "Datasets used for classification: comparison of results," Nicholaus Copernicus University, Department of Informatics, Torun, Poland. http://www.phys.uni.torun.pl/kmk/projects/datasets.html
