
A Genetic Algorithm Based Approach for Systematic SOM Clustering of Directory Metadata

Lei Li, Vijay Vaishnavi, Fellow, IEEE, and Art Vandenberg, Member, IEEE

Abstract— Directories play an important role in describing resources and enabling information sharing within and among organizations. To communicate effectively, directories must resolve differing structures and vocabularies. This paper proposes a systematic approach to address the interoperability of directories. The approach couples a genetic algorithm with a neural network based clustering algorithm, Self-Organizing Maps (SOM), to systematically cluster directory metadata, highlight similar structures, recognize developing patterns of practice, and ultimately promote homogeneity among the directories. To evaluate the effectiveness of the proposed approach, an experiment on Lightweight Directory Access Protocol (LDAP) directory metadata is conducted. The experimental results show that a genetic algorithm can discover parameter values for a SOM algorithm such that the computer clustering results are comparable to those of domain experts. The proposed approach provides an effective mechanism to systematically cluster directory metadata and promote homogeneity among directories.

Index Terms— Clustering analysis, Genetic Algorithm, LDAP directory, Self-Organizing Maps.

This work is partially supported by NSF ITR Grant IIS-0312636; a sub award to NSF Grant No. ANI-0123937; Sun Microsystems Academic Equipment Grant EDUD 7824-010460-US; Georgia State University Brain and Behavior fellowship program; Georgia State University’s Robinson College of Business; and Georgia State University’s Information Systems & Technology. Lei Li is with the Department of Computer Information Systems at Georgia State University, Atlanta, GA 30303 USA (phone: 404-651-3869; e-mail: [email protected]). Vijay Vaishnavi is with the Department of Computer Information Systems at Georgia State University, Atlanta, GA 30303 USA (e-mail: [email protected]). Art Vandenberg is with Information Systems & Technology at Georgia State University, Atlanta, GA 30329 USA (e-mail: [email protected]).

1-4244-0134-8/06/$20.00 © 2006 IEEE; 1-4244-0133-X/06/$20.00 © 2006 IEEE

I. INTRODUCTION

Directories provide a general mechanism for describing resources and enabling information sharing within and among organizations [1]. The appropriate use of directory services is recognized as a key competitive advantage of organizations [2]. While directory services have focused on sharing information, they have primarily done so within an organization. To communicate effectively, directories must resolve differing structures and vocabularies.

A traditional solution is to have every directory use a “standardized” directory schema. However, it may take a long time for a standards body to produce even a partial schema. For example, the EDUCAUSE/Internet2 eduPerson Task Force [3] took 18 months to adopt 6 attributes of the new “eduPerson” directory object. Moreover, a “static” standard cannot keep pace with the ever-expanding descriptive requirements of the directories.

Another approach to the heterogeneity problem is the use of “boundary objects” that provide mediation [4] [5]. This strategy replicates selected, standard attributes from underlying source directories into a boundary directory, and enables queries against this boundary directory. This approach addresses the problem reactively, after it has arisen, and adds a mediating layer that provides only limited interoperability.

The above approaches to directory interoperability have various limitations: they address the problem after the fact, take too long, or lack flexibility. This paper proposes a systematic approach to address the issue of interoperability of directories. It is argued that a good understanding of directory metadata (schemas) is essential to the interoperation of directories. Proper clustering and visualization of directory metadata can highlight similar structures, recognize developing patterns of practice, and ultimately promote homogeneity among directories. This approach uses Self-Organizing Maps (SOM) [6] as the clustering algorithm and applies a Genetic Algorithm (GA) [7] to systematically search for appropriate SOM parameter values.

The paper is organized as follows. Section II provides a brief discussion of clustering techniques, including SOM algorithms. Section III describes the proposed GA-based approach for discovering SOM parameter values. Section IV presents an empirical study of the performance of the proposed approach. Section V concludes, and Section VI discusses limitations and further work.

II. CLUSTERING ANALYSIS AND SOM CLUSTERING

Clustering analysis is a well-known approach to structuring previously unknown and unclassified datasets [8]. It is particularly suitable for exploring interrelationships among the data points to assess their structure [9].
In this paper, a neural network based Self-Organizing Maps (SOM) algorithm [6] [10] is used as the clustering algorithm. SOM produces a similarity graph (usually a two-dimensional map) of input data by converting the nonlinear relationships among high-dimensional data into simple geometric relationships on a low-dimensional map. Researchers [11] [12] have compared SOM against other clustering algorithms and confirmed its superior clustering performance, especially in visualizing the clustering result.

However, the quality of SOM clustering depends on the appropriateness of its parameter values for a given application domain [13]. There is no theoretical basis for selecting the best SOM parameter values [6], and such selection is usually done ad hoc [14] [15]. Moreover, SOM parameter values can range widely, which results in a huge search space: there may be hundreds of thousands of possible sets of parameter values. Selecting appropriate SOM parameter values by hand is therefore difficult and time consuming.

Since SOM clustering is the core of our approach and the selection of appropriate SOM parameter values is critical to SOM’s clustering performance, it is important to have a systematic mechanism for selecting SOM parameter values. This leads to the research question for this paper: For LDAP directory metadata, can the SOM parameter values that produce clusters comparable to those generated by domain experts be discovered in a systematic way?

III. GENETIC ALGORITHMS SOLUTION FOR SOM PARAMETER VALUES SELECTION

Genetic algorithms make it possible to explore a larger range of potential solutions to a problem than do conventional methods [16] and often yield the globally optimal solution while avoiding combinatorial explosion by disregarding certain parts of the search space [17]. Selection of SOM parameter values is ultimately a search problem. It is thus appropriate to apply a genetic algorithm to find a good set of SOM parameter values. Two important issues need to be addressed before a GA can be applied successfully:

1) Performance Measurement: How will the GA measure the clustering performance?
2) Genetic Encoding: How will the SOM parameters be mapped into genomes?

A. Fitness Measurement for SOM Clustering Result

There are four widely accepted metrics to evaluate computer clustering performance: cluster error (CE), normalized cluster error (NCE), cluster recall (CR) and cluster precision (CP). Their definitions (adapted from [18]) can be found in [22]. The overall clustering performance is measured by the F-measure value [19] [20] [21]. The F-measure provides an overall estimate of the combined effect of CP and CR.
The F-measure formula is expressed as:

\[ F\text{-measure} = \frac{(\mathit{BETA}^2 + 1) \cdot CR \cdot CP}{(\mathit{BETA}^2 \cdot CP) + CR} \]

where BETA is the relative importance of CP vs. CR. The higher the F-measure value, the better the clustering result. We let BETA be equal to 1, so that CP and CR are equally weighted.

B. Genetic Encoding

The process of mapping SOM parameters to a genome in the GA genetic domain is called genetic encoding. The genome used here is composed of 4 SOM parameters: x dimension (xdim), y dimension (ydim), neighborhood size, and final training iterations. Guided by studies in SOM clustering [14] [22] [23], xdim and ydim values range from 2 to 12, neighborhood size ranges from 2 to 10, and final training iterations range from 5,000 to 60,000.
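The genetic encoding described above can be illustrated with a small sketch. The paper does not say whether genes are stored as bits, integers, or reals, so the integer tuple used here, the gene order, and the names `GENE_RANGES`, `SomParams`, `decode`, and `search_space_size` are all assumptions made for illustration. The size count also assumes that every integer iteration count in the range is a legal value, which the paper does not state.

```python
from dataclasses import dataclass
from math import prod
from typing import Tuple

# Inclusive gene ranges quoted in Section III.B.
# The gene order and the integer representation are assumptions.
GENE_RANGES = (
    ("xdim", 2, 12),
    ("ydim", 2, 12),
    ("neighborhood", 2, 10),
    ("iterations", 5_000, 60_000),
)

Genome = Tuple[int, int, int, int]


@dataclass(frozen=True)
class SomParams:
    xdim: int
    ydim: int
    neighborhood: int
    iterations: int


def decode(genome: Genome) -> SomParams:
    """Convert a 4-gene genome into SOM parameters, rejecting out-of-range genes."""
    if len(genome) != len(GENE_RANGES):
        raise ValueError("genome must have exactly 4 genes")
    for gene, (name, lo, hi) in zip(genome, GENE_RANGES):
        if not lo <= gene <= hi:
            raise ValueError(f"{name}={gene} is outside [{lo}, {hi}]")
    return SomParams(*genome)


def search_space_size() -> int:
    """Number of distinct genomes if every integer in each range is allowed."""
    return prod(hi - lo + 1 for _, lo, hi in GENE_RANGES)
```

Under these assumptions, `decode((10, 8, 3, 27968))` yields the parameter set later reported in Table 2, and `search_space_size()` gives an upper bound on how many candidates an exhaustive search would face. A coarser discretization of the iteration count would shrink that bound considerably.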

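Because the GA fitness value is based on the F-measure of Section III.A, a short worked computation may help. The sketch below implements the formula with BETA = 1 as the default and applies it to the per-expert TDS values of CR and CP reported in Table 3. The function name `f_measure` and the zero-denominator convention are assumptions, not part of the paper.

```python
def f_measure(cr: float, cp: float, beta: float = 1.0) -> float:
    """F-measure of Section III.A; returns 0.0 when the denominator is zero."""
    denom = beta ** 2 * cp + cr
    if denom == 0:
        return 0.0  # assumed convention for the degenerate case
    return (beta ** 2 + 1) * cr * cp / denom


# TDS cluster recall and cluster precision per expert, copied from Table 3.
TDS_CR = [0.493, 0.663, 0.497, 0.598, 0.079, 0.174]
TDS_CP = [0.729, 0.422, 0.649, 0.634, 0.781, 0.757]

# One F-measure per expert, then the mean over the six experts.
per_expert_f = [f_measure(cr, cp) for cr, cp in zip(TDS_CR, TDS_CP)]
mean_f = sum(per_expert_f) / len(per_expert_f)
print(round(mean_f, 3))  # prints 0.451
```

The mean of the per-expert values comes out at about 0.451, close to the F value of 0.4512 listed for the GA approach in Table 5. This suggests, although the paper does not say so, that the reported F value is an average of per-expert F-measures rather than the F-measure of the mean CR and CP (which would be about 0.512).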

C. Execution of GA

The GA starts its first generation with a randomly defined initial population of genomes (individuals). The GA decodes each genome (converts it into the corresponding SOM parameter values), uses the values to run the SOM algorithm, and then calculates the fitness function value (based on the F-measure) by comparing the SOM clustering result to the domain experts’ clustering results. When all genomes of a generation have been evaluated, the GA tests whether the termination condition is met. If so, the best genome from the generation is selected and the program ends. Otherwise, the GA performs selection, crossover, or mutation operations to generate a new population. This process of generating new populations, running SOM, calculating the fitness function value, and testing the termination condition is repeated until the termination condition is met.

IV. EXPERIMENT AND RESULTS

To test the effectiveness of the proposed approach, a research prototype was developed and an experiment was conducted on a set of LDAP directory metadata. In prior work [22], a linear search approach was used in an attempt to find a good set of SOM parameter values by trying out 320 predefined combinations of SOM parameter values. This paper proposes a GA-based approach to systematically identify good SOM parameter values, which is expected to produce a better outcome than linear search. Domain experts are used to evaluate computer clustering performance. A computer clustering result that is comparable to the one created by domain experts is considered good clustering. This suggests the following two hypotheses:

H1: For the LDAP directory metadata domain, it is possible to use a genetic algorithm to find a set of SOM parameter values whose clustering performance is comparable to the clustering performed by domain experts.

H2: The GA-generated set of SOM parameter values will have equal or better clustering performance than the set of SOM parameter values discovered in the linear search.

A. Experiment Design

LDAP directory metadata from a large public university was used as the experimental data. This dataset contains 191 object classes, each having a set of attributes. The object classes were divided into two groups: a training dataset (TDS) of two thirds (128) of the object classes, used for training, and a holdout dataset (HDS) of one third (63) of the object classes, used for testing. Domain experts (with 6 months to 3 years of experience using LDAP directory data) were asked to manually cluster the targeted directory metadata, and their clustering was used as a standard to evaluate computer-generated clustering.

B. Experiment Results and Analysis

1) Establishing a Standard of Experts’ Clustering Results

Six domain experts participated in the experiment. Considerable variation appeared in the clusters created by the experts for both the training dataset (TDS) and the holdout dataset (HDS), making it difficult to find a “master expert.” The computer clustering metrics were therefore evaluated by comparing the computer results with those of each expert and then calculating a mean over the collective metrics results.

2) Current Experiment Results and Analysis

In the current experiment, the GA-based approach was first applied to the training dataset (TDS). The genetic-encoding values for running the GA are summarized in Table 1. Table 2 lists the SOM parameter values discovered by the GA approach. The performance of the SOM parameter values was evaluated using the clustering results of the domain experts. Lastly, the discovered SOM parameter values were used to cluster the holdout dataset (HDS). The performance metrics values for both the TDS and HDS datasets are shown in Table 3.

Table 1. GA Parameter Values

Population  Crossover  Mutation  xdim  ydim  Neighborhood  Training
Size        Prob.      Prob.                 Size          Iterations
100         0.8        0.08      2-12  2-12  2-10          5,000-60,000
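The GA loop of Section III.C, run with the Table 1 settings, might be sketched as follows. The paper does not specify the selection scheme, the crossover type, the mutation operator, or the termination condition, so binary tournament selection, single-point crossover, per-gene resampling, and a fixed generation budget are all assumptions here, as are the function names. The `fitness` argument stands in for the expensive step of decoding a genome, running SOM, and scoring the result against the experts.

```python
import random
from typing import Callable, List, Tuple

Genome = Tuple[int, ...]
# (low, high) per gene: xdim, ydim, neighborhood size, training iterations (Table 1).
RANGES = [(2, 12), (2, 12), (2, 10), (5_000, 60_000)]


def random_genome(rng: random.Random) -> Genome:
    return tuple(rng.randint(lo, hi) for lo, hi in RANGES)


def tournament(pop: List[Genome], scores: List[float], rng: random.Random) -> Genome:
    """Pick the fitter of two randomly drawn individuals (assumed selection scheme)."""
    i, j = rng.randrange(len(pop)), rng.randrange(len(pop))
    return pop[i] if scores[i] >= scores[j] else pop[j]


def crossover(a: Genome, b: Genome, rng: random.Random) -> Tuple[Genome, Genome]:
    """Single-point crossover between two parents (assumed operator)."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(g: Genome, rng: random.Random, p_mut: float) -> Genome:
    """Resample each gene within its range with probability p_mut."""
    return tuple(
        rng.randint(lo, hi) if rng.random() < p_mut else gene
        for gene, (lo, hi) in zip(g, RANGES)
    )


def run_ga(
    fitness: Callable[[Genome], float],
    pop_size: int = 100,      # Table 1
    p_cross: float = 0.8,     # Table 1
    p_mut: float = 0.08,      # Table 1
    generations: int = 30,    # assumed termination condition
    seed: int = 0,
) -> Tuple[Genome, float]:
    if generations < 1 or pop_size < 1:
        raise ValueError("generations and pop_size must be at least 1")
    rng = random.Random(seed)
    pop = [random_genome(rng) for _ in range(pop_size)]
    for gen in range(generations):
        scores = [fitness(g) for g in pop]  # evaluate the whole generation
        if gen == generations - 1:          # termination test after evaluation
            break
        children: List[Genome] = []
        while len(children) < pop_size:
            a, b = tournament(pop, scores, rng), tournament(pop, scores, rng)
            if rng.random() < p_cross:
                a, b = crossover(a, b, rng)
            children += [mutate(a, rng, p_mut), mutate(b, rng, p_mut)]
        pop = children[:pop_size]
    best = max(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]          # best genome of the final generation
```

A toy call such as `run_ga(lambda g: -abs(g[0] - 10) - abs(g[1] - 8), pop_size=20, generations=5)` exercises the loop without running SOM. In a real run the fitness call dominates the cost, which is consistent with the multi-hour run times mentioned in Section VI.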

Table 2. GA Generated SOM Parameter Values

xdim  ydim  Neighborhood Size  Training Iterations
10    8     3                  27,968
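To make the role of the four parameters in Table 2 concrete, here is a minimal online SOM training loop in plain Python. It is a generic textbook-style sketch, not the SOM implementation used in the paper: the rectangular grid, the square ("bubble") neighborhood, the linear decay of learning rate and radius, the initial learning rate `alpha0`, and the single training phase are all assumptions, and the paper does not describe how directory object classes are turned into input vectors.

```python
import random
from typing import Dict, List, Sequence, Tuple

Node = Tuple[int, int]


def best_matching_unit(weights: Dict[Node, List[float]], v: Sequence[float]) -> Node:
    """Return the grid node whose weight vector is closest to v (squared Euclidean)."""
    return min(weights, key=lambda n: sum((w - x) ** 2 for w, x in zip(weights[n], v)))


def train_som(
    data: Sequence[Sequence[float]],
    xdim: int = 10,            # Table 2
    ydim: int = 8,             # Table 2
    neighborhood: int = 3,     # Table 2, used here as the initial radius
    iterations: int = 27_968,  # Table 2
    alpha0: float = 0.05,      # assumed initial learning rate
    seed: int = 0,
) -> Dict[Node, List[float]]:
    rng = random.Random(seed)
    dim = len(data[0])
    # One random weight vector per grid node.
    weights = {
        (x, y): [rng.random() for _ in range(dim)]
        for x in range(xdim)
        for y in range(ydim)
    }
    for t in range(iterations):
        decay = 1.0 - t / iterations                    # goes from 1 toward 0
        alpha = alpha0 * decay                          # shrinking learning rate
        radius = 1.0 + (neighborhood - 1.0) * decay     # shrinking neighborhood
        v = data[rng.randrange(len(data))]
        bx, by = best_matching_unit(weights, v)
        for (x, y), w in weights.items():
            if max(abs(x - bx), abs(y - by)) <= radius:
                for i in range(dim):
                    w[i] += alpha * (v[i] - w[i])       # pull node toward the input
    return weights
```

After training, each input can be assigned to the node returned by `best_matching_unit`, and inputs that share a node (or neighboring nodes) form a cluster. The grid size bounds the number of possible clusters, which is one reason xdim and ydim matter so much to clustering quality.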

Table 3. Performance Metrics for the Proposed GA-based Approach

Dataset  Metric  Expert 1  Expert 2  Expert 3  Expert 4  Expert 5  Expert 6  Mean
TDS      CE      0.032     0.025     0.031     0.025     0.289     0.119     0.087
TDS      NCE     0.412     0.484     0.437     0.385     0.857     0.717     0.549
TDS      CR      0.493     0.663     0.497     0.598     0.079     0.174     0.417
TDS      CP      0.729     0.422     0.649     0.634     0.781     0.757     0.662
HDS      CE      0.047     0.171     0.216     0.032     0.026     0.025     0.086
HDS      NCE     0.315     0.600     0.659     0.235     0.198     0.195     0.367
HDS      CR      0.625     0.262     0.215     0.765     0.824     0.849     0.590
HDS      CP      0.758     0.841     0.826     0.765     0.780     0.765     0.789

Below, the experiment results are analyzed with respect to the hypotheses.

H1: For the LDAP directory metadata domain, it is possible to use a genetic algorithm to find a set of SOM parameter values whose clustering performance is comparable to the clustering performed by domain experts.

In order to confirm H1, the results should show that computer-generated clustering results cannot be distinguished from clusters created by domain experts. Two groups were analyzed: Group 1 is composed of the evaluation metrics results from expert clustering; Group 2 contains the evaluation metrics results for computer clustering. A t-test was performed on these two groups for each of the four clustering metrics. The results of the t-test are listed in Table 4. None of the TDS P-values in Table 4 is significant at the 0.05 alpha level. This indicates that there is no statistically significant difference between the two groups being compared. The clustering result for HDS (Table 3, HDS section) shows that the GA-generated SOM parameter values perform well on the holdout dataset. Based on the above discussion, it can be concluded that the clustering performance of the GA-generated SOM parameter values is comparable to the performance of human experts. Hypothesis 1 is thus supported.

Table 4. T-test Results for the Comparison of GA Approach with Human Experts

         TDS                         HDS
Metrics  CE     NCE    CR     CP     CE     NCE    CR     CP
P-value  0.345  0.482  0.371  0.073  0.403  0.420  0.469  0.073

Note: α = 0.05; two-tailed t-test.

H2: The GA-generated set of SOM parameter values will have equal or better clustering performance than the set of SOM parameter values discovered in the linear search.

Table 5 lists the performance metrics results of the GA-based approach and the linear search approach of our prior research [22]. The F-measure value is the major factor for overall clustering performance. The F-measure value of the GA-based approach (0.4512) is larger than the F-measure value of the linear search (0.4078). In addition, the CE and NCE values of the GA-based approach are smaller than those of the linear search (0.0865 and 0.5486 versus 0.0936 and 0.5922); CR of the two approaches is similar (0.4172 versus 0.4198); and CP of the GA approach is higher than CP of the linear search approach (0.6620 versus 0.5364). Further, the clustering result of the GA-based approach is closer to that of the domain experts (see Table 3) than is the case for the linear search approach. Overall, the experiment indicates that the SOM parameter values generated by the genetic algorithm perform better than the ones generated by the linear search. So, Hypothesis 2 is supported.

Table 5. Performance Metrics Values

Metrics  Linear Search  GA Approach
CE       0.0936         0.0865
NCE      0.5922         0.5486
CR       0.4198         0.4172
CP       0.5364         0.6620
F Value  0.4078         0.4512

V. CONCLUSIONS AND DISCUSSION

Clustering of LDAP directory metadata at a domain expert level is a particularly important contribution to promoting interoperability among directories. This paper proposes a GA-based approach to systematically find good SOM parameter values that can cluster LDAP directory metadata at a domain expert level. The experiment shows that the proposed approach improves on the traditional (linear) search for SOM parameter values, raising the F-measure from 0.4078 to 0.4512. To our knowledge, this is the first study that attempts to systematically find good SOM parameter values for a particular application domain. The proposed GA approach to discovering SOM parameter values provides an effective mechanism to search a large space of SOM parameter values for clustering directory metadata and so promotes semantic homogeneity among directories.

VI. LIMITATIONS AND FUTURE RESEARCH

This paper treats the clustering of only a single LDAP directory, which may limit its generalizability. Testing the same SOM parameter values on another LDAP directory remains to be done. There are considerable variations in the clustering results of the six domain experts, and a “representative expert” was constructed by using a statistical mean and variance to represent the collective results. The GA is very compute-intensive; a single run sometimes takes several hours. Using Grid [24] [25] and other parallel computing techniques should significantly reduce the turnaround time for the GA, and this work is in progress.

One area of investigation is to test the approach across multiple directories, for example, whether SOM parameter values discovered for one directory remain good for another directory. Another area of study is to explore the possibility of automating the proposed approach; in the current study, domain experts are still needed to manually cluster the directory data. The approach would be more practical and more useful if a set of reference clusters could be generated automatically and used for evaluation.

REFERENCES

[1] T. Howes, M. Smith, and G. S. Good, Understanding and Deploying LDAP Directory Services. Macmillan Technical Pub, 1999.
[2] S. Hayward, J. Graff, and N. MacDonald, "Business strategy will drive directory services," The GartnerGroup, 1999.
[3] "eduPerson Object Class," Net@Edu, EDUCAUSE/Internet2 eduPerson task force, 2005.
[4] I. Foster, C. Kesselman, and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations," International Journal of Supercomputer Applications, vol. 15, pp. 200-222, 2001.
[5] N. Nikols, "Directory Project Cookbook, V2 October 20, 2004," in Directory and Security Strategies, Methodologies and Best Practices: The Burton Group, 2004.
[6] T. Kohonen, Self-Organizing Maps. New York: Springer, 2001.
[7] J. H. Holland, "Genetic Algorithms," Scientific American, pp. 66-72, 1992.
[8] A. Nürnberger, "Clustering of document collections using a growing self-organizing map," presented at Proceedings of BISC International Workshop on Fuzzy Logic and the Internet, 2001.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review," ACM Computing Surveys, vol. 31, pp. 264-323, 1999.
[10] T. Kohonen, J. Hynninen, J. Kangas, and J. Laaksonen, "SOM_PAK: The self-organizing map package, version 3.1," SOM programming team of the Helsinki University of Technology, 1995.
[11] P. Mangiameli, S. K. Chen, and D. West, "A comparison of SOM Neural Network and Hierarchical Clustering Methods," European Journal of Operational Research, vol. 93, pp. 402-417, 1996.
[12] H. Zhao and S. Ram, "Clustering Schema Elements for Semantic Integration of Heterogeneous Data Sources," Journal of Database Management, vol. 15, pp. 88-106, 2004.
[13] D. Polani and T. Uthmann, "Training Kohonen Feature Maps in different Topologies: an Analysis using Genetic Algorithms," presented at Proceedings of the Fifth International Conference on Genetic Algorithms, San Mateo, CA, 1993.
[14] M. Y. Kiang, U. R. Kulkarni, and K. Y. Tam, "Self-organizing map networks as an interactive clustering tool - An application to group technology," Decision Support Systems, vol. 15, pp. 351-374, 1995.
[15] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, A. Paatero, and V. Saarela, "Self Organization of a Massive Document Collection," IEEE Transactions on Neural Networks, Special Issue on Neural Networks for Data Mining and Knowledge Discovery, vol. 11, pp. 574-585, 2000.
[16] J. H. Holland, "Genetic Algorithms," Scientific American, vol. 267, pp. 66-72, 1992.
[17] Q. Wu, S. S. Iyengar, N. S. V. Rao, J. Barhen, V. K. Vaishnavi, H. Qi, and K. Chakrabarty, "On computing the route of a mobile agent for data fusion in a distributed sensor network," IEEE Transactions on Knowledge and Data Engineering, vol. 16, pp. 740-753, 2004.
[18] D. G. Roussinov and H. Chen, "Document clustering for electronic meetings: an experimental comparison of two techniques," Decision Support Systems, vol. 27, pp. 67-79, 1999.
[19] B. Larsen and A. Aone, "Fast and Effective Text Mining Using Linear-time Document Clustering," presented at Proc. of the Fifth ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, 1999.
[20] B. Stein and S. M. Z. Eissen, "Document Categorization with MajorClust," presented at 12th Annual Workshop on Information Technologies and Systems (WITS'02), Barcelona, Spain, 2002.
[21] C. Van Rijsbergen, Information Retrieval, 2nd ed. London: Butterworth, 1979.
[22] J. Liang, V. K. Vaishnavi, and A. Vandenberg, "Clustering of LDAP Directory Schemas to Facilitate Information Resources Interoperability Across Organizations," IEEE Transactions on Systems, Man, and Cybernetics, Part A, 2006; to appear.
[23] C. Lin, H. Chen, and J. F. Nunamaker, "Verifying the proximity and size hypothesis for self-organizing maps," Journal of Management Information Systems, vol. 16, pp. 57-70, 1999.
[24] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure. San Francisco, CA: Morgan Kaufmann, 1999.
[25] I. Foster, C. Kesselman, J. M. Nick, and S. Tuecke, "The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration," presented at Open Grid Service Infrastructure Working Group, Global Grid Forum, Edinburgh, Scotland, 2002.
