Clustering of LDAP Directory Schemas to Facilitate Information Resources Interoperability Across Organizations

Jianghua Liang, Vijay K. Vaishnavi, Fellow, IEEE, and Art Vandenberg, Member, IEEE
Abstract—Directories provide a well-defined general mechanism for describing organizational resources such as the resources of the Internet2 higher education research community and the Grid community. Lightweight directory access protocol directory services enable data sharing by defining the information's metadata (schema) and access protocol. Interoperability of directory information between organizations is increasingly important. Improved discovery of directory schemas across organizations, better presentation of their semantic meaning, and fast definition and adoption (reuse) of existing schemas promote interoperability of information resources in directories. This paper focuses on the discovery of related directory object class schemas and in particular on clustering schemas to facilitate discovering relationships and so enable reuse. The results of experiments exploring the use of self-organizing maps (SOMs) to cluster directory object classes at a level comparable to a set of human experts are presented. The results show that it is possible to discover the values of the parameters of the SOM algorithm so as to cluster directory metadata at a level comparable to human experts.

Index Terms—Clustering analysis, clustering evaluation, LDAP directories, neural network configuration, self-organizing maps.
I. INTRODUCTION
Directories provide a well-defined general mechanism for describing resources within an organization and for enabling their discovery by individuals and applications [13]. Lightweight directory access protocol (LDAP) directory services enable data sharing by defining both the information metadata (schema) and its access protocol. While arguably a simple concept, the appropriate use of directory services is being recognized as a key to competitive advantage [11]. This is not surprising given the increasing focus on organizational
Manuscript received August 1, 2003; revised January 9, 2004 and May 26, 2004. This work was supported in part by the National Science Foundation (NSF) under ITR Research Grant IIS-0312636, in part by a subaward to NSF Grant ANI-0123937, in part by Sun Microsystems under Academic Equipment Grant EDUD 7824-010460-US, and in part by Georgia State University through the Robinson College of Business and the Department of Information Systems and Technology. This paper was recommended by Associate Editor C. Hsu.
J. Liang is with Lexmark International, Inc., Lexington, KY 40550 USA (e-mail: [email protected]).
V. K. Vaishnavi is with the Department of Computer Information Systems, Robinson College of Business, Georgia State University, Atlanta, GA 30303 USA (e-mail: [email protected]).
A. Vandenberg is with Information Systems and Technology, Georgia State University, Atlanta, GA 30303 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TSMCA.2005.851277
learning and the potential for directory services to consolidate important facets of organizational knowledge; indeed, a core issue in knowledge management is the identification of potential sources of knowledge [10], [47]. While directory services have focused on sharing information, they have primarily focused on doing so within an organization and relatively little attention, until recently, has been given to sharing information across organizational boundaries. Increasingly, a core source of competitive advantage is recognized to consist of making optimal use of internal resources as well as potential external resources. Compelling arguments have been advanced that the appropriate use of external resources is not only necessary, but also essential to success in the future business environment [36]. Examples of organizations that use directories to describe their resources are the Internet2 higher education research community [16] and the Grid community [9], both of which use LDAP structures, respectively, for describing people and Grid computing components to support the work of their organizations. The directories working group of the Internet2 Middleware Architecture Committee for Education (MACE) considers “Directories . . . the operational linchpin of almost all middleware services. They can contain critical customization information for people, processes, resources, and groups” [5]. The MACE working groups are developing standard schemas (persons, organizations, videoconferencing, groups, . . .), recommending best practices for metadirectory architecture, and implementing directory-based federated interorganizational authentication and authorization for the higher education research community [33]. The Grid community’s Globus Toolkit 3.0 is delivered with an OpenLDAP directory used to store resource information of distributed Grid components [8], [37]. Other organizations similarly use LDAP as a mechanism for directory-enabled authentication, mail, or network management services, and indeed, LDAP is a core element of network services offered by vendors such as Microsoft or Novell. While such directory-enabled services can be provided within an organization, offering or accessing these services beyond the home organization requires significant coordination of directory architectures and standards for the directory schema. The focus of the current work on LDAP directory metadata is in part based on our working experience with directory initiatives in the Internet2 and Grid communities [51]. A directory (object class) schema is created by one of three sources: a standards body, the vendor of a directory product, or a directory administrator. It may take a long time for any
standards body to produce a schema [6]—the EDUCAUSE/Internet2 eduPerson Task Force [7] took 18 months to adopt six attributes for "eduPerson." Slow definition of standard schemas may impede the usability and interoperability of information resources in interdirectory services. Vendors of LDAP directories may deliver products whose schemas have alternate names, include schemas that are specific to their own products, or introduce new schemas that lag behind the research and development community. A key challenge in directory schema design is for the directory administrator to locate appropriate predefined schemas to reuse [13]. Indeed, directory administrators often create new schemas or extend standard schemas to meet local needs, thus further impeding interoperability. Improved discovery of existing schemas, better presentation of semantic meaning, and fast definition and adoption (reuse) of schemas across organizations would promote the usability and interoperability of information resources in directories.

The objective of this paper is to explore how LDAP object classes can be clustered to facilitate discovery of related object classes and so better enable their reconciliation and reuse. This is a novel approach to an important instance of the metadata integration or semantic interoperability problem that many domains have been facing [12], [48]. Our approach intends to integrate metadata with little prior knowledge of semantic relationships, thus seeking to automate aspects of the metadata knowledge modeling that is needed in the mediation approach to metadata integration; Rensselaer's metadatabase system for integrating heterogeneous distributed information systems, more specifically manufacturing systems, is a very good example of the mediation approach [4], [14], [15].

The rest of the paper proceeds as follows. Section II discusses why clustering analysis is a good technique for discovering object interrelationships, why the self-organizing map (SOM) is a good clustering technique to use, and our approach to applying SOM. Section III presents feature selection, addresses preliminary SOM parameter issues needing resolution, states the research question, and discusses the metrics used for evaluating clustering results. Section IV describes our empirical study and the hypotheses and experiments conducted
to study the hypotheses. Section V presents the results of our experiments. Section VI draws conclusions and makes suggestions for further work.

II. CLUSTERING ANALYSIS AND SOM

Clustering analysis is a well-known approach for structuring previously unknown and unclassified datasets [38]. Clustering is useful when there is little prior information (e.g., statistical models) available about the data, and the decision maker must make as few assumptions about the data as possible. It is under these restrictions that clustering technology is particularly appropriate for the exploration of interrelationships among the data points to assess their structure [19]. Since directory schemas are extensible and there exist no predefined schema categories, it is appropriate to use clustering techniques to structure them. This observation has been corroborated by the research results of Zhao and Ram, who point out that clustering analysis is more suitable than classification for the identification of schema-level correspondences [53], [54].

Fig. 1. Taxonomy of clustering approaches [53].

Different approaches to clustering data can be described with the help of the hierarchy shown in Fig. 1 [53]. The availability of such a vast collection of clustering algorithms in the literature can easily confound a user attempting to select an algorithm suitable for the problem at hand [19]. Fortunately, researchers have already compared many clustering algorithms against each other, and we can draw some useful conclusions from their work. Mangiameli et al. compared SOM and seven hierarchical clustering methods experimentally and found that SOM is superior to all of the hierarchical clustering methods [34]. Zhao and Ram compared K-means, hierarchical clustering, and SOM for clustering relational database attributes. They concluded that the three methods have similar clustering performance and that SOM is better than the other two methods in visualizing clustering results [53], [54].

SOM is very effective for visualization of high-dimensional data. It produces a similarity graph of input data by converting the nonlinear statistical relationships between high-dimensional data into simple geometric relationships of their
image points on a low-dimensional display, usually a regular two-dimensional grid of nodes (Fig. 2). It thereby compresses information while preserving the most important topological and geometric relationships of the primary data elements on the display. It is then possible to visually identify clusters from the map. The main advantage of such a mapping is that, due to the topology-preserving nature of SOM, it is possible to gain some idea of the structure of the data by observing the map.

Fig. 2. Kohonen SOM network topology.

SOM is based on the associative neural properties of the brain. The Kohonen SOM network (Fig. 2) contains two layers of vectors: input vectors and mapping vectors, the latter usually in the shape of a two-dimensional grid. The input vector size is equal to the number of unique features associated with the input objects. Each vector of the mapping layer has the same number of features as an input vector. Thus, the input objects and the mapping cells can be represented as vectors that contain the input features. The mapping vectors are initialized with random numbers. Each actual input is compared with each vector on the mapping grid. The "winning" mapping vector is defined as the one with the smallest distance (e.g., Euclidean distance) between itself and the input vector. The input thus maps to a given mapping vector. The value of the mapping vector is then adjusted to reduce the distance, and its neighboring vectors may be adjusted proportionally. In this way, the multidimensional (in terms of features) input vectors are mapped to a two-dimensional output grid. After all of the input is processed (usually after hundreds or thousands of repeated presentations), the result should be a spatial organization of the input data organized into clusters of similar (neighboring) regions [40].

Typically, pattern-clustering activity involves the following steps [18]:
1) pattern representation (optionally including feature extraction and/or selection);
2) definition of a pattern proximity (similarity) measure appropriate to the data domain;
3) clustering or grouping;
4) data abstraction (if needed); and
5) assessment of output (if needed).

There are no theoretical guidelines for the appropriate feature selection and extraction techniques to use in a specific situation [19]. Feature selection and extraction strategies are usually highly dependent on the specific application domain. In the case of the LDAP domain, we parse each object class to extract attributes that can be selected for the feature set.

A similarity measure is a key component in clustering analysis. Measures of similarity and distance in vector spaces include correlation, direction cosines, Euclidean distance, measures of similarity in the Minkowski metric, the Tanimoto similarity measure, weighted measures for similarity, and comparison by operations of continuous-valued logic [24]. Traditionally, Euclidean distance is used in SOM, and we too use it in our work.

Effective SOM clustering depends on the parameter values chosen for a certain application domain [39]. Since users usually are not aware of the structure present in the data (that is why clustering analysis is needed), it is not only difficult to determine what parameter values to use, it is also difficult to say when the map has organized into a proper cluster structure [1]. Therefore, simulations have to be run several times, using different parameter values, before selecting the best one [45]. We do not employ any additional data abstraction in our work.

Cluster validity analysis is the assessment of a clustering procedure's output. Usually, cluster validity assessment is objective and is performed to determine whether the output is meaningful. In our application domain of LDAP directory metadata, the assessment of a clustering procedure's output against a directory administrator's judgment is very important in order to make our directory metadata clustering tool meaningful and useful to its users.

Given the foregoing discussion, the focus of this paper is to apply SOM to the clustering of LDAP directory object classes and to determine the efficacy of our approach for finding good SOM parameter values that result in clustering comparable to human experts' clustering.
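To make the training procedure just described concrete, the following minimal sketch shows the core Kohonen update loop with Euclidean distance and a shrinking neighborhood. It is our illustration, not the implementation used in this study (the experiments used the SOM_PAK package [25]); the default parameter values are placeholders.

import numpy as np

def train_som(inputs, x_dim=7, y_dim=9, iterations=10_000,
              lr_initial=0.05, lr_final=0.02, radius_initial=2):
    """Minimal Kohonen SOM: inputs is an (n_objects, n_features) array."""
    rng = np.random.default_rng(0)
    n_features = inputs.shape[1]
    weights = rng.random((x_dim, y_dim, n_features))       # random mapping vectors
    grid = np.stack(np.meshgrid(np.arange(x_dim), np.arange(y_dim),
                                indexing="ij"), axis=-1)    # cell coordinates

    for t in range(iterations):
        frac = t / iterations
        lr = lr_initial + (lr_final - lr_initial) * frac    # decaying learning rate
        radius = max(radius_initial * (1 - frac), 0.5)      # shrinking neighborhood
        x = inputs[rng.integers(len(inputs))]               # present one input vector
        # winner = mapping vector with the smallest Euclidean distance to x
        dists = np.linalg.norm(weights - x, axis=2)
        winner = np.unravel_index(np.argmin(dists), dists.shape)
        # move the winner and, proportionally, its grid neighbors toward x
        grid_dist = np.linalg.norm(grid - np.array(winner), axis=2)
        influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
        weights += lr * influence[..., None] * (x - weights)
    return weights

def map_inputs(inputs, weights):
    """Assign each input to its best-matching cell; nearby cells form clusters."""
    flat = weights.reshape(-1, weights.shape[-1])
    return [np.unravel_index(np.argmin(np.linalg.norm(flat - x, axis=1)),
                             weights.shape[:2]) for x in inputs]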
III. USING SOM FOR CLUSTERING LDAP OBJECT CLASSES

A. SOM Input Feature Selection

Feature selection is the process of identifying the most effective subset of the original feature set to use in clustering. Feature selection has a profound influence on the final SOM mapping results. The aim of good input feature selection is to assign similar vector values to similar objects and at the same time mimic a human expert's mental model. The purpose of feature selection and extraction is to gather facts and conjectures about the data, optionally perform feature selection and extraction, and design the subsequent elements of the clustering system. A careful investigation of the available features and any available transformations (even simple ones) can yield significantly improved clustering results [19].

In LDAP, object classes are used to group related information. Typically, an object class models some real-world object such as a person, printer, or network device. The definition of an LDAP object class includes the following pieces of information:
1) an object identifier (OID) that uniquely identifies the class;
2) a name that also uniquely identifies the class;
3) a set of mandatory attributes;
4) a set of allowed attributes.

The name of an object class is usually mnemonic and easily understood by humans. A name is the most common way for humans, clients, and servers to refer to the object class. Some examples of object class names include person, printer, groupOfNames, and applicationEntity [13]. Attributes (also requiring both an OID and a name) that an object class definition includes must be unique throughout the entire directory schema. The set of mandatory (required) attributes is usually fairly short or even empty. The set of allowed (optional) attributes is often quite long. It is the job of each directory server to enforce the attribute restrictions of an object class when an entry is added to the directory or modified in any way. One object class can be derived from another, in which case it inherits the characteristics of the other class. This is sometimes called subclassing, or object class inheritance. An example of this is shown in Fig. 3 [13].

Fig. 3. Object class inheritance [13].

An Object Class Example: The format for representation of object classes is defined in X.501 (ITU-T Recommendation
X.501 [17]). One of the most common uses for a directory system is to store information about people. An example object class schema definition in the Sun ONE/iPlanet directory is the "person" objectclass:

    objectclasses: (2.5.6.6 NAME "person" DESC "Standard ObjectClass"
        SUP "top" MUST (objectclass $ sn $ cn)
        MAY (aci $ description $ seeAlso $ telephoneNumber $ userpassword))

The object class OID is 2.5.6.6. Its name is person. Its DESC value is Standard ObjectClass, and its superior class is top. Every object class must have at least one MUST attribute. Person's mandatory attributes are objectclass, sn, and cn, and its allowed attributes are access control information (aci), description, seeAlso, telephoneNumber, and userpassword. ($ is used as a separator.)

An instantiation of the Person object class may be as shown in Fig. 4 [13]. This object class "represents" a person named Babs Jensen. It "requires" (MUST) attributes of the object class that provide naming information—the person's surname (sn: Jensen) and common name (cn: Babs Jensen)—as well as the names of its two superior object classes (objectclass: top, person). And it "allows" optional (MAY) attributes that provide a description, telephoneNumber, userpassword, and a seeAlso (cross-reference to a) related name.

Fig. 4. Sun ONE/iPlanet person object class.

Preliminary Decisions: We use OID, NAME, SUP, MUST, and MAY attributes as input vector features. SUP, MUST, and MAY attributes may be repeated in one or more object classes, but only those attributes that appear in at least three object classes are included with the input vector features. Although this selection may appear arbitrary (rather than, say, only including those attributes appearing in at least four object classes, or in just one), our experiments with several kinds of attribute-feature-selection strategies found that this threshold provided good SOM clustering results. We included all OID and NAME values. Even though these were unique to an object class (and thus did not occur in three or more object classes), we observe that people place more emphasis on these "naming" attributes and, indeed, consider similarity in OID and NAME values as important for understanding the object class. For instance, person, organizationalPerson, and inetOrgPerson are seen as similar by virtue of having the string "person" in their NAME. In general, we used Levenshtein's distance measure [31] to measure string similarity between all character strings (except OID) contained in an object class and the input vector features. Because OID numerical strings are semantically like numeric Internet URL addresses, the similarity value between an OID string and an input vector feature is computed by comparing the common characters from the front of both strings.
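As an illustration of these two string measures, the following sketch (our own hypothetical helpers, not the prototype's code) computes the Levenshtein distance for name-like strings, one plausible normalization of it to a similarity score, and a common-prefix score for OID strings.

def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance [31]: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1] for NAME/SUP/MUST/MAY values."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b))

def oid_similarity(a: str, b: str) -> float:
    """OIDs are hierarchical like URLs, so compare shared characters from the front."""
    common = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        common += 1
    return common / max(len(a), len(b))

# e.g., person vs. organizationalPerson share the substring "person"
print(name_similarity("person", "organizationalPerson"))  # 0.3
print(oid_similarity("2.5.6.6", "2.5.6.7"))               # about 0.86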
B. Selection of Parameter Values

SOM parameters relate to the size of the map (x dim, y dim) that defines where objects are placed, the relationship of cells to other nearby cells (neighborhood size), the number of preliminary and fine-tuning passes used to test an input object's distance to successive cells (initial and final training iterations), the speed with which map cells adjust to each other
(initial learning rate that reduces to a final learning rate), and the format of neighboring cells (neighborhood type—such as rectangular or hexagonal placement of cells). Given an output map initialized with random vectors, these SOM parameters govern how input features train the mapping vectors to their final configuration.

SOM is expected to produce a topologically correct mapping between input and output spaces. This mapping is found with the Kohonen learning rule, which is sensitive to its parameter values. A poor choice of parameter values results in a mapping that may be topologically meaningless to its users [35]. A major drawback of SOM has been the lack of a theoretically justified criterion for parameter value selection. Parameter values have a decisive effect on the reliability of visual analysis, which is the main strength of SOM [27]. Different researchers have used different SOM parameter values in their respective application domains. Lin et al. chose a 20 × 10 grid map for displaying SOM outputs, based on what would fit on an output screen [32]. Kohonen et al. used a 1 002 240-node SOM [26]. Kiang et al. used a 12 × 12 matrix with neighborhood size 5 and 20 000 iterations [21]. So, while SOM has been used widely in applications, it suffers from the difficulty of parameter value selection due to its heuristic origins [2], [3], [23]. The SOM parameter values are usually chosen ad hoc by the user [29], and it is usual practice to run SOM several times on different sets of parameter values and then pick the optimum one [45]. Although there are reports of successful applications of SOM algorithms to metadata [53], [54] and textual document clustering [21], [26], [32], we observe that: 1) evaluation is subjective—a different user may evaluate the same clustering result differently; and 2) clustering results depend on the selection of input SOM parameter values.

Preliminary Experiments and Decisions: We did some initial SOM training and mapping experiments for 191 Sun ONE/iPlanet directory object classes to decide on the neighborhood type, the ratio between the final and initial training iterations, and the values for the initial and final learning rates. To make these decisions, we especially watched the person, organizationalPerson, and inetOrgPerson object classes. Since these are person-related object classes, they should be clustered together when good SOM parameter values are used in the SOM training process. We used this as the clustering performance criterion to evaluate the preliminary experiments. Fixing all the other parameter values, we compared the hexagonal neighborhood type with the rectangular type. We
found that the hexagonal neighborhood type was almost always better than the rectangular type. Similarly, when we compared final training iterations with initial training iterations, we found that the clustering performance was very good with the final training iterations set at ten times the initial training iterations. We also tried combinations of the initial and final learning rates and found good clustering results with an initial learning rate of 0.05 and a final learning rate of 0.02. So we decided to vary x dim, y dim, neighborhood size, and final training iterations, and to keep constant the initial learning rate (0.05), the final learning rate (0.02), the neighborhood type (hexagonal), and the ratio of initial to final training iterations (1:10). Focusing our research on the determination of values for the parameters x dim, y dim, neighborhood size, and the number of final training iterations, from here on we will use the phrase "SOM parameter values" to mean these parameter values only.

C. Research Question

The purpose of our clustering investigation is to understand how LDAP directory users can better identify and use existing schemas. Associating similar object classes can assist the directory user in understanding object classes and thereby proactively promote the reuse of existing object classes. Ideally, computer-generated clustering of LDAP directory object classes should be as good as a human expert's clustering results. That is, a group of LDAP directory object classes that are grouped together in an expert's cluster should also appear together in the computer (SOM) cluster. Thus, the research question for the current study is: Can SOM parameter values be identified such that a computer-generated clustering of LDAP directory object classes is comparable to (as good as) a human expert's clustering?

D. Metrics for Evaluating Clustering Results

Since any SOM algorithm will, when presented with data, produce clusters—regardless of whether the data contain clusters or not—finding out which clustering result is meaningful and useful to users is important [19]. A number of clustering evaluation techniques exist in the literature (see [20] for a broad comparison of such metrics). The most commonly used metrics are Cluster Recall and Cluster Precision [43], along with the F-measure used to combine these metrics [30],
[46], [49]. There is also literature on the use of human experts to evaluate computer-clustering performance [41], [42], [52]. After a study of the literature, we found the metrics used by Roussinov and Chen [41] particularly relevant and decided to use them in our study.

Roussinov and Chen evaluated document-clustering techniques through experiments with human subjects [41]. They used four metrics, Cluster Error (CE), Normalized Cluster Error (NCE), Cluster Recall (CR), and Cluster Precision (CP), which we also use in this study. They call a partition created by an expert a manual partition. An automatic partition is one created by a computer. Inside any partition, an association is a pair of documents belonging to the same cluster. Incorrect associations are those that exist in an automatic partition but do not exist in a manual partition. Missed associations are those that exist in a manual partition but do not exist in an automatic partition [41]. For our study, an object class corresponds to a document. The four metrics are defined as follows:

CE = (total number of incorrect and missed associations) / (total number of possible pairs of object classes)

NCE = (total number of incorrect and missed associations) / (total number of associations existing in both partitions)

CR = (total number of correct associations in automatic partition) / (total number of associations in manual partition)

CP = (total number of correct associations in automatic partition) / (total number of associations in automatic partition)

The overall clustering performance is measured by the F-measure value [30], [46], [49]. The F-measure provides an overall estimate of the combined effect of CP and CR and is a standard evaluation metric in the field of information retrieval. The F-measure formula is expressed as
F-measure = ((β² + 1) × CR × CP) / (β² × CP + CR)

where β expresses the relative importance of clustering precision versus clustering recall; the higher the F-measure value, the better the clustering result. A β of 0 means that F = Cluster Precision; a β of ∞ means that F = Cluster Recall. (β = 1 means recall and precision are equally weighted; β = 0.5 means recall is less important than precision; β = 2.0 means recall is more important than precision.) In our case, we choose β to be 1.0, assigning equal importance to CP and CR.
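A direct transcription of these metric definitions into code might look like the following sketch (our illustration; a partition is assumed to be a list of clusters, each a collection of object class names). Reading "associations existing in both partitions" as the union of the two association sets is our interpretation of the NCE denominator.

from itertools import combinations

def associations(partition):
    """All unordered pairs of object classes that share a cluster in a partition."""
    pairs = set()
    for cluster in partition:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs

def clustering_metrics(automatic, manual, n_objects):
    auto, man = associations(automatic), associations(manual)
    incorrect = len(auto - man)                 # in automatic but not manual
    missed = len(man - auto)                    # in manual but not automatic
    correct = len(auto & man)
    possible = n_objects * (n_objects - 1) // 2
    ce = (incorrect + missed) / possible
    nce = (incorrect + missed) / len(auto | man)  # our reading of "both partitions"
    cr = correct / len(man)
    cp = correct / len(auto)
    return ce, nce, cr, cp

def f_measure(cr, cp, beta=1.0):
    """Combined effect of cluster recall and precision; beta weights recall."""
    return (beta**2 + 1) * cr * cp / (beta**2 * cp + cr)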
IV. DESIGN OF THE EMPIRICAL STUDY

A. Hypotheses

Because of the heuristic origins of the SOM algorithm, theoretically generated parameter values do not necessarily ensure the best clustering. This may explain why a theoretical
approach is not widely adopted and users still choose SOM parameter values by ad hoc means that will provide results judged suitable for their specific domain. We observe that researchers use different SOM parameter values for their respective application domains [21], [26], [32]. We propose that when applying SOM to a different (new) domain, we need to select different values, appropriate to the new domain, for the SOM parameters (x dim, y dim, neighborhood size, final training iterations). Further, good SOM parameter values cannot just be generated theoretically, but should be validated in some way, such as by domain experts. This proposition leads to our first hypothesis.

H1₀: Using SOM parameter values that have provided a good clustering performance in another application domain will still do as well as human experts when applied to the domain of LDAP directory object classes.

H1₁: Using SOM parameter values that have provided a good clustering performance in another application domain will, however, not do as well as human experts when applied to the domain of LDAP directory object classes.

As discussed later, the results of our experiment (Experiment 1) supported the above hypothesis, demonstrating that SOM parameter values that may work well in one domain may not work well for a different domain. This suggests that we cannot simply borrow the SOM parameter values from a different domain and expect them to work well for a new domain such as LDAP directory metadata. We must find SOM parameter values specifically for the LDAP directory metadata domain. We propose that parameter values for the domain of LDAP directory metadata can be experimentally discovered. This suggests our second hypothesis.

H2₀: It is not possible, by an experimental approach, to find SOM parameter values that can cluster LDAP directory object classes as well as human experts.

H2₁: It is possible, by an experimental approach, to find SOM parameter values that can cluster LDAP directory object classes as well as human experts.

Results of the second experiment supported the above hypothesis, showing that, indeed, we can experimentally discover SOM parameter values for the domain of LDAP directory metadata such that the clustering results are comparable to those of the human experts. This did not, however, suggest that the SOM parameter values discovered experimentally from
a set of object classes would generalize to a new set of object classes. Artificial neural networks do not automatically generalize. Typically, there are three necessary conditions for good generalization [44].
1) Inputs to the network should contain sufficient information pertaining to the target, so that there exists a mathematical function with the desired degree of accuracy relating inputs to outputs.
2) The function being learned or trained (that relates inputs to outputs) should be, in some sense, smooth. In other words, a small change in the inputs should produce a small change in the outputs.
3) Training cases should be a sufficiently large and representative subset of the set of all cases to be generalized.

Although SOM is a neural network algorithm, it may be described as a nonlinear, smooth mapping of high-dimensional data onto a low-dimensional array [24], [29]. Therefore, it meets condition 2). If the experiment is properly designed, conditions 1) and 3) can be satisfied. Yet, meeting these three conditions is not sufficient for SOM with certain parameter values to be generalizable for a given application. Therefore, we have our third hypothesis.

H3₀: When we use the discovered SOM parameter values to train the SOM map and then map new data from the same domain, the resulting clusters will not be as good as those formed by human experts.

H3₁: When we use the discovered SOM parameter values to train the SOM map and then map new data from the same domain, the resulting clusters will be as good as those formed by human experts.

Our second experiment included a step that tested results for a holdout dataset (HDS) to test the generalizability hypothesis.

B. Experimental Design

In order to test the research hypotheses described above, we designed two experiments. The first experiment was used to test H1—whether we can use other published SOM parameter values in our application domain. We conducted the second experiment to test H2—whether we can find "human-comparable" parameter values by trying different permutations of the possible values for the SOM parameters—and to test H3—whether the resulting algorithm would be generalizable. In the second experiment we tried 320 different permutations of x dim and y dim values (in {3, 5, 7, 9}), neighborhood size ranging from 2 to 6, and final training iterations set as one of 10 000, 20 000, 30 000, or 40 000.

In order to conduct the experiments, we gathered object class schema data from an LDAP directory and identified human experts. We developed a research prototype for performing the experiments, conducted the two experiments, gathered data, and did statistical analysis.

Choosing Experimental Data: Some major LDAP directory products on the market are Sun iPlanet, Novell eDirectory, Microsoft Active Directory, IBM SecureWay, and OpenLDAP. Although these directories have variations, they all conform to
LDAP directory schema definition standards. For example, each directory object class has an OID, a NAME, MUST attributes, and MAY attributes. Most standard object classes have the SUP (superior) attribute (indicating inheritance from other objects). As long as we use these directory schema standards in the SOM application, the experiment results are expected to be applicable to all LDAP directories. We chose iPlanet object classes as the experimental data in this instance.

We extracted all 191 iPlanet object classes contained in an iPlanet directory and divided them into two groups randomly. Following usual practice [22], one group, with two thirds of the object classes, was used as the meta-training dataset (MTDS). (We use the term "meta-training" instead of "training" since this dataset is used to find SOM parameter values that will result in a human-comparable performance of the resulting SOM algorithm; the SOM algorithm itself then uses "training" to achieve its clustering result.) The remaining one third of the object classes, the HDS, was used for testing generalizability. The division of the data was done as follows. We numbered the object classes from 1 to 191. We then used a Microsoft Excel random number function to assign each of the 191 object classes a random number. We then selected the first 128 of these randomly numbered object classes as the MTDS, leaving the other 63 as the HDS. A visual inspection of these datasets confirmed the randomness of the split. The MTDS was representative enough that, when the HDS was mapped to the trained map, the trained map had enough knowledge of the new dataset to form proper clusters. That is, the MTDS is sufficiently large to be representative of all cases to which the algorithm needs to be generalized.

Notation: We call all object classes (191 total) in iPlanet the universal dataset (UDS). The MTDS is the 2/3 of the object classes (128) used to conduct SOM meta-training to find good parameter values. The HDS is the remaining 1/3 of the object classes (63) used to test the generalizability of the discovered SOM parameter values.

Choosing Experts: To validate the effectiveness of the SOM algorithm parameter values, we compared computer clustering results to human experts' clustering results. We defined human experts as directory administrators or researchers who work with directory object class schemas frequently. We invited six human experts, each with 6 months to 3 years of experience using LDAP directory object classes, to participate in the experiment. These experts were asked to cluster both the MTDS and HDS object classes. They did their clustering work independently and with no time constraint. Each expert's clustering results were compared to each of the other experts' clustering results to calculate CE, NCE, CR, and CP values.

Research Prototype: The experiments were conducted using a prototype system called Semantic Facilitator (Fig. 5) [28], [50]. The prototype was implemented in a Microsoft Windows environment using Kohonen SOM code packages [25] and displayed clustered LDAP objectClasses for the user (Fig. 6).
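The split can be reproduced with a sketch like the following (our rendering of the Excel-based procedure described above; the seed and the use of a random permutation are stand-ins for Excel's random number function).

import random

def split_object_classes(object_classes, n_train=128, seed=42):
    """Randomly divide the universal dataset (UDS) into MTDS and HDS.

    For the 191 iPlanet object classes, 128 (about 2/3) go to meta-training
    and the remaining 63 form the holdout set. The seed is arbitrary.
    """
    shuffled = object_classes[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# mtds, hds = split_object_classes(all_191_object_classes)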
TABLE I AVERAGES FOR EXPERT-TO-EXPERT RESULTS
Fig. 5. Semantic facilitator system architecture [28], [50].
Experiment 1—SOM Using Kiang's Parameters: To see whether carefully selected SOM parameter values from another application domain work well, Experiment 1 used the parameters of Kiang et al. (x dim = 12, y dim = 12, neighborhood size = 5, final training iterations = 20 000) [21].

Experimental Procedure: First use UDS to develop input vector features (vector size of 483). Generate input nodes for all 128 object classes in MTDS according to the input vector features. Run SOM using the parameter values of Kiang et al. to create a map. Map the 128 input nodes onto the trained map, creating an object class cluster map. Calculate CE, NCE, CR, and CP values by comparing the computer map results to each of the six human experts' clustering results for MTDS. As the second step of this experiment, generate 63 input nodes for all object classes in HDS. Project these 63 input nodes onto the map generated in the first step. Calculate CE, NCE, CR, and CP values by comparing the HDS mapping results to each of the six experts' clustering of the HDS.

Experiment 2—SOM Using 320 Permutations of Parameter Value Sets: This experiment was conducted to evaluate a wide range of SOM parameter value sets by running 320 configurations and comparing each computer clustering result to each human expert's clustering result. Based on Kohonen's suggestions [25] and other researchers' work [21], [32], we chose a set of values across which SOM parameter values could range: x dim and y dim values in {3, 5, 7, 9}; neighborhood size in {2, 3, 4, 5, 6}; and final training iterations in {10 000, 20 000, 30 000, 40 000}. This resulted in 320 SOM parameter value set permutations being tested:
P(x dim) × P(y dim) × P(training iterations) × P(neighborhood size) = 4 × 4 × 4 × 5 = 320

where P(·) denotes the number of candidate values for a parameter.

Experimental Procedure: First use UDS to develop input vector features (again, the input vector size is 483). Generate input nodes for all 128 object classes in MTDS according to the input vector features. Run SOM using each of the 320 parameter value sets. Map the 128 input nodes onto each of the 320 trained Kohonen maps, creating object class cluster maps. Calculate CE, NCE, CR, and CP values for each of the 320 object class maps by comparing the computer map results to each of the six experts' clustering results for MTDS. Sort all 320 maps in descending order of F-measure [30], [46], [49]. The best map is the topmost map. Record the topmost map's CE, NCE, CR, and CP values, and its parameter values. As the second step of this experiment, generate 63 input nodes for all object classes in HDS. Project these 63 input nodes onto the topmost map identified in the first step. Calculate CE, NCE, CR, and CP values by comparing the computer map results to each of the six experts' clustering of HDS.

Fig. 6. Semantic facilitator clustered metadata.
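A sketch of this 320-permutation search, reusing the hypothetical train_som, map_inputs, clustering_metrics, and f_measure helpers from the earlier sketches, might look as follows; to_partition is an assumed helper that groups object classes mapped to the same (or neighboring) cells into clusters.

from itertools import product
import numpy as np

def grid_search_som(inputs, n_objects, expert_partitions, to_partition):
    """Try all 320 parameter sets; keep the map whose clustering scores the
    best mean F-measure against the six experts' partitions."""
    best = None
    for x_dim, y_dim, radius, iters in product(
            (3, 5, 7, 9), (3, 5, 7, 9), (2, 3, 4, 5, 6),
            (10_000, 20_000, 30_000, 40_000)):
        weights = train_som(inputs, x_dim, y_dim, iters, radius_initial=radius)
        automatic = to_partition(map_inputs(inputs, weights))
        scores = [clustering_metrics(automatic, manual, n_objects)
                  for manual in expert_partitions]
        mean_f = np.mean([f_measure(cr, cp) for _, _, cr, cp in scores])
        if best is None or mean_f > best[0]:
            best = (mean_f, (x_dim, y_dim, radius, iters), weights)
    return best  # (best mean F-measure, parameter set, trained map)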
V. EXPERIMENTAL RESULTS
A. Baseline Comparison of Clustering by Six Experts We found considerable variation in the clusters obtained by the experts; this precluded the use of a “representative expert” or a “consensus” set of clusters. We therefore used the statistical mean and variance of the results from the human experts’ clustering as anonymous input to represent their collective results. Table I shows the results of comparing the six human experts’ clustering results to each other. The first row shows the MTDS results. The second row shows the HDS results. The CE values are calculated by averaging all five expert-to-expert CE values, which are computed by comparing each expert’s clustering results to the other five experts’ clustering results using the metric and taking the mean of the metric values. Similarly the NCE, smashCR, and CP are calculated by averaging all five expert-to-expert corresponding values and taking their mean.
TABLE II CE, NCE, CR, AND CP VALUES (RESULTING FROM COMPARING COMPUTER-GENERATED CLUSTERING RESULTS TO HUMAN EXPERTS’ CLUSTERING RESULTS)
TABLE III TOP 10 (OF 320) COMPUTER-GENERATED SOM PARAMETER VALUE SETS
TABLE IV P-VALUE RESULTS FOR EXPERIMENT 1
TABLE V P-VALUE RESULTS FOR EXPERIMENT 2
The length of the training process depends on the SOM parameter values. It averaged about 2 min to finish SOM training on a Dell Pentium 3 computer with 640 MB memory. Comparison of Computer-Generated Clustering to Expert Clustering: Table II shows the values of comparing the computer-generated clustering results to the six human experts’ clustering results for Experiments 1 and 2. Table III shows the top 10 (based on best F-Measure) SOM parameter value sets (out of the total 320 parameter value sets) of Experiment 2—which compared computer-generated clustering to the human experts’ clustering (for MTDS). Comparison of Samples: We compared two samples’ mean equity. One sample is the set of results in Table I. The other sample is the set of results in the last column of Table II. We consider these two samples as distinct. A pooled variance t-test was employed. Microsoft Excel’s data analysis package was employed to do “t-test: two-sample assuming unequal variances analysis.” Tables IV and V show the values of computer-generated clustering results compared to clustering results of human experts for Experiments 1 and 2.
Below, we present the result for each research hypothesis.

H1. We observe from Table IV (P-value results for Experiment 1) that the SOM technique with the parameter values used by Kiang et al. achieves CE and NCE values as good as the human experts' (both P-values are greater than the 5% level of significance), but it achieves very low CR values for the MTDS object classes compared to the corresponding values for the experts. The CR P-value is 0.001, much smaller than the 5% level of significance, indicating that the SOM CR value is not comparable to the human experts' CR value. Although we can say that the SOM CP is as good as the human experts' (the P-value is 0.963), the SOM F-measure value is still less than the human experts' F-measure value: the SOM F-measure is 2 × 0.177 × 0.517/(0.177 + 0.517) = 0.263, whereas the human experts' F-measure is 2 × 0.52 × 0.52/(0.52 + 0.52) = 0.52. By F-measure, the human experts' clustering results are much better than when using Kiang's
SOM parameter values to obtain clustering results. Conclusion: Hypothesis 1 is supported by Experiment 1.

H2. The P-values for CE, NCE, CR, and CP in Table V show that the SOM clustering performance for MTDS is comparable to that of the human experts at the 5% level of significance. Looking at the CE, NCE, CR, and CP values (see Tables I and II), we find that the SOM values are very close to the human experts' values. The SOM CE, NCE, and CP values are better than those of the human experts, while the SOM CR values are comparable to the corresponding values for the human experts. Conclusion: Hypothesis 2 is supported by Experiment 2.

H3. When we map the HDS object classes to the best trained map found in Experiment 2 (x dim = 7, y dim = 9, neighborhood size = 2, iterations = 10 000), we still get good cluster results (see Table II, Experiment 2 results). Three metrics (CE, NCE, CR) are as good as the human experts' with P-values above the 5% level of significance (see Table V). While the P-value of CP is low (0.035), the SOM average CP value for HDS in Table II is 0.808, whereas the human experts' average CP value for HDS in Table I is 0.687; this indicates that the SOM CP is in fact better than the human experts' CP. Further, if we calculate the F-measure for the HDS expert-to-expert results in Table I, 2 × 0.687 × 0.687/(0.687 + 0.687) = 0.687, and compare it to the F-measure for the Experiment 2 HDS average in Table II (comparing computer-generated clustering results to human experts' clustering results), 2 × 0.591 × 0.808/(0.591 + 0.808) = 0.683, we corroborate that the computer clustering results are comparable to the experts' clustering results. Conclusion: Hypothesis 3 is supported by Experiment 2.

VI. CONCLUSION AND DISCUSSION

Clustering of LDAP object classes at a level that matches human performance is a particularly important problem for the reuse of directory metadata and for semantic interoperability. We are not aware of any prior work on the clustering of LDAP metadata. There has been work on the clustering of metadata for relational databases [53], [54], but this work cannot be directly applied to LDAP schema clustering. Also, it does not attempt to obtain human-comparable SOM parameter values as we do in the current study.

Beyond deciding on how the data should be preprocessed, an important issue is the selection of SOM parameter values. We investigated whether the parameter values chosen by other researchers for a different domain would result in the performance desired in our LDAP domain. The results were negative. We therefore decided to conduct a fairly exhaustive search of SOM parameter values to find whether there exists a set of parameter values that would meet the needed performance level. We found such a set of parameter values. This shows that SOM can be used to cluster LDAP metadata at the human-expert performance level. Finally, we looked at the generalizability of the solution. We set aside an HDS that
did not participate in the training of the SOM mapping nodes using the MTDS. When this HDS was mapped to the SOM mapping nodes, the results were also favorably comparable to the human experts' clustering, indicating the generalizability of the solution.

Limitations: Using SOM to cluster directory schemas has five steps: 1) pattern representation (optionally including feature extraction and/or selection); 2) definition of a pattern proximity (similarity) measure appropriate to the data domain; 3) clustering or grouping; 4) data abstraction; and 5) assessment of output. There are many approaches to feature extraction and selection, and if we had tried more methods, we might have achieved even "better" results. We used only Euclidean distance as the similarity measure. In investigating human-comparable SOM clustering, we varied only the map dimension sizes, neighborhood size, and training iterations, and did not vary the initial learning rate, final learning rate, or map topology. In evaluating SOM clustering results, we used only six human experts to assess the output. We have not assessed the external validity of the human experts' evaluation, for example by holding out some human experts to see whether they produce the same evaluation results for the same directory metadata as the other experts. Clearly, it seems intuitive that including more experts could affect the collective results. We believe the initial selection of six experts provides a reasonable range of expert input and demonstrates the practicality of our experimental approach. There is interesting future work in this area as we consider the influence of the number of experts and their potentially varying LDAP domain expertise.

With regard to generalizability, we note that one could argue that letting the HDS participate in the building of the feature set lessens generalizability. That is, HDS attributes were included with the MTDS attributes in a UDS from which features were extracted. However, several factors need to be considered. First, since we only included attributes that appeared in three or more object classes, we used only 483 of more than 800 attributes. Second, the HDS did not participate in the meta-training to discover SOM parameter values, nor in the SOM training of the output layer (the mapping layer). Third, the HDS included 63 object classes (1/3 of all the object classes), and all of these still mapped well, demonstrating generalizability. Finally, we observe that the concept of a UDS makes sense and is practical: given a new object class, why not include it in finding optimal parameter values for SOM?

Further Research: We are extending our work in several directions. One is to find human-comparable parameter values using a genetic algorithm. Another is to provide an expert interface so that the SOM algorithm can "learn" as it sees more data and has the benefit of using the expertise of more experts. Another is to explore options for achieving human-comparable clustering without direct recourse to experts, such as using reference datasets (objects that we expect to cluster closely) to measure clustering results. Another area of future work is to investigate approaches for providing directory schema discovery and management via scalable web services components in an online, dynamic environment.
ACKNOWLEDGMENT

The authors are indebted to the anonymous referees for their careful reading of the paper and their constructive suggestions for its improvement. The authors gratefully acknowledge Prof. A. Srivastava at Georgia State University for expert advice on statistical analysis. D. Kuechler, a doctoral student in the CIS department at Georgia State University, contributed to earlier versions of this paper before his untimely death in November 2002.

REFERENCES

[1] D. Alahakoon and S. K. Halgamuge, "Dynamic self-organizing maps with controlled growth for knowledge discovery," IEEE Trans. Neural Netw., vol. 11, no. 3, pp. 601–614, May 2000.
[2] C. M. Bishop, Neural Networks for Pattern Recognition. New York: Oxford Univ. Press, 1995.
[3] C. M. Bishop, M. Svensén, and C. K. I. Williams, "GTM: The generative topographic mapping," Neural Comput., vol. 10, no. 1, pp. 215–234, 1998.
[4] W. Cheung, S. Bahri, and G. Babin, "An object oriented shell for distributed processing and systems integration," Working Paper, Dept. of Decision Sciences and Managerial Economics, Chinese Univ. of Hong Kong, Shatin, N.T., Hong Kong, 2003.
[5] Internet2. (2004). Directories [Online]. Available: http://middleware.internet2.edu/core/directories.html
[6] Dublin Core Metadata Initiative, Dublin Core Metadata Element Set, Version 1.0: Reference Description, 1995–2003. [Online]. Available: http://dublincore.org/documents/1998/09/dces/
[7] Net@Edu. (2003). eduPerson Object Class [Online]. Available: http://www.educause.edu/eduperson/
[8] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: Enabling scalable virtual organizations," Int. J. Supercomput. Appl., vol. 15, no. 3, pp. 200–222, 2001. [Online]. Available: http://www.globus.org/research/papers/anatomy.pdf
[9] GLUE Schema. (2002, Oct. 16). The Globus Project [Online]. Available: http://www.globus.org/mds/glueschemalink.html
[10] M. T. Hansen, N. Nohria, and T. Tierney, "What's your strategy for managing knowledge?" Harvard Bus. Rev., vol. 77, no. 2, pp. 106–116, 1999.
[11] S. Hayward, J. Graff, and N. MacDonald, "Business strategy will drive directory services," GartnerGroup, Stamford, CT, TG-07-4615, Mar. 11, 1999.
[12] S. Heiler, "Semantic interoperability," ACM Comput. Surv., vol. 27, no. 2, pp. 271–273, 1995.
[13] T. Howes, M. Smith, and G. S. Good, Understanding and Deploying LDAP Directory Services, 2nd ed. Boston, MA: Addison-Wesley, 2003.
[14] C. Hsu, M. Bouziane, L. Rattner, and L. Yee, "Information resources management in heterogeneous, distributed environments: A metadatabase approach," IEEE Trans. Softw. Eng., vol. 17, no. 6, pp. 604–625, Jun. 1991.
[15] C. Hsu, Enterprise Integration and Modeling—The Metadatabase Approach. Boston, MA: Kluwer, 1996.
[16] Internet2. (2003). Internet2 Middleware Initiative [Online]. Available: http://www.internet2.edu/middleware/
[17] International Telecommunication Union. (2003). ITU-T Recommendation X.501 [Online]. Available: http://www.itu.int/rec/recommendation.asp?type=folders&lang=e&parent=T-REC-X.501
[18] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data (Prentice-Hall Advanced Reference Series). Upper Saddle River, NJ: Prentice-Hall, 1988.
[19] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
[20] T. C. Jo, "Evaluation of document clustering based on term entropy," in Proc. Int. Symp. Advanced Intelligent Systems, Daejon, Korea, 2001, pp. 302–306. [Online]. Available: http://www.discover.uottawa.ca/~taeho/Publication/ic2001_08.pdf
[21] M. Y. Kiang, U. R. Kulkarni, and K. Y. Tam, "Self-organizing map networks as an interactive clustering tool—An application to group technology," Decis. Support Syst., vol. 15, no. 4, pp. 351–374, 1995.
[22] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proc. 14th Int. Joint Conf. Artificial Intelligence (IJCAI), Montreal, QC, Canada, 1995, pp. 1137–1143.
[23] T. Kohonen, Self-Organizing Maps. Berlin, Germany: Springer-Verlag, 1995.
[24] ——, Self-Organizing Maps. Berlin, Germany: Springer-Verlag, 2001.
[25] T. Kohonen, J. Hynninen, J. Kangas, and J. Laaksonen, SOM_PAK: The Self-Organizing Map Package, Version 3.1. Helsinki, Finland: SOM Programming Team, Helsinki Univ. Technol., Apr. 7, 1995.
[26] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela, "Self organization of a massive document collection," IEEE Trans. Neural Netw., vol. 11, no. 3, pp. 574–585, May 2000.
[27] T. Kostiainen and J. Lampinen, "On the generative probability density model in the self-organizing map," Neurocomputing, vol. 48, pp. 217–228, Oct. 2002.
[28] D. Kuechler, V. Vaishnavi, and A. Vandenberg, "An architecture to support communities of interest using directory services capabilities," in Proc. Hawaii Int. Conf. System Sciences, Track 9, Big Island, HI, 2003, p. 287b.
[29] J. Lampinen and T. Kostiainen, "Overtraining and model selection with the self-organizing map," in Proc. Int. Joint Conf. Neural Networks, Washington, DC, Jul. 1999, pp. 1911–1915.
[30] B. Larsen and C. Aone, "Fast and effective text mining using linear-time document clustering," in Proc. 5th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, San Diego, CA, 1999, pp. 16–22.
[31] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions and reversals," Sov. Phys.—Dokl., vol. 10, no. 8, pp. 707–710, 1966.
[32] C. Lin, H. Chen, and J. F. Nunamaker, "Verifying the proximity and size hypothesis for self-organizing maps," J. Manage. Inf. Syst., vol. 16, no. 3, pp. 57–70, 1999.
[33] Internet2. (2004). MACE-Dir [Online]. Available: http://middleware.internet2.edu/dir/
[34] P. Mangiameli, S. K. Chen, and D. West, "A comparison of SOM neural network and hierarchical clustering methods," Eur. J. Oper. Res., vol. 93, no. 2, pp. 402–417, 1996.
[35] M. McInerney and A. Dhawan, "Training the self-organizing feature map using hybrids of genetic and Kohonen methods," in Proc. IEEE Int. Conf. Neural Networks, Orlando, FL, 1994, pp. 641–644.
[36] A. Mowshowitz, "Virtual organization: A vision of management in the information age," Inf. Soc., vol. 10, no. 4, pp. 267–288, 1994.
[37] NMI. (2003). NSF Middleware Initiative, NMI-R4 Software Download Center [Online]. Available: http://www.nsf-middleware.org/NMIR4/download.asp
[38] A. Nürnberger, "Clustering of document collections using a growing self-organizing map," in Proc. Berkeley Initiative Soft Computing (BISC) Int. Workshop Fuzzy Logic and Internet, Berkeley, CA, 2001, pp. 136–141.
[39] D. Polani and T. Uthmann, "Training Kohonen feature maps in different topologies: An analysis using genetic algorithms," in Proc. 5th Int. Conf. Genetic Algorithms, San Mateo, CA, 1993, pp. 326–333.
[40] D. G. Roussinov and H. Chen, "A scalable self-organizing map algorithm for textual classification: A neural network approach to automatic thesaurus generation," Commun. Cogn. Artif. Intell. J., vol. 15, no. 1–2, pp. 81–111, 1998.
[41] D. G. Roussinov and H. Chen, "Document clustering for electronic meetings: An experimental comparison of two techniques," Decis. Support Syst., vol. 27, no. 1–2, pp. 67–79, 1999.
[42] M. Sahami, S. Yusufali, and Q. W. Baldonado, "SONIA: A service for organizing networked information autonomously," in Proc. 3rd ACM Int. Conf. Digital Libraries, Pittsburgh, PA, 1998, pp. 237–246.
[43] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval.
New York: McGraw-Hill, 1983. [44] W. S. Sarle. (2002). How is Generalization Possible? [Online]. Available: http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-1.html [45] S. Sestito and T. S. Dillon, Automated Knowledge Acquisition. Englewood Cliffs, NJ: Prentice-Hall, 1994. [46] B. Stein and S. M. Z. Eissen, “Document categorization with Major CLUST,” in Proc. of the 12th Workshop on Information Technologies and Systems, Barcelona, Spain, 2002, pp. 91–96. [47] A. Tiwana and B. Ramesh, “e-Services: Problems, opportunities, and digital platforms,” in Proc. 34th Annu. Hawaii Int. Conf. System Sciences, Maui, HI, p. 3018. [48] V. Vaishnavi and W. Kuechler, “Universal enterprise integration: Challenges and approaches to web-enabled virtual organizations,” Inf. Technol. Manage., vol. 6, no. 1, pp. 5–16, 2005. [49] C. Van Rijsbergen, Information Retrieval, 2nd ed. London, U.K.: Butterworth, 1979. [50] A. Vandenberg, J. Liang, B. Bolet, H. Kou, V. Vaishnavi, and D. Kuechler, “Research prototype: Semantic facilitator for LDAP Directory Services,”
642
[51]
[52] [53]
[54]
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART A: SYSTEMS AND HUMANS, VOL. 36, NO. 4, JULY 2006
in Proc. 12th Annu. Workshop on Information Technologies and Systems, Barcelona, Spain, 2002, p. 251. A. Vandenberg, V. Vaishnavi, and C. Shaw, “Promoting semantic interoperability of metatdata for directories of the future,” in Internet2 Fall Member Meeting, Indianapolis, IN, 2003 [Online]. Available: http://www. internet2.edu/presentations/fall-03/20031016-Middleware-Vandenberg.htm O. Zamir, O. Etzioni, O. Madani, and R. M. Karp, “Fast and intuitive clustering of web documents,” in Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining, Newport Beach, CA, 1997, pp. 287–290. H. Zhao, “Combining schema and instance information for integrating heterogeneous databases: An analytical approach and empirical evaluation,” Ph.D. dissertation, Manage. Inform. Syst., Univ. Arizona, Tucson, AZ, 2002. H. Zhao and S. Ram, “Clustering schema elements for semantic integration of heterogeneous data sources,” J. Database Manage., vol. 15, no. 4, pp. 89–106, 2004.
Jianghua Liang was born in Gaoan, Jiangxi Province, China, in 1973. He received the B.E. degree in mining engineering from the Southern Institute of Metallurgy, Ganzhou, Jiangxi Province, in 1995; the M.E. degree in mining engineering from the University of Science and Technology Beijing, Beijing, China, in 1998; the Ph.D. degree in geological engineering from the University of Arizona, Tucson, in 2001; and the M.S. degree in computer information systems from Georgia State University, Atlanta, in 2003. From 2001 to 2003, he was a Graduate Research Assistant with Advanced Campus Services, Georgia State University. After graduating from Georgia State University, he joined Concept WorldWide, Inc. He is currently a Senior Software Application Consultant with Lexmark International, Inc., Lexington, KY. Dr. Liang is a member of the Atlanta Graduate Business Association.
Vijay K. Vaishnavi (SM’89–F’01) received the B.E. degree (with distinction) in electrical engineering from the National Institute of Technology, Srinagar, India, in 1969 and the M.Tech. and Ph.D. degrees in electrical engineering (with a major in computer science) from the Indian Institute of Technology, Kanpur, India, in 1971 and 1976, respectively. He did postdoctoral work in computer science for two years at McMaster University, Hamilton, ON, Canada. He is currently the Board of Advisors Professor of Computer Information Systems at the Robinson College of Business and Professor of Computer Science at Georgia State University, Atlanta. His current areas of research interest include interorganizational systems (information integration, semantic interoperability, directory services, web-based virtual communities, coordination, process knowledge management, security), software development (project management, object-oriented metrics, software specifications and their maturity, object-oriented modeling and design), and data structures and algorithms (multisensor networks and fusion). The National Science Foundation and private organizations, including IBM, Nortel, and AT&T, have supported his research in these and related areas. He has authored numerous papers, which have appeared in ACM Computing Surveys, IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, IEEE TRANSACTIONS ON COMPUTERS, SIAM Journal on Computing, Journal of Algorithms, and several other major international journals and conference proceedings. He is a coauthor of Object Technology Centers of Excellence (Manning Publications/Prentice-Hall, 1996). Dr. Vaishnavi is a member of the IEEE Computer Society, the Association for Computing Machinery (ACM), and the Association for Information Systems (AIS). He has been an Associate Editor, Editor, Guest Editor, and Editorial Board Member of several journals, and he serves on the program committees of several conferences and workshops.
Art Vandenberg (M’04) was born in Grasonville, MD, in 1950. He received the B.A. degree in English literature from Swarthmore College, Swarthmore, PA, in 1972, the M.V.A. degree in painting and drawing from Georgia State University, Atlanta, in 1979, and the M.S. degree in information and computer systems from the Georgia Institute of Technology, Atlanta, in 1985. He has worked in library systems, research, and administrative computing since 1976, including 15 years in various information technology positions at the Georgia Institute of Technology. Since 1997, he has been with Information Systems and Technology, Georgia State University, Atlanta, where he is Director of Advanced Campus Services, charged with implementing middleware infrastructure and supporting research computing. In his first two years with Georgia State, he was responsible for administrative computing and led the campus-wide Y2K remediation project. His current middleware activities include participation in the NSF Middleware Initiative’s Integration Testbed, deploying directory and Grid components, and collaborating with faculty researchers on middleware research and Grid technology deployment. He is a Co-PI on a National Science Foundation Information Technology Research grant investigating a unique approach to resolving metadata heterogeneity. Mr. Vandenberg is a member of the Association for Computing Machinery and the IEEE Computer Society.