IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 2, MARCH 2005


Soft Learning Vector Quantization and Clustering Algorithms Based on Non-Euclidean Norms: Single-Norm Algorithms

Nicolaos B. Karayiannis, Senior Member, IEEE, and Mary M. Randolph-Gips

Abstract—This paper presents the development of soft clustering and learning vector quantization (LVQ) algorithms that rely on a weighted norm to measure the distance between the feature vectors and their prototypes. The development of LVQ and clustering algorithms is based on the minimization of a reformulation function under the constraint that the generalized mean of the norm weights be constant. According to the proposed formulation, the norm weights can be computed from the data in an iterative fashion together with the prototypes. An error analysis provides some guidelines for selecting the parameter involved in the definition of the generalized mean in terms of the feature variances. The algorithms produced from this formulation are easy to implement and they are almost as fast as clustering algorithms relying on the Euclidean norm. An experimental evaluation on four data sets indicates that the proposed algorithms consistently outperform clustering algorithms relying on the Euclidean norm and are strong competitors to non-Euclidean algorithms that are computationally more demanding.

Index Terms—Clustering, generator function, learning vector quantization (LVQ), non-Euclidean norm, reformulation, reformulation function, weight matrix, weighted norm.

Manuscript received January 5, 2002; revised July 2, 2002. N. B. Karayiannis is with the Department of Electrical and Computer Engineering, University of Houston, Houston, TX 77204 USA (e-mail: [email protected]). M. M. Randolph-Gips is with the Department of Computer Engineering, University of Houston-Clear Lake, Houston, TX 77058 USA. Digital Object Identifier 10.1109/TNN.2004.841778

I. INTRODUCTION

Consider a set of feature vectors in the n-dimensional space R^n. Clustering is the process of partitioning the feature vectors into c clusters. Vector quantization can be seen as a mapping from the n-dimensional space R^n into a finite set of prototypes. Clustering algorithms are typically developed to solve a constrained minimization problem involving two sets of unknowns, namely, the membership functions, which assign feature vectors to clusters, and the prototypes. The solution of such problems is often determined using alternating optimization [4], [14]. These clustering techniques include the fuzzy c-means [4], the generalized fuzzy c-means [23], and the entropy-constrained fuzzy clustering (ECFC) [18] algorithms. Cluster analysis often employs unsupervised learning procedures originally developed for vector quantization, which is a predictive learning problem [7]. As an example, clustering can be performed by unsupervised learning vector quantization (LVQ) algorithms, whose implementation

relies on a competitive neural network trained using gradient descent [5], [19]–[22], [24]–[31], [38]. The connection between LVQ and clustering was pointed out by Karayiannis and Bezdek [26], who identified a close relationship between fuzzy LVQ (FLVQ) and fuzzy -means (FCM) clustering algorithms. This relationship led to reformulation, which is a methodology developed to reduce an objective function treated by alternating optimization to a reformulation function that involves only one set of unknowns, namely the prototypes [20], [21]. It was subsequently shown that reformulation can be used instead of alternating optimization for the development of soft LVQ and clustering algorithms, which include FLVQ and clustering algorithms as special cases [22], [24], [25]. LVQ and clustering algorithms often rely on the Euclidean norm to measure the distance between the feature vectors and the prototypes. The use of the Euclidean norm presumes that the data are organized in hyperspherical clusters, which is not guaranteed in practical situations. This problem can be dealt with by employing norms other than the Euclidean. The development of non-Euclidean clustering algorithms was attempted by relying on measures other than the norm to determine the distance between the feature vectors and the prototypes [6], [13], [15], [17]. Although the use of such norms offers certain advantages, such as suppressing the effect of outliers in the data, the implementation of the resulting algorithms is not a simple task. The development of non-Euclidean clustering algorithms was also attempted by relying on weighted inner product norms to measure the distance between the feature vectors and the prototypes [12], [16], [35], [36]. One of the earliest approaches in this family was introduced by Gustafson and Kessel [12], who determined the weight matrices involved in the definition of the norms from the data during the clustering process. The weight matrices were determined in this approach by requiring that their determinants be fixed [16]. Thus, the Gustafson–Kessel (GK) algorithm is sensitively dependent on a set of free parameters involved in the constraints imposed on the determinants of the weight matrices [4]. Gath and Geva [11] proposed a fuzzy clustering algorithm that employs non-Euclidean distance norms and an unsupervised procedure for selecting the number of clusters. The Gath–Geva (GG) algorithm relies on an “exponential” distance measure based on maximum likelihood estimation [16]. According to this approach, the distance between each prototype and the feature vectors is computed in terms of the fuzzy covariance matrix of the corresponding cluster. The fuzzy covariance matrix is indeed the link between this approach and the GK algorithm, which computes the weight matrix of the norm



assigned to each prototype in terms of the fuzzy scatter matrix of the corresponding cluster. Note that the fuzzy covariance matrix of each cluster is a scaled version of its fuzzy scatter matrix [4]. Krishnapuram and Kim [35] pointed out that the GK algorithm is better suited for ellipsoidal clusters and it works better when the volumes of the clusters are roughly equal. In an attempt to develop alternatives to the GK algorithm, they indicated that the hard -means and fuzzy -means algorithms achieve their best performance in the presence of spherical clusters of roughly equal volumes [36]. This behavior was attributed to the fact that both algorithms are essentially developed by minimizing a trace criterion. As an alternative, they developed the minimum scatter volume algorithm and the minimum cluster volume algorithm by minimizing determinant measures that may be interpreted as volume criteria. Clustering algorithms were traditionally developed by solving a constrained minimization problem involving the prototypes and the membership functions. Reformulation reduces the development of LVQ and clustering algorithms into the unconstrained minimization of a function that is formed only in terms of the prototypes. According to this formulation, the membership functions are obtained as a by-product of the minimization process, while their form is determined by the specific aggregation operator used to form the reformulation function. By eliminating the membership functions as unknown parameters, reformulation allows the involvement of additional constraints in the minimization problem. This extra degree of freedom offered by reformulation can be exploited to address certain disadvantages of existing LVQ and clustering algorithms. As an example, according to the formulation proposed in this paper, the development of clustering and LVQ algorithms employing a weighted norm reduces to the selection of the constraint imposed on the norm weights. The formulation proposed in this paper considers a weighted norm defined in terms of a diagonal weight matrix and imposes a constraint on the generalized mean of its diagonal entries. This approach leads to easily implementable and computationally efficient clustering and LVQ algorithms.

II. SOFT LVQ AND CLUSTERING BASED ON A WEIGHTED NORM

The development of soft LVQ and clustering algorithms relying on the Euclidean norm can be accomplished by minimizing a reformulation function using gradient descent. According to this formulation, the prototypes are determined in terms of the feature vectors by solving the unconstrained minimization problem (1) [21], where R is an admissible reformulation function and V is the matrix whose column vectors are the prototypes. A function of the form (1) is an admissible reformulation function of the first (second) kind if f and g are differentiable everywhere and satisfy the admissibility conditions outlined in [21]; in particular, their derivatives are both monotonically decreasing (increasing) functions, together with an additional condition involving a monotonically increasing (decreasing) function.

The features that compose a feature vector are not equally suited for cluster separation [37]. The suitability of the features depends mainly on their variance. More specifically, the features with the smallest variance provide a more reliable basis for cluster formation. Clustering algorithms relying on the Euclidean norm place an equal importance on all the features that compose the feature vectors regardless of their variance. Therefore, the use of the Euclidean norm in LVQ and clustering is appropriate if the feature vectors are organized in hyperspherical clusters, which is rarely the case in practice. One way to remedy this problem is to perform LVQ or clustering in a new space obtained from the original feature space through a linear transformation.

Let x_i, 1 <= i <= M, be a set of feature vectors represented by the prototypes v_j, 1 <= j <= c. The distance between the feature vectors and the prototypes can be measured by the norm

\|x_i - v_j\|_{\mathbf{A}}^2 = (x_i - v_j)^T \mathbf{A} (x_i - v_j)    (2)

where A is a norm-inducing matrix that is required to be positive definite. Depending on the choice of A, (2) leads to popular distance measures such as the Mahalanobis and Euclidean norms. The distance between the feature vectors and the prototypes is very often measured by the Euclidean norm, which can be obtained as the special case of (2) that corresponds to A = I. Consider the new space generated by transforming each vector through a linear transformation, and suppose that the distance between the transformed feature vectors and prototypes is measured by the Euclidean norm (3). According to (3), the Euclidean norm on the new space corresponds to the weighted norm (4) on the original feature space. This implies that clustering the feature vectors in the original feature space based on the weighted norm (4) is equivalent to clustering the transformed feature vectors in the new space based on the Euclidean norm.

The development of soft LVQ and clustering algorithms relying on a weighted norm can be accomplished by minimizing an admissible reformulation function of the form (5). If W = I, then the weighted norm reduces to the Euclidean norm and (5) coincides with the reformulation function (1). If W is not the identity matrix, then the entries of the weight matrix W constitute a new set of free parameters that can be determined during the LVQ or clustering process in order to improve the quality of the resulting partitions. This can be accomplished by imposing certain constraints on the entries of the weight matrix W and by incorporating these constraints into the minimization problem used to determine W from the data.
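To make the role of the diagonal weight matrix concrete, the following sketch (in Python with NumPy; the function name and the example data are illustrative rather than taken from the paper) computes a weighted squared norm with diagonal weights and checks that it equals the Euclidean norm between feature vectors rescaled by the square roots of the weights, which is the equivalence described above.

```python
import numpy as np

def weighted_sq_norm(x, v, w):
    """Squared weighted norm ||x - v||_W^2 with diagonal weights w."""
    d = x - v
    return np.sum(w * d * d)

# Illustrative data (not from the paper).
rng = np.random.default_rng(0)
x = rng.normal(size=5)
v = rng.normal(size=5)
w = np.array([2.0, 0.5, 1.0, 4.0, 0.25])   # diagonal entries of W

# Clustering with the weighted norm is equivalent to clustering the
# vectors rescaled by sqrt(w) with the ordinary Euclidean norm.
x_t, v_t = np.sqrt(w) * x, np.sqrt(w) * v
euclidean_sq = np.sum((x_t - v_t) ** 2)

assert np.isclose(weighted_sq_norm(x, v, w), euclidean_sq)
```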


In this approach, it is assumed that W belongs to the set of all real diagonal n x n matrices. The weight matrix W = diag{w_1, w_2, ..., w_n} can be determined by requiring that its diagonal entries be positive and by imposing an additional constraint on the generalized mean of {w_1, w_2, ..., w_n}, defined as [8], [9], [34], [39]

\mathcal{M}_r(w_1, \ldots, w_n) = \left( \frac{1}{n} \sum_{\ell=1}^{n} w_\ell^{\,r} \right)^{1/r}    (6)

with r \neq 0. As r -> -infinity, the generalized mean (6) approaches the minimum of {w_1, ..., w_n}. For r = -1, the generalized mean (6) coincides with the harmonic mean of {w_1, ..., w_n}

\mathcal{M}_{-1}(w_1, \ldots, w_n) = n \left( \sum_{\ell=1}^{n} w_\ell^{-1} \right)^{-1}.    (7)

As r -> 0, the generalized mean (6) approaches the geometric mean of {w_1, ..., w_n}

\mathcal{M}_{0}(w_1, \ldots, w_n) = \left( \prod_{\ell=1}^{n} w_\ell \right)^{1/n}.    (8)

For r = 1, the generalized mean (6) coincides with the arithmetic mean of {w_1, ..., w_n}

\mathcal{M}_{1}(w_1, \ldots, w_n) = \frac{1}{n} \sum_{\ell=1}^{n} w_\ell.    (9)

As r -> +infinity, the generalized mean (6) approaches the maximum of {w_1, ..., w_n}.

The diagonal entries of the weight matrix can be constrained by requiring that their generalized mean be constant, that is, by imposing condition (10). The constant can be determined by requiring that condition (10) be compatible with the Euclidean norm, which corresponds to W = I. If W = I, then w_\ell = 1, 1 <= \ell <= n, and (10) is satisfied with the constant equal to 1. For a constant equal to 1 and r = 1, (10) takes the form \sum_{\ell=1}^{n} w_\ell = n. For r = 1, the constraint therefore guarantees that the trace of the weight matrix is constant, that is, trace(W) = n. A similar constraint was considered in [10] and [32]. For a constant equal to 1 and r -> 0, the generalized mean approaches the geometric mean and (10) becomes \prod_{\ell=1}^{n} w_\ell = 1. The condition \prod_{\ell=1}^{n} w_\ell = 1 resembles the constraints involved in the development of a simplified version of the GK algorithm relying on diagonal weight matrices [16] and was also considered in [33]. If W is a diagonal matrix, the constraint \prod_{\ell=1}^{n} w_\ell = 1 guarantees that its determinant is constant, that is, det(W) = 1. This observation establishes an additional link between this approach and the formulation proposed by Gustafson and Kessel [12], which also imposed constraints on the determinants of the weight matrices involved in the weighted norms used to measure the distance between the feature vectors and the prototypes.
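The following short sketch (Python/NumPy; illustrative, not taken from the paper) evaluates the generalized mean (6) for several exponents and confirms numerically that it recovers the harmonic, geometric, and arithmetic means of (7)-(9), and that a vector of unit weights satisfies the constraint (10) with the constant equal to 1.

```python
import numpy as np

def generalized_mean(w, r):
    """Generalized mean of positive weights w with exponent r; r -> 0 gives the geometric mean."""
    w = np.asarray(w, dtype=float)
    if abs(r) < 1e-12:
        return np.exp(np.mean(np.log(w)))
    return np.mean(w ** r) ** (1.0 / r)

w = np.array([0.5, 1.0, 2.0, 4.0])           # example diagonal entries of W

print(generalized_mean(w, -1.0))              # harmonic mean, eq. (7)
print(generalized_mean(w, 1e-9))              # ~ geometric mean, eq. (8)
print(generalized_mean(w, 1.0))               # arithmetic mean, eq. (9)

# Unit weights (the Euclidean norm) satisfy the constraint (10) for any r.
assert np.isclose(generalized_mean(np.ones(4), 0.5), 1.0)
```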

III. MINIMIZATION PROBLEM

Non-Euclidean soft LVQ and clustering algorithms can be developed by minimizing the reformulation function defined in (5), with the quantities involved expressed in terms of the weighted norm as in (11). This constrained minimization problem can be solved using alternating optimization. According to this optimization strategy, the prototypes can be determined by assuming that the weight matrix W is fixed and by using the gradient descent method to minimize (5) with respect to the prototypes. The weight matrix W can be determined by assuming that the prototypes are fixed and by using the method of Lagrange multipliers to perform the constrained minimization of (5) with respect to the norm weights. The update equations for the prototypes are obtained in Appendix A as (12), where the c_ij are the competition functions, defined in (13), and v_jl represents the lth entry of each prototype v_j.

The competitive learning scheme described by the update equation in (12) can be implemented in an iterative fashion. If the prototypes obtained after a given iteration are available, the new set of prototypes can be determined at the following iteration according to (12), with an appropriate learning rate, as (14). The competitive learning scheme described by (14) can be implemented as a batch clustering algorithm by constraining the learning rates in such a way that the new prototypes be computed only in terms of the feature vectors. According to (14), this condition is satisfied if the learning rates are chosen as in (15). If (15) is satisfied, then the update equation in (14) reduces to the "centroid" formula (16), which is the common ingredient of a variety of batch clustering algorithms developed using alternating optimization.


Note that the weight matrix W is not directly involved in the estimation of the prototypes, due to the requirement imposed for determining the learning rates. Nevertheless, W has an indirect effect on the estimation of the prototypes due to its involvement in the computation of the competition functions.

For r in (0, 1), the diagonal entries of W are obtained in Appendix B using the method of Lagrange multipliers as (17), where the quantity S_l appearing in (17) is defined in (18). In the limit as r -> 0, the weights are obtained in Appendix B as (19), which can be written in terms of the geometric mean of S_1, S_2, ..., S_n. For an appropriate value of r, the weights in (17) can also be written in terms of the harmonic mean of S_1, S_2, ..., S_n.
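As a concrete illustration of the closed-form weight update, the sketch below (Python/NumPy) is not a transcription of (17)-(19); it minimizes a weighted sum of per-feature scatter values under the generalized-mean constraint with exponent r using a Lagrange-multiplier argument, which yields weights proportional to S_l^(-1/(1-r)), and it matches the geometric-mean form quoted in the text for the limit r -> 0. The functional form and the rescaling step are assumptions made for the sketch.

```python
import numpy as np

def norm_weights(S, r):
    """Diagonal norm weights from per-feature scatter values S under the
    generalized-mean constraint with exponent r (a hedged sketch, not the
    paper's equation (17) verbatim)."""
    S = np.asarray(S, dtype=float)
    if abs(r) < 1e-12:                       # limit r -> 0: geometric-mean form of (19)
        return np.exp(np.mean(np.log(S))) / S
    w = S ** (-1.0 / (1.0 - r))              # shape of the constrained minimizer
    # Rescale so the generalized mean of the weights equals 1, as in (10).
    w /= np.mean(w ** r) ** (1.0 / r)
    return w

S = np.array([0.2, 1.0, 5.0, 0.8])            # illustrative scatter values
for r in (1e-9, 0.1, 0.5, 0.9):
    w = norm_weights(S, r)
    assert np.isclose(np.mean(w ** r) ** (1.0 / r), 1.0)   # constraint (10) holds
    print(r, w)
```

Note how features with large scatter receive small weights, and how the weights become increasingly uneven as r grows toward 1; this is the behavior analyzed in Section V.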

IV. NON-EUCLIDEAN LVQ AND CLUSTERING ALGORITHMS

The development of specific LVQ and clustering algorithms reduces to the selection of a family of functions that can be used to construct admissible reformulation functions of the form (5) [20], [21], [25]. A broad variety of reformulation functions can be constructed using functions defined in terms of a generator function. Reformulation functions of the first or the second kind can be constructed in terms of increasing or decreasing generator functions that satisfy certain admissibility conditions outlined in [21]. The increasing generator function produces the FCM and FLVQ algorithms. Depending on whether the generator function is increasing or decreasing, it corresponds to a reformulation function of the first (second) kind.

For the linear generator function, (13) gives the competition functions in closed form, and the reformulation function takes the form (20). Since the functions involved satisfy the required conditions, the competition functions can be obtained using (20) as (21). For fixed m, (21) gives the competition functions used to implement the batch FLVQ algorithm [5], [20], [21], [26], [38]. If the learning scheme produced by this formulation is implemented as an LVQ algorithm, the value of m is not fixed during learning. Decreasing values of m produce descending FLVQ algorithms [38]. In such a case, codebook design can be seen as a deterministic annealing process, with m playing the role of the system temperature. In practice, m is often allowed to decrease linearly from an initial value to a final value in a predetermined number of iterations [38]. The non-Euclidean fuzzy LVQ (NEFLVQ) algorithm produced by the generator function is summarized in Table I.

For a fixed value of m, the update scheme produced by reformulation can be implemented as a clustering algorithm. In such a case, the membership functions used to assign feature vectors to clusters can be obtained from the competition functions [20], [21], [26]. The membership functions corresponding to the linear generator function can be obtained from (21) as (22). According to its definition in (18), S_l is proportional to the lth diagonal entry of the fuzzy in-cluster scatter matrix, defined as in [4] by (23). The fuzzy in-cluster scatter matrix defined in (23) can also be written in terms of the fuzzy scatter matrices of the individual clusters. This establishes a link between the proposed algorithm and the GK algorithm, which also employs the fuzzy scatter matrix of each cluster to compute the distance between the corresponding prototype and the feature vectors. Note also that the fuzzy scatter matrix can be used to obtain the fuzzy covariance matrix of each cluster. This establishes a link between the proposed algorithm and the GG algorithm, which employs the fuzzy covariance matrix. For W = I, (22) gives the membership functions of the FCM algorithm [4]. This algorithm relies on the Euclidean norm to produce fuzzy c-partitions of the feature vectors, that is, partitions that satisfy the condition

\sum_{j=1}^{c} u_{ij} = 1, \quad 1 \le i \le M.    (24)

It can be verified that the membership functions obtained from (22) for any admissible weight matrix W also satisfy condition (24). Thus, the corresponding clustering algorithm also produces fuzzy c-partitions of the feature vectors, but it relies on the weighted norm to measure the distance between the feature vectors and the prototypes. The non-Euclidean fuzzy c-means (NEFCM) algorithm produced by the generator function is summarized in Table I.

TABLE I NON-EUCLIDEAN FLVQ AND FCM ALGORITHMS
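To make the overall flow of the single-norm NEFCM iteration tangible, here is a compact sketch in Python/NumPy. It is an illustration assembled from the description above rather than a transcription of Table I: the membership update uses the familiar FCM form (to which (22) reduces for W = I), here evaluated with the weighted distances; the prototype update uses the centroid formula (16); the per-feature scatter values follow the description of (18) and (23); and the weight update reuses the hedged closed form sketched earlier.

```python
import numpy as np

def nefcm(X, c, m=2.0, r=0.5, n_iter=100, seed=0):
    """Sketch of a single-norm NEFCM-style iteration (illustrative, not Table I verbatim)."""
    rng = np.random.default_rng(seed)
    M, n = X.shape
    V = X[rng.choice(M, c, replace=False)]           # initial prototypes
    w = np.ones(n)                                   # start from the Euclidean norm

    for _ in range(n_iter):
        # Weighted squared distances d[i, j] = sum_l w_l (x_il - v_jl)^2
        diff = X[:, None, :] - V[None, :, :]
        d = np.maximum((diff ** 2 * w).sum(axis=2), 1e-12)

        # FCM-form memberships (fuzzy c-partition: each row sums to one)
        u = d ** (-1.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)

        # Centroid formula: prototypes as weighted means of the data
        um = u ** m
        V = (um.T @ X) / um.sum(axis=0)[:, None]

        # Per-feature fuzzy in-cluster scatter values S_l
        diff = X[:, None, :] - V[None, :, :]
        S = (um[:, :, None] * diff ** 2).sum(axis=(0, 1))

        # Norm weights under the generalized-mean constraint (hedged closed form)
        if r < 1e-12:
            w = np.exp(np.mean(np.log(S))) / S
        else:
            w = S ** (-1.0 / (1.0 - r))
            w /= np.mean(w ** r) ** (1.0 / r)

    return V, w, u

# Example usage on synthetic elongated clusters (illustrative data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], [3.0, 0.3], (100, 2)),
               rng.normal([8, 1], [3.0, 0.3], (100, 2))])
V, w, u = nefcm(X, c=2, m=2.0, r=0.5)
print(V, w)
```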

V. ERROR ANALYSIS

The analysis presented in this section provides the basis for establishing a criterion that can be used to select the value of r.

Consider the linear generator function. For this generator function, the reformulation function (5) takes the form (25), where the quantities involved are defined in terms of the weighted distances between the feature vectors and the prototypes. Using this definition, (25) can be written as (26). According to (20), the reformulation function corresponding to the linear generator function can also be written in terms of its corresponding competition functions as (27). If the competition functions, the prototypes, and the weight matrix are determined at each iteration as indicated in the previous sections, then the value of (27) represents the total residual error after this iteration. The total residual error can therefore be obtained at each iteration as a sum of per-feature terms, as in (28), where each term is formed from the corresponding norm weight and the quantity S_l defined in (18). According to (28), each term represents the contribution of the lth feature to the total residual error. Thus, it is referred to as the residual error due to the lth feature. The analysis which follows investigates the impact of the parameter r on the contribution of different features to the total residual error.

For r in (0, 1), the residual error due to the lth feature can be obtained using (17) as (29), which involves the generalized mean of S_1, S_2, ..., S_n. As r -> 0, (29) gives residual errors that are the same for all features. Since the residual errors are all equal, all features have a uniform effect on the total residual error. Unlike the case where r -> 0, the residual errors are not equal for r in (0, 1). In fact, (29) also indicates that the residual error due to the lth feature is a decreasing function of S_l. Thus, increasing the value of r from 0 to 1 reduces the effect of the features corresponding to the largest among the values S_1, S_2, ..., S_n on the total residual error. This implies that as r increases from 0 to 1 the features corresponding to the largest among the values S_1, S_2, ..., S_n have a progressively diminishing impact on the partition produced by the corresponding LVQ or clustering algorithms. This is in direct contrast with the effect of different features on the partitions produced by LVQ and clustering algorithms relying on the Euclidean norm, which correspond to W = I. For such algorithms, the most significant contribution to the total residual error comes from the features corresponding to the largest among the values S_1, S_2, ..., S_n.

For r in (0, 1), the total residual error can be obtained using (29) as (30). Note also that the residual error corresponding to the Euclidean norm can be obtained from the arithmetic mean of S_1, S_2, ..., S_n. The effect of the value of r on the residual error can be revealed by the behavior of the total residual error as r spans the interval (0, 1]: as r increases from 0 to 1, the total residual error decreases. Since the generalized mean is an increasing function of its exponent [9], the residual errors corresponding to different values of r in (0, 1) and the residual error corresponding to the Euclidean norm satisfy the inequality (31), with the equalities holding if the values S_1, S_2, ..., S_n are all equal.

According to the previous analysis, selecting the value of r involves a tradeoff between reducing the total residual error and balancing the effect of the features on the total residual error. A value of r close to 0 tends to balance the contribution of residual errors due to different features to the total residual error. On the other hand, a value of r close to 0 tends to increase the value of the total residual error. The total residual error can be reduced


by increasing the value of r. However, the effect of different features on the residual error becomes increasingly nonuniform as the value of r increases from 0 to 1. Since the residual error due to the lth feature is a decreasing function of S_l for r in (0, 1), the contribution of the residual error due to the lth feature to the total residual error decreases as S_l increases. If there are significant differences among the values S_1, S_2, ..., S_n corresponding to different features, increasing the value of r from 0 to 1 would progressively diminish the role of the features corresponding to the largest among the values S_1, S_2, ..., S_n. This can be prevented by selecting a value of r close to 0, which would equalize the effect of all features on the total residual error. If the values S_1, S_2, ..., S_n are similar, then all features have an approximately uniform effect on the total residual error. In such a case, the value of r can be selected to reduce the value of the total residual error. According to the previous analysis, this can be accomplished if r approaches 1.

The previous analysis indicated that the selection of the value of r for a given data set depends rather critically on the relative sizes of S_1, S_2, ..., S_n. Nevertheless, the values of S_1, S_2, ..., S_n are not readily available before the application of the LVQ or clustering algorithm. The definition of S_l in (18) indicates that there is a relationship between S_l and the variance of the lth feature. For example, consider the case where the feature vectors are all represented by a single prototype, which is obtained as their centroid (or mean). In such a case, S_l is proportional to the variance of the lth feature computed over the entire feature set. This indicates that S_l can be roughly estimated in practice from the corresponding feature variance. Note that such an estimation may not be reliable for certain data sets, while the accuracy of the estimate reduces as the number of clusters increases above 1.

The value of r can be selected in practice by relying on the variances of the features computed over the entire feature set. The range of these variances can be quantified by computing the ratio of the largest to the smallest feature variance. A value of this ratio close to 1 indicates that the variances corresponding to different features are similar. This would allow the reduction of the total residual error by selecting a value of r close to 1. The result of such a choice is that the features with the smallest variances would have a more significant impact on the LVQ or clustering process. This property of the proposed algorithms is highly desirable, given that the features with the smallest variances provide the most reliable basis for cluster separation. A value of the ratio considerably higher than 1 indicates that there are significant differences among the variances corresponding to different features. In such a case, the effect of different features on the total residual error can be equalized by selecting a value of r close to 0. Such a choice would prevent the features with the largest variances from dominating all the rest. This is another desirable property of the proposed algorithms, given that the features with the largest variances provide the least reliable basis for cluster separation.
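The guideline above lends itself to a simple preprocessing step. The sketch below (Python/NumPy; the thresholds and the specific mapping from the variance ratio to r are illustrative choices, not values given in the paper) computes the ratio of the largest to the smallest feature variance and suggests a value of r near 1 when the variances are comparable and near 0 when they differ widely.

```python
import numpy as np

def suggest_r(X, small_ratio=2.0, large_ratio=10.0):
    """Heuristic choice of r from the per-feature variances (illustrative thresholds)."""
    var = X.var(axis=0)                      # feature variances over the whole data set
    ratio = var.max() / var.min()
    if ratio <= small_ratio:                 # similar variances: reduce the residual error
        return 1.0, ratio
    if ratio >= large_ratio:                 # very dissimilar variances: equalize the features
        return 0.0, ratio
    # In between, interpolate on a log scale (an arbitrary but monotone choice).
    t = (np.log(ratio) - np.log(small_ratio)) / (np.log(large_ratio) - np.log(small_ratio))
    return float(1.0 - t), ratio

X = np.random.default_rng(2).normal(scale=[1.0, 1.1, 5.0], size=(200, 3))
r, ratio = suggest_r(X)
print(ratio, r)
```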

VI. EXPERIMENTAL RESULTS

This section presents an evaluation of the proposed NEFCM algorithm on four data sets, which differ in terms of the data structure and the dimensionality of the feature vectors. The proposed NEFCM algorithm produced by the linear generator function was compared with the FCM algorithm, which is produced by the same generator function but employs the Euclidean norm as a distance measure. The FCM algorithm was chosen to represent fuzzy clustering algorithms relying on the Euclidean norm due to its well-known properties and performance. The proposed algorithm was also compared with the GK algorithm, which was developed by assigning different weighted norms to the prototypes. The GK algorithm was tested in all the experiments by requiring that the determinants of all weight matrices corresponding to the prototypes be equal. Finally, the proposed algorithm was compared with the GG algorithm, which was tested with a predetermined number of clusters. All four algorithms were tested in the experiments with various values of m, and their performance was evaluated based on the number of crisp clustering errors, i.e., the number of feature vectors that are assigned to a wrong physical cluster by terminal nearest-prototype partitions of the data. The distribution of the clustering errors among the feature vectors belonging to different physical classes is summarized by the confusion matrix. Each entry of the confusion matrix represents the number of feature vectors from a given physical class assigned to the cluster represented by a given prototype. In addition to the number and distribution of the clustering errors, the evaluation of clustering algorithms essentially involves their implementation simplicity, robustness, and computational complexity [2], [3].

A. ROCK Data

The ROCK data set was produced by an experiment designed to simulate a martian subsurface ecosystem in an attempt to explore possible biogenic features of the martian meteorite ALH84001. Water from a deep basaltic aquifer was used to cultivate unweathered basaltic rock. The resulting mineralized features were measured with an electron microscope for the longest and shortest dimensions. The results can be divided into two classes, known bacteria and filaments, which may or may not be biogenic in origin. Fig. 1 shows the distribution of the feature vectors that represent the physical classes "filament" and "bacteria" in the two-dimensional (2-D) feature space defined by the features "length of crystal" (first feature, corresponding to the horizontal axis) and "width of crystal" (second feature, corresponding to the vertical axis). The variances of the two features and the ratio between them were computed over the entire data set.

The feature vectors from the ROCK data set were clustered in two clusters by the FCM algorithm, the GK algorithm, the GG algorithm, and the proposed single-norm NEFCM algorithm tested with various values of r. The algorithms were tested in 100 trials, each initialized by a different set of randomly generated prototypes. A trial was considered successful if the majority of the feature vectors belonging to a certain physical class were represented by the same prototype, called the majority prototype for this class, while each physical class had a different majority prototype. Table II shows the average number of clustering errors produced in the successful trials by the FCM, the GK, the GG, and the single-norm NEFCM algorithms, together with the percentage of failed trials. The FCM algorithm failed


Fig. 1. Boundary between the two clusters and feature vectors equidistant from the prototypes v1 and v2 produced for the ROCK data by: (a) the FCM algorithm, (b) the GK algorithm, and the single-norm NEFCM algorithm tested with (c) r = 0, (d) r = 0.01, (e) r = 0.05, (f) r = 0.1, (g) r = 0.5, and (h) r = 1.

in almost all trials, which is not surprising given the nature of the ROCK data set. In fact, the distribution of the feature vectors in the feature space indicates that this data set would be particularly challenging for most, if not all, clustering algorithms relying on the Euclidean norm. The GG algorithm also failed in all trials when tested on this data set. The GK algorithm exhibited the best performance among all algorithms tested on the ROCK data. According to Table II, the number of clustering errors produced by the proposed NEFCM algorithm was mainly affected by the value of r, but the number of successful trials increased consistently as the value of m increased from 1.1 to 4. The performance of the NEFCM algorithm improved consistently as the value of r increased from 0 to 1. Thus, the results shown in Table II establish a relationship between the values of r that produce the best partitions and the ratio of feature variances corresponding to this data set. More specifically, the experiments indicated that the ratio is small enough to justify the selection of a value of r close to 1.

Fig. 1 illustrates the behavior of the algorithms tested in these experiments on the ROCK data. Fig. 1(a) and (b) shows the boundary between the two clusters and the feature vectors that are equidistant from the two prototypes v1 and v2 produced by


TABLE II PERFORMANCE OF THE FCM ALGORITHM, THE GK ALGORITHM, THE GG ALGORITHM, AND THE SINGLE-NORM NEFCM ALGORITHM TESTED WITH VARIOUS VALUES OF r ON THE ROCK DATA: NUMBER OF CLUSTERING ERRORS RECORDED ON AVERAGE IN 100 TRIALS AND PERCENTAGE OF FAILED TRIALS (SHOWN IN PARENTHESIS)

the FCM and the GK algorithms, respectively. The same information is shown in Fig. 1(c)-(h) for the proposed single-norm NEFCM algorithm tested with various values of r. The feature vectors equidistant from each of the prototypes produced by the FCM algorithm belong to circles, due to the use of the Euclidean norm, while the boundary between the two clusters is a straight line. Fig. 1(a) reveals that the partition of the feature space produced by the FCM algorithm is not consistent with the distribution of the data, due to the fact that the algorithm attempts to create "circular" clusters in a data set that contains no such clusters. The feature vectors that are equidistant from each of the two prototypes produced by the GK algorithm belong to ellipses whose main directions are consistent with the distribution of the data in the feature space. In this case, the boundary between the clusters is a conic. This is consistent with the fact that the GK algorithm assigns a different weighted norm to each of the two prototypes. The feature vectors that are equidistant from each of the prototypes v1 and v2 produced by the single-norm NEFCM algorithm belong to ellipses, while the boundaries between the clusters are straight lines. In this case, the ratio of the axes along the main directions of the ellipses corresponding to both prototypes is the same. These ratios were produced by the algorithm as the best compromise in its attempt to capture the structure of the data. The value of r affected the slope of the boundary between the classes, as is evident from Fig. 1(c)-(h). As r increased above 0 and approached 1, the cluster boundaries became almost parallel to the horizontal axis. As a result, increasing the value of r from 0 to 1 reduced the number of clustering errors because the proposed algorithm was able to handle the feature vectors from the physical class "filament" with values of the first feature ("length of crystal") above 2500 nm.

B. IRIS Data

The IRIS data set contains 150 feature vectors of dimension 4, which belong to three physical classes representing different IRIS subspecies [1]. Each class contains 50 feature vectors. One of the three classes is well separated from the other two, which are not easily separable due to the overlapping of their convex hulls. The reliability of the four features of the IRIS data set is revealed by their variances. The third feature has the largest variance, while the ratio between the largest and smallest variances is comparable to that of the ROCK data set.


TABLE III PERFORMANCE OF THE FCM ALGORITHM, THE GK ALGORITHM, THE GG ALGORITHM, AND THE SINGLE-NORM NEFCM ALGORITHM TESTED WITH VARIOUS VALUES OF r ON THE IRIS DATA: NUMBER OF CLUSTERING ERRORS RECORDED ON AVERAGE IN 100 TRIALS AND PERCENTAGE OF FAILED TRIALS (SHOWN IN PARENTHESIS)

The feature vectors from the IRIS data set were clustered in three clusters by the FCM algorithm, the GK algorithm, the GG algorithm, and the proposed NEFCM algorithm tested with various values of r. The algorithms were tested in 100 trials, each initialized by the same randomly generated set of prototypes. Table III shows the number of clustering errors produced on average in the successful among 100 trials and the percentage of failed trials. On average, the number of clustering errors produced by the FCM algorithm varied from 14 to 17. This range of clustering errors is typical for a broad variety of fuzzy and soft LVQ and clustering algorithms relying on the Euclidean norm [5], [20], [23], [27]. The number of clustering errors produced by the GK algorithm varied from 7 to 28, with the minimum achieved at a single value of m. The GG algorithm produced approximately ten clustering errors regardless of the value of m. The proposed NEFCM algorithm outperformed the FCM, GK, and GG algorithms on the IRIS data set. The performance of the proposed algorithm generally improved as the value of r increased from 0 to 1; the NEFCM algorithm achieved its best performance for values of r close to 1. Table IV shows the prototypes, the confusion matrices, and the diagonal entries of the weight matrices produced by the NEFCM algorithm tested on the IRIS data with a fixed value of m and various values of r. These results can be compared with those of the FCM, the GK, and the GG algorithms, which were tested with the same initialization. It is clear from Table IV that the FCM, the GK, the GG, and the proposed algorithms produced very similar prototypes, especially for values of r lower than 1. However, despite the similarity of the prototypes, nearest-prototype partition of the feature vectors resulted in fewer clustering errors for the proposed algorithms. Thus, the differences in the performance of the FCM, the GK, the GG, and the proposed algorithm can only be attributed to the norm they employ to measure the distance between the feature vectors and the prototypes. The behavior of the proposed NEFCM algorithm can be analyzed even further by comparing the relative sizes of the weights assigned by the algorithm to different features. Regardless of the value of r, the two largest weights were assigned by the algorithm to the second and fourth features, which provide the most reliable basis for cluster separation since they correspond to the smallest variances. However, the largest weight was assigned consistently to the fourth feature, despite the fact that the second feature has the smallest variance.


TABLE IV FINAL PROTOTYPES v_j, 1 <= j <= 3, CONFUSION MATRICES C = [c_ij], AND DIAGONAL ENTRIES OF THE WEIGHT MATRIX W = diag{w_1, w_2, w_3, w_4} PRODUCED FOR THE IRIS DATA BY THE FCM ALGORITHM, THE GK ALGORITHM, THE GG ALGORITHM, AND THE SINGLE-NORM NEFCM ALGORITHM TESTED WITH VARIOUS VALUES OF r


TABLE V PERFORMANCE OF THE FCM ALGORITHM, THE GK ALGORITHM, THE GG ALGORITHM, AND THE SINGLE-NORM NEFCM ALGORITHM TESTED WITH VARIOUS VALUES OF r ON THE THYROID DATA: NUMBER OF CLUSTERING ERRORS RECORDED ON AVERAGE IN 100 TRIALS AND PERCENTAGE OF FAILED TRIALS (SHOWN IN PARENTHESIS)

TABLE VI FINAL PROTOTYPES v_j, 1 <= j <= 3, CONFUSION MATRICES C = [c_ij], AND DIAGONAL ENTRIES OF THE WEIGHT MATRIX W = diag{w_1, ..., w_5} PRODUCED FOR THE THYROID DATA BY THE FCM ALGORITHM, THE GK ALGORITHM, THE GG ALGORITHM, AND THE SINGLE-NORM NEFCM ALGORITHM TESTED WITH VARIOUS VALUES OF r

This experimental outcome is consistent with the fact that the weights of the norm are the product of a minimization process and, as such, they do not depend exclusively on the statistics of the individual features. Finally, Tables III and IV indicate that the proposed NEFCM algorithm achieved its best performance for values of r close to 1, which was also the case when the algorithm was tested on the ROCK data. Given that both the IRIS and the ROCK data sets correspond to comparable values of the variance ratio, this experimental outcome verifies the relationship between the variance ratio and the value of r that was revealed by the experiments on the ROCK data.

C. THYROID Data

The THYROID data set was acquired from the University of California at Irvine web site. This data set consists of five continuous-valued features belonging to three physical classes. This data set resulted from five laboratory tests performed to determine if a patient has hypothyroidism, hyperthyroidism, or euthyroidism (i.e., normal thyroid function). The five tests were: 1) T3-resin uptake test (percentage); 2) total serum thyroxine as measured by the isotopic displacement method; 3) total serum triiodothyronine as measured by radioimmuno assay; 4) basal thyroid-stimulating hormone (TSH) as measured by radioimmuno assay; and 5) maximal absolute difference of TSH value after injection of 200 micrograms of thyrotropin-releasing hormone as compared to the basal value. There were 150 patients with normal thyroid, 35 with hyperthyroidism, and 30 with hypothyroidism. The reliability of the five features of the THYROID data set can be rated based on their variances.

According to the variances of the features, the third (first) feature provides the most (least) reliable basis for cluster separation. Moreover, the variances of the features span a wide range, which can be quantified by the ratio of the largest to the smallest feature variance. The feature vectors of the THYROID data set were clustered by the FCM algorithm, the GK algorithm, the GG algorithm, and the proposed NEFCM algorithm tested with various values of r. Table V shows the average number of clustering errors computed over the successful among 100 trials and the percentage of failed trials. The algorithms were initialized in each of the 100 trials by the same randomly generated set of prototypes. Table VI shows the prototypes, the confusion matrices, and the diagonal entries of the weight matrices produced for the THYROID data by the FCM, the GK, the GG, and the


proposed NEFCM algorithms. According to Table V, the proposed algorithm was rather sensitive to the value of r when tested on the THYROID data set. In fact, the proposed algorithm achieved its best performance for values of r close to 0, but its performance degraded considerably as the value of r increased above 0.5 and approached 1. Compared with the proposed algorithm, the FCM algorithm resulted in lower failure rates for values of m between 1.1 and 2. However, the proposed algorithm produced a considerably lower number of clustering errors in the successful trials. The GK algorithm produced fewer clustering errors than the FCM and the proposed NEFCM algorithms for values of m below 2. For larger values of m, the GK algorithm produced a larger number of clustering errors than those produced by the proposed NEFCM algorithm; however, the clustering errors produced by the GK algorithm were averaged over a higher number of successful trials. The GG algorithm produced approximately 28 clustering errors regardless of the value of m. This algorithm also led to the largest number of successful trials compared with all the other algorithms tested on this data set. On the other hand, the proposed algorithm produced approximately 23 clustering errors on average when tested with values of r close to 0.

Table VI also reveals a clear relationship between the variances of individual features and the weights assigned to them by the proposed algorithm. Regardless of the value of r, the weights produced by the proposed algorithm can be ordered according to the variances of the individual features, with the largest weight assigned to the feature with the smallest variance (third feature) and the smallest weight assigned to the feature with the largest variance (first feature). For this particular data set, increasing the value of r reduced considerably the size of the weights assigned to the features corresponding to large variances, which provide the least reliable basis for cluster separation. As a result, increasing the value of r almost diminished the effect of these features on the clustering process, with a negative impact on the quality of the resulting partitions. The results produced by the proposed NEFCM algorithm on the THYROID data are consistent with the analysis, which indicated that values of r close to 0 are more appropriate when there are significant differences among the variances of individual features. This is indeed the case for the THYROID data set, as indicated by the value of the variance ratio. In fact, the selection of a value of r close to 0 is justified by the fact that the variance ratio corresponding to the THYROID data is approximately six times higher than the variance ratios corresponding to the ROCK and IRIS data, respectively.

TABLE VII PERFORMANCE OF THE FCM ALGORITHM, THE GK ALGORITHM, THE GG ALGORITHM, AND THE SINGLE-NORM NEFCM ALGORITHM TESTED WITH VARIOUS VALUES OF r ON THE WINE DATA: NUMBER OF CLUSTERING ERRORS RECORDED ON AVERAGE IN 100 TRIALS AND PERCENTAGE OF FAILED TRIALS (SHOWN IN PARENTHESIS)

D. WINE Data The WINE data set was acquired from the University of California at Irvine web site. This data set consists of 13 continuous-valued features belonging to three physical classes. This data set was obtained by chemical analysis of wine produced by three different cultivators from the same region of Italy. The features are: alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline. This data set contains 178 feature vectors, with 59 in class 1, 71 in class 2, and 48 in class 3.

For this particular data set, the variances of the features span a wide range of values and correspond to a large ratio of the largest to the smallest feature variance. With the exception of a few features, the variances of the features take values below 1. The distribution of the feature variances indicates that the variance corresponding to the thirteenth feature is clearly an outlier, which implies that the corresponding feature is not suited for cluster separation. Nevertheless, this feature was kept in the data in order to evaluate the ability of the proposed NEFCM algorithm to handle such extreme situations. The feature vectors from the WINE data set were clustered by the FCM algorithm, the GK algorithm, the GG algorithm, and the proposed single-norm NEFCM algorithm tested with various values of r. All algorithms tested in the experiments were initialized using the same set of randomly generated prototypes. Table VII shows the number of clustering errors produced on average by the algorithms in the successful out of 100 trials and the percentage of failed trials. On average, the FCM algorithm clustered incorrectly about 55 of the 178 feature vectors included in the WINE data set when tested with values of m between 1.5 and 4. The GK algorithm produced a smaller number of clustering errors than the FCM algorithm, but it failed in more trials. The failure of the GK algorithm on this data set can be attributed to the requirement that the determinants of all weight matrices be equal. This requirement makes the algorithm less effective when the data set contains unequal hyperellipsoidal clusters. Once again, the GK algorithm achieved its best performance for values of m below 2. The performance of the GG algorithm was close to that of the FCM algorithm in terms of the number of clustering errors. Compared with the GK algorithm, both the FCM and GG algorithms led to a larger number of successful trials. According to Table VII, the proposed NEFCM algorithm performed considerably better than the FCM, the GK, and the GG algorithms for values of m and r in a certain range. For values of m between 1.5 and 2, the number of clustering errors produced by the proposed NEFCM algorithm increased from 9 to 13 as the value of r increased from 0 to 0.1. The performance of the NEFCM algorithm degraded as the value of r increased above 0.1, while the algorithm failed to cluster the WINE data for values of r above 0.5. This experimental outcome is consistent with the analysis, which indicated that the value of r must be as close to 0 as possible since the


value of the variance ratio corresponding to this data set is much higher than 1.

VII. CONCLUSION

This paper proposed a new approach to the development of LVQ and clustering algorithms based on a weighted norm; this approach relied on recent advances in LVQ and clustering made possible by reformulation. The inherent advantages of reformulation were exploited in this approach by formulating LVQ and clustering as the minimization of a reformulation function involving an adjustable weighted norm under an equality constraint imposed on the norm weights. This approach imposed a constraint on the generalized mean of the norm weights, which led to novel clustering and LVQ algorithms that can be implemented as the NEFCM and NEFLVQ algorithms. The implementation complexity and computational requirements of the proposed algorithms are comparable with those of the FCM and FLVQ algorithms relying on the Euclidean norm. The extra effort associated with the implementation of the proposed algorithms is the computation of the diagonal entries of the weight matrix, which are obtained from closed-form expressions that involve no matrix inversion. The error analysis outlined in this paper established a relationship between the constraint imposed on the generalized mean of the norm weights and the sample variances of the individual features.

The experiments on all four data sets indicated that the proposed NEFCM algorithm outperformed the FCM algorithm, which is produced by the same generator function but relies on the Euclidean norm to measure the distance between the feature vectors and the prototypes. Thus, this experimental outcome can only be attributed to the use of an adjustable weighted norm, which allows the proposed algorithms to better capture the structure of the data by forming hyperellipsoidal clusters. The GK algorithm was a stronger competitor to the proposed NEFCM algorithm than the FCM, especially when tested with values of m below 2. Nevertheless, the performance of the GK algorithm on various data sets lacked consistency. The significant fluctuations observed in the performance of the GK algorithm reveal its strong dependence on the free parameters that must be selected by the user to impose constraints on the determinants of the weight matrices. The performance of the GG algorithm was not significantly affected by the value of m. However, its performance on various data sets was not consistent. For example, the GG algorithm performed poorly on the ROCK and WINE data sets despite its very satisfactory performance on the IRIS and THYROID data. It must also be noted here that both the GK and GG algorithms are computationally more demanding than the proposed NEFCM algorithm. This is due to the definition of the weighted norms employed by these algorithms in terms of nondiagonal weight matrices. As a result of this formulation, the computation of the distinct weight matrices assigned to the prototypes requires the inversion of matrices during each iteration of the GK and GG algorithms. In contrast, the diagonal entries of the single weight matrix involved in the implementation of the NEFCM algorithm can be computed through closed-form formulas that require no matrix inversion. The NEFCM algorithm took only


slightly longer than the FCM to complete cluster formation. The GK algorithm took approximately five times as long as the FCM, while the GG took at least one order of magnitude longer than the FCM. Note that the computational requirements of the GK and GG algorithms can be reduced by employing diagonal weight matrices, according to the approach proposed in [16].

The constraint imposed by this approach on the norm weights relied on the generalized mean, which is perhaps the most popular aggregation operator. An apparent extension of the formulation proposed in this paper is to replace the generalized mean by alternative aggregation operators. Another possible extension of the proposed formulation is the development of hard clustering algorithms relying on a non-Euclidean weighted norm. The formulation proposed in this paper can also be extended to produce LVQ and clustering algorithms that employ a different weighted norm to measure the distance between each prototype and the feature vectors. The potential of multinorm algorithms was revealed by some experimental results obtained by testing single-norm and multinorm clustering algorithms on the THYROID data set [30]. The proposed formulation may also be beneficial to the construction of supervised classifiers implemented as trainable neural network models. In fact, non-Euclidean norms may be used as distance measures for radial basis function (RBF) neural networks [29]. The potential of such an evolution in the structure of RBF models is also supported by recent theoretical developments [20], which indicated that RBF neural networks and LVQ models are essentially the product of the same nonlinear mapping.

APPENDIX A
UPDATE EQUATIONS FOR THE PROTOTYPES

The update equations for the prototypes can be derived by using gradient descent to minimize the reformulation function (5), which can be rewritten as (A1).

The gradient of (A1) with respect to the prototype v_j can be obtained as (A2). From (A2) one obtains (A3), where c_ij are the competition functions, defined as (A4). The update equation for the entries of each prototype can be obtained according to the gradient descent method, which yields (A5), where a separate learning rate is associated with the lth entry of the prototype v_j. The update equations in (A5) can also be written as (A6).

APPENDIX B
DETERMINING THE WEIGHT MATRIX

For r in (0, 1), the weight matrix W can be determined by assuming that the prototypes are fixed and minimizing the reformulation function subject to the constraint (10). Using the method of Lagrange multipliers, this constrained minimization problem can be converted into the unconstrained minimization of (B1), where one term is the reformulation function defined in (5) and the other involves the Lagrange multiplier. This problem can be solved by eliminating the Lagrange multiplier based on the conditions (B2) and (B3). The partial derivative of the reformulation function with respect to each norm weight can be obtained using (5) as (B4), where c_ij are the competition functions. In turn, (B4) can be written as (B5), with the quantity involved defined in (B6). For r in (0, 1), the diagonal entries of W can be determined by eliminating the Lagrange multiplier between the conditions (B2) and (B3) as (B7).

The weight matrix corresponding to the constraint obtained in the limit r -> 0 can be obtained by using the method of Lagrange multipliers, which leads to the minimization of (B8). The weights corresponding to this constraint can also be obtained from (B7) in the limit r -> 0. According to (B7), the weights can be written as (B9). Since the weights must also satisfy the constraint, (B10) holds as well. Combining (B9) and (B10) gives (B11).

ACKNOWLEDGMENT

The authors would like to thank S. Wentworth of Lockheed Space Operations for providing the ROCK data set. The THYROID and WINE data sets were acquired from the University of California at Irvine web site at http://www.ics.uci.edu/~mlearn/MLRepository.html.

REFERENCES

[1] E. Anderson, “The IRISes of the Gaspe peninsula,” Bull. Amer. IRIS Soc., vol. 59, pp. 2–5, 1939. [2] A. Baraldi and E. Alpaydm, “Constructive feedforward ART clustering networks–Part I,” IEEE Trans. Neural Netw., vol. 13, no. 3, pp. 645–661, May 2002. [3] , “Constructive feedforward ART clustering networks–Part II,” IEEE Trans. Neural Netw., vol. 13, no. 3, pp. 662–677, May 2002. [4] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum, 1981. [5] J. C. Bezdek and N. R. Pal, “Two soft relatives of learning vector quantization,” Neural Netw., vol. 8, no. 5, pp. 729–743, 1995. [6] L. Bobrowski and J. C. Bezdek, “c-means clustering with the ` and ` norms,” IEEE Trans. Syst., Man, Cybern., vol. 21, no. 3, pp. 545–554, May–Jun. 1991. [7] V. Cherkassky and F. Mulier, Learning From Data: Concepts, Theory, and Methods. New York: Wiley, 1998. [8] D. Dubois and H. Prade, “A review of fuzzy set aggregation connectives,” Inform. Sci., vol. 36, no. 1–2, pp. 85–121, 1985. [9] H. Dyckhoff and W. Pedrycz, “Generalized means as a model of compensative connectives,” Fuzzy Sets Syst., vol. 14, no. 2, pp. 143–154, 1984. [10] G. Frosini, B. Lazzerini, and F. Marcelloni, “A modified fuzzy c-means algorithm for feature selection,” in Proc. 19th Int. Conf. North American Fuzzy Information Processing Soc., Atlanta, GA, Jul. 2000, pp. 148–152. [11] I. Gath and A. B. Geva, “Unsupervised optimal fuzzy clustering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, pp. 773–781, Jul. 1989. [12] D. E. Gustafson and W. Kessel, “Fuzzy clustering with a fuzzy covariance matrix,” in Proc. IEEE Conf. Decision Control, San Diego, CA, 1979, pp. 761–766.


[13] R. J. Hathaway and J. C. Bezdek, “NERF c-means: Non-Euclidean relational fuzzy clustering,” Pattern Recognit., vol. 27, no. 3, pp. 429–437, 1994. [14] , “Optimization of clustering criteria by reformulation,” IEEE Trans. Fuzzy Syst., vol. 3, no. 2, pp. 241–246, May 1995. [15] R. J. Hathaway, J. C. Bezdek, and Y. Hu, “Generalized fuzzy c-means clustering strategies using L norm distances,” IEEE Trans. Fuzzy Syst., vol. 8, no. 5, pp. 576–582, Oct. 2000. [16] F. Höppner, F. Klawonn, R. Kruse, and T. Runkler, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. New York: Wiley, 1999. [17] K. Jajuga, “L -norm based fuzzy clustering,” Fuzzy Sets Syst., vol. 39, pp. 43–50, 1991. [18] N. B. Karayiannis, “Fuzzy partition entropies and entropy constrained fuzzy clustering algorithms,” J. Intell. Fuzzy Syst., vol. 5, no. 2, pp. 103–111, 1997. [19] , “A methodology for constructing fuzzy algorithms for learning vector quantization,” IEEE Trans. Neural Netw., vol. 8, no. 3, pp. 505–518, May 1997. , “Reformulating learning vector quantization and radial basis [20] neural networks,” Fundamenta Informaticae, vol. 37, pp. 137–175, 1999. , “An axiomatic approach to soft learning vector quantization [21] based on reformulation,” IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 1153–1165, Sep. 1999. , “From aggregation operators to soft learning vector quantization [22] and clustering algorithms,” in Kohonen Maps, E. Oja and S. Kaski, Eds. Amsterdam, The Netherlands: Elsevier, 1999, pp. 47–56. [23] , “Generalized fuzzy c-means algorithms,” J. Intell. Fuzzy Syst., vol. 8, no. 1, pp. 63–81, 2000. [24] , “Soft learning vector quantization and clustering algorithms based on ordered weighted aggregation operators,” IEEE Trans. Neural Netw., vol. 11, no. 5, pp. 1093–1105, Sep. 2000. [25] , “Soft learning vector quantization and clustering algorithms based on mean-type aggregation operators,” Int. J. Fuzzy Syst., vol. 4, no. 3, pp. 739–751, 2002. [26] N. B. Karayiannis and J. C. Bezdek, “An integrated approach to fuzzy learning vector quantization and fuzzy c-means clustering,” IEEE Trans. Fuzzy Syst., vol. 5, no. 4, pp. 622–628, Nov. 1997. [27] N. B. Karayiannis, J. C. Bezdek, N. R. Pal, R. J. Hathaway, and P.-I. Pai, “Repairs to GLVQ: A new family of competitive learning schemes,” IEEE Trans. Neural Netw., vol. 7, no. 5, pp. 1062–1071, Sep. 1996. [28] N. B. Karayiannis and P.-I. Pai, “Fuzzy algorithms for learning vector quantization,” IEEE Trans. Neural Netw., vol. 7, no. 5, pp. 1196–1211, Sep. 1996. [29] N. B. Karayiannis and M. M. Randolph-Gips, “Reformulated radial basis neural networks with adjustable weighted norms,” in Proc. Int. Joint Conf. Neural Networks, vol. 3, Como, Italy, Jul. 24–27, 2000, pp. 608–613. , “Soft learning vector quantization and clustering algorithms based [30] on non-Euclidean norms: Multi-norm algorithms,” IEEE Trans. Neural Netw., vol. 14, no. 1, pp. 89–102, Jan. 2003. [31] N. B. Karayiannis and N. Zervos, “Entropy-constrained learning vector quantization algorithms and their application in image compression,” J. Electron. Imag., vol. 9, no. 4, pp. 495–508, 2000. [32] A. Keller and F. Klawonn, “Fuzzy clustering with weighting of data variables,” Int. J. Uncertainty, Fuzziness and Knowledge-Based Syst., vol. 8, no. 6, pp. 735–746, 2000. [33] F. Klawonn and R. Kruse, “Derivation of fuzzy classification rules from multidimensional data,” in Advances in Intelligent Data Analysis, G. E. Lasker and X. Liu, Eds. Windsor, ON, Canada: IIAS, 1995, pp. 
90–94.


[34] G. J. Klir and B. Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1995. [35] R. Krishnapuram and J. Kim, “A note on the Gustafson-Kessel and adaptive fuzzy clustering algorithms,” IEEE Trans. Fuzzy Syst., vol. 7, no. 4, pp. 453–461, Aug. 1999. [36] , “Clustering algorithms based on volume criteria,” IEEE Trans. Fuzzy Syst., vol. 8, no. 2, pp. 228–236, Apr. 2000. [37] J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles. Reading, MA: Addison-Wesley, 1974. [38] E. C.-K. Tsao, J. C. Bezdek, and N. R. Pal, “Fuzzy Kohonen clustering networks,” Pattern Recognit., vol. 27, no. 5, pp. 757–764, 1994. [39] R. R. Yager, “On mean type aggregation,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 26, no. 2, pp. 209–221, Apr. 1996.

Nicolaos B. Karayiannis (S’86–M’91–SM’01) was born in Greece, on January 1, 1960. He received the Diploma degree in electrical engineering from the National Technical University of Athens, Greece, in 1983 and the M.A.Sc. and Ph.D. degrees in electrical engineering from the University of Toronto, ON, Canada, in 1987 and 1991, respectively. He is currently a Professor with the Department of Electrical and Computer Engineering, University of Houston, TX. From 1984 to 1991, he was a Research and Teaching Assistant at the University of Toronto. From 1983 to 1984, he was a Research Assistant at the Nuclear Research Center Democritos, Athens, Greece, where he was engaged in research on multidimensional signal processing. He has published more than 130 papers, including 60 in technical journals, and is the coauthor of the book Artificial Neural Networks: Learning Algorithms, Performance Evaluation, and Applications (Boston, MA: Kluwer, 1993). His current research interests include wireless communications and networking, computer vision, image and video coding, neural networks, intelligent and neuro-fuzzy systems, and pattern recognition. Dr. Karayiannis received the W. T. Kittinger Outstanding Teacher Award in 1994 and the University of Houston El Paso Energy Foundation Faculty Achievement Award in 2000. He is also a co-recipient of a Theoretical Development Award for a paper presented at the Artificial Neural Networks in Engineering Conference in 1994. He is an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS and the IEEE TRANSACTIONS ON FUZZY SYSTEMS. He also served as the General Chair of the 1997 International Conference on Neural Networks, Houston, in 1997. He is a member of the International Neural Network Society (INNS) and the Technical Chamber of Greece.

Mary M. Randolph-Gips received the B.S. degrees in electrical engineering and engineering physics from the University of Kansas, Lawrence, in 1990, the M.S. degree in electro-optics from the University of Houston–Clear Lake, TX, in 1995, and the Ph.D. degree in electrical and computer engineering from the University of Houston, TX, in 2002. She is currently engaged in research on intelligent control systems for ultrasonic motors at the University of Houston-Clear Lake. From 1990 to 1997, she worked as a Space Shuttle Flight Controller in Payload Operations and Space Station Command and Data Handling for the United Space Alliance Corporation. Her research interests include neural networks, data mining, and fuzzy control.