Neurocomputing 81 (2012) 33–40


Cluster-based adaptive metric classification

Ioannis Giotis*, Nicolai Petkov

Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen, PO Box 407, 9700 AK Groningen, The Netherlands

Article history: Received 21 February 2011; received in revised form 11 October 2011; accepted 16 October 2011; available online 24 December 2011. Communicated by Xiaofei He.

Keywords: Adaptive metric; Cluster estimation; Gap statistic; Prototype-based classification; Principal component analysis; Bayes' rule

Abstract

Introducing an adaptive metric has been shown to improve the results of distance-based classification algorithms. Existing methods are often computationally intensive, either in the training or in the classification phase. We present a novel algorithm that we call Cluster-Based Adaptive Metric (CLAM) classification. It first determines the number of clusters in each class of a training set and then computes the parameters of a Mahalanobis distance for each cluster. The derived Mahalanobis distances are then used to estimate the probability of cluster- and, subsequently, class-membership. We compare the proposed algorithm with other classification algorithms using 10 different data sets. The proposed CLAM algorithm is as effective as other adaptive metric classification algorithms, yet it is simpler to use and in many cases computationally more efficient.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Pattern classification is an important scientific problem with numerous applications in various areas [1]. In the following, we are concerned with statistical pattern classification of vector data based on supervised learning. More specifically, we consider a finite training data set D which consists of r disjoint subsets D_i, i = 1, ..., r, of observations that belong to r different classes. Each observation is an n-dimensional vector that has a class label. The goal is to "learn" a model using the training set D in such a way that when a query vector q is presented the classifier will be able to correctly predict the class label of q.

The k-nearest neighbors (k-NN) algorithm [2] is one of the oldest and simplest methods proposed for classification. For each unlabeled query vector presented, the k nearest neighbor vectors from the training data set are found and the query vector is assigned to the most common class among these k neighbors. In most cases, the Euclidean distance is used to find the nearest neighbor vectors. As to the choice of the parameter k, popular choices are k = 1 (nearest neighbor classification) and k = \sqrt{\min(n_i)}, as mentioned in [1], where \min(n_i) is the number of training vectors in the smallest class. Another possibility is to estimate the classification error for every value of k from 1 up to some k_max and take the value of k for which this error has a minimum on a test set.
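For illustration, a minimal k-NN classifier along these lines can be written in a few lines; this is a generic sketch assuming NumPy and plain Euclidean distances, not code from the paper, and the function name is our own.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=1):
    """Classify a query vector by majority vote among its k nearest
    training vectors under the Euclidean distance."""
    # squared Euclidean distances from the query to all training vectors
    d2 = np.sum((X_train - query) ** 2, axis=1)
    # indices of the k closest training vectors
    nearest = np.argsort(d2)[:k]
    # majority vote over their class labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```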

* Corresponding author. Tel.: +31 50 363 7126; fax: +31 50 363 3800. E-mail addresses: [email protected] (I. Giotis), [email protected] (N. Petkov).

0925-2312/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2011.10.018

The performance of k-NN classifiers can be improved by adapting the distance metric using a training data set [3–9]. The component analysis family of algorithms learns transformations from constraints. Relevant Components Analysis (RCA) [10] learns a global linear transformation from a set of equivalence constraints. The learned transformation can then be used to compute distances between any two examples. Discriminative Component Analysis (DCA) and Kernel DCA [11] improve RCA by exploiting negative constraints and non-linear relationships using contextual information. Essentially, RCA and DCA can be viewed as extensions of Linear Discriminant Analysis (LDA) [12]. Neighborhood Component Analysis (NCA) [4] learns a distance metric by extending the nearest neighbor classifier. The Large Margin Nearest Neighbors (LMNN) algorithm [6] is considered to further extend and improve NCA. LMNN "learns" a local or global Mahalanobis distance based on the training data and then uses it for k-NN classification instead of the Euclidean distance. The appropriate transformations are derived by minimizing a cost function that penalizes large distances between data points with similar class labels and small distances between data points with different class labels.

The support vector machine (SVM) algorithm [13] is another way to solve mainly two-class problems, by constructing an (n−1)-dimensional hyperplane that separates the feature space into two regions. Theoretically, there can be an infinite number of such hyperplanes. SVMs consider as optimal the hyperplane that achieves the largest separation (margin) between the two classes. The algorithm chooses the decision boundary (hyperplane) so that the distance from it to the nearest data vector of each class


is maximized. If the two classes are separated by a non-linear boundary, SVMs use a kernel function to map the data into a different space, where a linear boundary exists. One popular approach to solving multiclass problems with SVMs is to reduce a single multiclass problem to multiple two-class problems; another is the Multiclass-SVM [14], which solves a single but larger optimization problem. SVMs have also been used in combination with nearest neighbor and component analysis approaches [15–17].

Learning Vector Quantization (LVQ) [18] and its numerous variants are a family of prototype-based classification algorithms. Given a set of data vectors of known classes, one or more prototype vectors are computed for each class. A new vector that needs to be classified is assigned to the class of the nearest prototype. In an extension of the basic LVQ algorithm, known as Generalized Relevance LVQ [19], a relevance value is assigned to each feature, indicating the weight with which that feature is taken into account in the distance computations during classification. The feature relevances are computed during the training process, simultaneously with the prototypes. In different variants of LVQ, different numbers of relevance vectors are used: one for the whole feature space, one for all prototypes of each class, or one separate relevance vector for each prototype. A further extension is Generalized Matrix LVQ [20], where a relevance matrix is used to compute distances between vectors, thus resulting in a transformation of the feature space, similar to the LMNN algorithm and kernel-based SVMs. An open issue in LVQ classifiers is the number of prototypes to be used for each class. One possible approach is to use cross-validation to determine the test error and study it as a function of the number of class prototypes. For a large number of prototypes per class, this method becomes infeasible due to the large number of possible combinations of different numbers of class prototypes.

Recently, information-theoretic approaches to distance learning have been proposed as well. In contrast to traditional techniques that assume a quadratic form for the distance between any two vectors, Information-Theoretic Metric Learning [21] considers the learning of the parameters of a Mahalanobis distance as a Bregman optimization problem, minimizing the differential relative entropy (LogDet divergence) between two multivariate Gaussians that are subject to linear constraints on the distance function.

Although successfully and widely used in the literature, all the aforementioned methods share the disadvantage of relying on various input parameters, such as the number of neighbors for nearest neighbor methods, the amount and type of constraints for component analysis, the kernel type and the kernel and slack parameters for SVMs, and the number of prototypes per class for LVQ variants. In practice the issue of tuning these parameters is tackled by extensive cross-validation. This process can be computationally very expensive, mainly with regard to the range of values of each parameter that needs to be optimized. Furthermore, some of the methods that use adaptive metrics already require a quite intensive training phase in order to compute the transformations of the feature space.

In this paper we present a novel classification algorithm that we call Cluster-Based Adaptive Metric (CLAM) classification, which addresses these issues. It comprises the following steps.
First, we estimate the number of clusters within each class of the training set using the gap statistic method [22]. Second, we estimate a mean and a covariance matrix for each such cluster and use them to compute the parameters of a Mahalanobis distance to that cluster. Finally, we estimate the likelihoods of cluster-membership for a query point, taking into account its distances to the clusters and the number of data points in the clusters. The main aspects of the CLAM algorithm are thus the use of (cluster means as) prototypes and the computation of prototype-specific adaptive metrics.

Modeling data as normal distributions is a principle mainly known from Gaussian Mixture Models (GMMs). Although initially designed for unsupervised learning, GMMs can also be employed to solve classification tasks. GMMs are based on an iterative process, called Expectation-Maximization (EM) [23], to estimate the parameters of a weighted mixture of Gaussian components. The CLAM algorithm, unlike EM, does not suffer from the drawback of having the number of Gaussian components set by the user and remaining fixed during the parameter estimation process. Conversely, we assume that we need to estimate a previously unknown number of Gaussian components, since each class might comprise more than one of them. The Figueiredo–Jain (FJ) algorithm [24] also tackles this problem, but in an often computationally expensive manner. It initially assumes an arbitrarily large number of components and subsequently adjusts their number by annihilating those components that are not supported by the data. It is possible, though, that it cannot yield any valid results, due to annihilation of all components within a class or because it cannot find a positive definite symmetric covariance matrix. Finally, it is noteworthy that in CLAM not every component is modeled as a normal distribution in the same feature space. First, a suitable transformation for each component is derived and then the parameters of a normal distribution are estimated.

This paper is organized as follows. In Section 2 we describe the proposed method in more detail. In Section 3 we apply this and other classification algorithms to various data sets and compare their performances. In Section 4 we discuss the proposed method and its results, and finally in Section 5 we summarize the results and draw conclusions.

2. Method

2.1. Estimation of the number of clusters

Fig. 1 illustrates a situation in which a single prototype per class is not sufficient. Therefore, as a first step we analyze each subset S_i of data points that belong to one class to check whether it consists of more than one cluster. For this purpose we use the gap statistic test as defined in [22]. The test compares the within-cluster dispersion of a data set divided into k clusters with the dispersion of a reference set R_i that comprises one cluster. The dispersion index W_k^{S_i} of a data set S_i divided into k clusters is defined as

W_k^{S_i} = \sum_{j=1}^{k} \frac{1}{2 n_j} D_j    (1)

where D_j is the sum of pair-wise distances between the n_j points of the j-th cluster. The dispersion index is actually a pooled within-cluster sum of squares around the cluster means if the distance metric used to compute D_j is the squared Euclidean distance. The dispersion indices for the actual and the reference data sets are computed for all k = 1, 2, ..., k_max. The parameter k_max can be user-defined, but a reasonable choice is the number of data points in S_i. The optimal number of clusters k̂ is the value of k for which the logarithm of the dispersion index of the real data set falls the furthest below the respective one of the reference set. In formal notation,

\hat{k} = \arg\max_k \left( \log(W_k^{R_i}) - \log(W_k^{S_i}) \right)    (2)
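As a concrete reading of Eq. (1), the following sketch computes the dispersion index of a data set for a given cluster assignment. It uses squared Euclidean distances, the function name is our own, and it is an illustration rather than the authors' implementation.

```python
import numpy as np

def dispersion_index(X, labels):
    """Eq. (1): W_k = sum_j D_j / (2 n_j), where D_j is the sum of
    pairwise squared Euclidean distances within the j-th cluster
    (ordered pairs, i.e. each unordered pair counted twice, which
    matches the 1/(2 n_j) factor)."""
    W = 0.0
    for c in np.unique(labels):
        C = X[labels == c]
        diff = C[:, None, :] - C[None, :, :]
        D_j = np.sum(diff ** 2)          # sum over all ordered pairs
        W += D_j / (2.0 * len(C))
    return W
```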


Fig. 1. Artificial data set: 2-dimensional data from two classes (circles and crosses). Each class contains a few clusters originating from different Gaussian distributions. The bold circle and cross denote the centroids of the two classes. One prototype per class is insufficient to model such data.

where R_i is a set of uniformly distributed data points. For this case Tibshirani et al. [22] have shown that log(W_k^{R_i}) ≈ log(W_1^{R_i}) − (2/p) log(k), where p is the dimensionality of the data. For any 1 < k ≤ k̂, log(W_k^{S_i}) undergoes a steeper descent than log(W_k^{R_i}). Conversely, for k̂ < k ≤ k_max, we add unnecessary clusters to S_i and therefore log(W_k^{R_i}) is now expected to have a larger descent step. Among all unimodal distributions, the uniform distribution is the one most likely to produce the smallest dispersion indices for any k > 1. Consequently, choosing the reference set in that fashion makes the rejection of the hypothesis of a single component stricter.

In CLAM the number of samples drawn for each reference set R_i is equal to the number of observations in the actual data set S_i. For a reliable estimation of the reference set indices, a number B > 1 of reference data sets is created and the results for each k are averaged over these sets. The exact value of B depends on the objective one needs to meet. We use B = 10 as suggested in [25]. The reference sets can be created in two ways: (a) by drawing samples from uniform distributions in the range defined by the minimum and maximum values of the features in the original data, and (b) by drawing samples from uniform distributions aligned with the principal components of the data. In the latter case the samples drawn are subsequently transformed back to the original feature space in order to obtain a reference set.

We use hierarchical clustering with single linkage. In this way, we can compute all dispersion indices without the need to perform the clustering procedure all over again for every different value of k. Initially S_i and R_i consist of k_max different clusters. An iterative procedure is then applied until all data points are merged into one cluster (k = 1). In every such step we compute the distances between different clusters, the two clusters with the minimum distance are merged into one, and the respective dispersion indices W_k^{S_i} and W_k^{R_i} for the data and the reference set are estimated (Eq. (1)). Single linkage defines the distance between two clusters A and B as the distance between the two closest elements in A and B:

d(A, B) = \min_{a \in A, b \in B} d(a, b)    (3)

When all dispersion indices are available, the value of k̂ is computed using Eq. (2).
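A compact sketch of the whole estimation step is given below. It follows the description above (single-linkage hierarchical clustering, B uniform reference sets drawn within the feature ranges of the data) and reuses dispersion_index from the previous sketch. It is our own simplified reconstruction, not the authors' code, and it omits the n_min post-processing described in the following paragraph.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def estimate_n_clusters(S_i, k_max, B=10, seed=0):
    """Estimate the number of clusters in one class S_i with the gap
    statistic, using single-linkage hierarchical clustering (Eqs. 1-2)."""
    rng = np.random.default_rng(seed)
    n, p = S_i.shape

    def log_dispersions(X):
        # one linkage tree gives the partitions for every k = 1..k_max
        Z = linkage(X, method='single')
        return np.array([
            np.log(dispersion_index(X, fcluster(Z, t=k, criterion='maxclust')))
            for k in range(1, k_max + 1)])

    log_W_data = log_dispersions(S_i)

    # B uniform reference sets drawn within the feature ranges of S_i
    lo, hi = S_i.min(axis=0), S_i.max(axis=0)
    log_W_ref = np.mean(
        [log_dispersions(rng.uniform(lo, hi, size=(n, p))) for _ in range(B)],
        axis=0)

    # Eq. (2): k maximizing the gap between reference and data dispersion
    return int(np.argmax(log_W_ref - log_W_data)) + 1
```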

In case the clustering result for the optimal k̂ yields clusters consisting of fewer elements than a certain number n_min, these clusters are eliminated and the respective data points are merged with the closest viable cluster. The default value for n_min is 1. In Fig. 2 the previous example of artificial data is presented again, only this time each class has been divided into clusters according to the gap statistic test, using hierarchical clustering with single linkage and the Euclidean distance.

2.2. Cluster-based adaptation of the distance metric

Next, we proceed to compute the mean m and the covariance matrix Σ of each cluster separately. The idea is to consider each cluster as a separate class and to model it by a normal distribution in order to estimate the class-conditional probability for any new point. The latter requires the computation of the Mahalanobis distance using the mean and the inverse of the covariance matrix.

We first apply principal component analysis (PCA) [26] to the vectors of each cluster separately, thus obtaining data in a transformed space of uncorrelated features. Let S_ij be the j-th cluster of the i-th class and let its empirical mean be equal to zero. We first compute the singular value decomposition S_ij = W S V^T and then obtain the set of principal components S'_ij = S_ij V. The cluster mean m_ij and the covariance matrix Σ_ij of S'_ij are then computed using Maximum Likelihood Estimation (MLE) [27] as shown in Eqs. (4) and (5):

m_{ij} = \frac{1}{N} \sum_{l=1}^{N} x_l    (4)

\Sigma_{ij} = \frac{1}{N-1} \sum_{l=1}^{N} (x_l - m_{ij})(x_l - m_{ij})^T    (5)
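The per-cluster transformation and the estimates of Eqs. (4) and (5) can be written compactly as below. This is a hedged sketch of our reading of the text (centering each cluster, rotating it onto its principal components, then estimating the mean and covariance there); the function name and return convention are our own choices.

```python
import numpy as np

def fit_cluster_model(S_ij):
    """Fit one cluster: PCA rotation plus mean and covariance (Eqs. 4-5)
    estimated in the rotated (principal component) space."""
    center = S_ij.mean(axis=0)
    X = S_ij - center                       # empirical mean set to zero
    # singular value decomposition X = W S V^T; V holds the principal axes
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = Vt.T
    X_pc = X @ V                            # principal component scores, S'_ij
    m_ij = X_pc.mean(axis=0)                # Eq. (4); ~0 by construction
    Sigma_ij = np.cov(X_pc, rowvar=False)   # Eq. (5), 1/(N-1) normalization
    # in the principal component space Sigma_ij is (numerically) diagonal
    return center, V, m_ij, np.atleast_2d(Sigma_ij)
```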

The resulting covariance matrix Σ_ij will be diagonal, with the distinct positions of the primary diagonal indicating the variances of the different principal components. It is possible that some matrices are singular and cannot be inverted, due to some principal components having zero variance within a cluster of data. This would yield an infinite Mahalanobis distance to this cluster, so such matrices need a correction. The PCA transform is applied within each cluster and therefore the variance of a principal component is inversely proportional to its discriminative power.


Fig. 2. Artificial data set: two-dimensional data from two classes (circles and crosses). Using the gap statistic, the circle class is divided into three clusters and the cross class is divided into two clusters. The cluster centroids are rendered bold. In this case the two classes are perfectly separated from each other.

Table 1. Data sets used for experiments.

Set name | Instances | Features | Classes | Notes
Iris | 150 | 4 | 3 |
Balance scale | 625 | 4 | 3 | 465 instances comprise the training set and 160 the test set
Glass identification | 214 | 10 | 2 | The original data is divided into seven classes that can be aggregated into two larger ones: "Window" and "Non-window" glass. We use the latter separation
Wine | 178 | 13 | 3 |
Image segmentation | 2310 | 19 | 7 | 210 instances comprise the training set and the remaining 2100 the test set
Optical recognition of handwritten digits | 5620 | 64 | 10 | 3823 instances comprise the training set and the remaining 1797 the test set
Parkinsons [29] | 195 | 22 | 2 |
Wisconsin breast cancer | 569 | 30 | 2 |
Ionosphere | 351 | 34 | 2 |
Haberman | 306 | 3 | 2 |

We want to impose the principle that features with small variances should be heavily weighted when computing the distance metric, but those weights cannot be infinite. Therefore, the zero values on the primary diagonal of the covariance matrices are substituted with the smallest non-zero variance found across all the clusters of a data set. This means that the most discriminative features are now assigned the highest finite weight possible within a data set. In order to be compliant with numerical machine precision, we consider equal to zero the variances of those principal components that account for less than ε = 10^{-5} of the total variance in a cluster.

After the corrected covariance matrices Σ̂_ij are obtained, the distance of a data vector x to the j-th cluster of the i-th class can be computed with the adapted metric:

d_{ij}(x) = (x - m_{ij})^T \hat{\Sigma}_{ij}^{-1} (x - m_{ij})    (6)

In Eq. (6) the coordinates x of the data point are in the principal component space of the concerned cluster.

2.3. Cluster- and class-membership probability estimation

We define the cluster-membership probability for a data point x as a Bayesian posterior probability:

P(C_{ij} \mid x) \propto P(C_{ij}) \, P(x \mid C_{ij})    (7)

The prior probability P(C_{ij}) is related to the number n_{ij} of data vectors that belong to the cluster C_{ij}. The likelihood of this cluster at a certain point x of the feature space, P(x | C_{ij}), relates to the probability density of a normal distribution N_r(m_{ij}, Σ̂_{ij}) at the same point. The posterior probability that a data vector x belongs to a cluster C_{ij} can now be formally defined using Eq. (8):

P(C_{ij} \mid x) \propto n_{ij} \frac{1}{\sqrt{|\hat{\Sigma}_{ij}|}} \, e^{-\frac{1}{2} d_{ij}(x)}    (8)

Finally, the class-membership probability of a data vector x for the i-th class of a given problem is computed according to Eq. (9), as the sum of the relevant cluster-membership probabilities:

P(\omega_i \mid x) \propto \sum_{C_{ij} \in \omega_i} P(C_{ij} \mid x)    (9)
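Putting Sections 2.2 and 2.3 together, a sketch of the variance correction and of the classification step of Eqs. (6)-(9) could look as follows. It reuses fit_cluster_model from the sketch after Eq. (5), the dictionary keys and function names are our own, and it is an illustrative reconstruction rather than the authors' implementation.

```python
import numpy as np

def correct_variances(variances_per_cluster, eps=1e-5):
    """Replace (near-)zero principal-component variances with the smallest
    non-zero variance found across all clusters of the data set."""
    thresholded = []
    for v in variances_per_cluster:
        v = v.copy()
        v[v < eps * v.sum()] = 0.0   # below eps of the total variance -> zero
        thresholded.append(v)
    floor = min(x for v in thresholded for x in v if x > 0)
    return [np.where(v > 0, v, floor) for v in thresholded]

def class_posteriors(x, clusters):
    """clusters: list of dicts with keys 'class', 'n', 'center', 'V',
    'm', 'var' (corrected diagonal variances in the PC space).
    Returns unnormalized class scores following Eqs. (6)-(9)."""
    scores = {}
    for c in clusters:
        # map x into the cluster's principal component space
        x_pc = (x - c['center']) @ c['V']
        # Eq. (6): Mahalanobis distance with a diagonal covariance
        d = np.sum((x_pc - c['m']) ** 2 / c['var'])
        # Eq. (8): prior ~ cluster size, likelihood ~ Gaussian density
        p = c['n'] * np.exp(-0.5 * d) / np.sqrt(np.prod(c['var']))
        scores[c['class']] = scores.get(c['class'], 0.0) + p   # Eq. (9)
    return scores
```

The predicted label for a query x is then simply the class with the largest accumulated score.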

3. Experimental results

We evaluated the proposed classification technique on 10 data sets that vary in size and dimensionality. All data sets used are available in the UCI Machine Learning Repository [28]. The characteristics of the data sets are summarized in Table 1. We compare the CLAM classification with the k-NN, LMNN, GMLVQ, SVM, GMM, ITML and DCA algorithms. Where a separate test set was provided, the results presented in this section reflect the error rates obtained on it. For the remaining data sets, results are obtained using leave-one-out cross-validation.

The results of the gap statistic within the CLAM algorithm (sub-clusters per class) are also used as the number of prototypes per class for GMLVQ. With regard to GMM we use a Bayesian


classification scheme, as suggested in [30], and the relevant implementation. Mixture models are trained with the Expectation-Maximization algorithm. The number of Gaussian components varies from 1 to the maximum number of clusters found in one class by the gap statistic. With regard to the nearest neighbor based methods (k-NN, LMNN, ITML [31], DCA) we use k = 1, 2, ..., ⌊√(min(n_i))⌋ and the results for those k's that minimize the error rate are presented. Especially for DCA we use 1% positive and 1% negative constraints to learn a global transformation of the space, as suggested in [11], and for ITML we cross-validate values for the slack variable γ = [10^{-4}, ..., 10^{4}]. In CLAM we set k_max = min(n_i)/3, where min(n_i) is the number of training vectors in the smallest class. Finally, for the SVMs we use Gaussian and polynomial kernel machines from the SVM-KM toolbox implementation of Canu et al. [32]. For multi-class problems we follow the Multiclass-SVM and the One-Against-All approaches. The slack constant C, the degree of the polynomial d and the width of the Gaussian kernel γ are set similarly to the work of Hsu and Lin [33]. Hence, we set C = [2^{-12}, 2^{-11}, ..., 2^{2}], γ = [2^{-4}, 2^{-3}, ..., 2^{10}] and d = [-3, -2, ..., 3]. We conduct experiments with all matching combinations of the free parameters and finally present the result of the combination that minimizes the error rate.

Experiments regarding CLAM were conducted using both ways of creating reference sets. The results indicate that taking into account the shape of the data distribution proves beneficial for all data sets apart from Haberman and Glass Identification. However, the differences in the error rates are rather small (≤ 2%). As expected, creating reference sets aligned with the principal components of the data requires some extra computational time, yet of negligible magnitude.
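For readers who want to reproduce a comparable SVM baseline, a grid search over the parameter ranges quoted above might be set up as follows. This uses scikit-learn's SVC as a stand-in for the SVM-KM Matlab toolbox actually used in the paper, so it is an approximation of the protocol, not the original setup; in particular only positive polynomial degrees are kept.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# parameter ranges quoted in the text (Hsu and Lin style grids)
param_grid = [
    {'kernel': ['rbf'],
     'C': 2.0 ** np.arange(-12, 3),
     'gamma': 2.0 ** np.arange(-4, 11)},
    {'kernel': ['poly'],
     'C': 2.0 ** np.arange(-12, 3),
     'degree': [1, 2, 3]},   # negative degrees are not meaningful here
]

def best_svm(X_train, y_train):
    """Cross-validate over the grid and return the best estimator."""
    search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    return search.best_estimator_
```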

3.1. Classification accuracy results

We present the estimated error rates in Table 2. In Fig. 3 a graphical representation of the individual results is shown, together with the mean error rate for each method over all 10 evaluations. This figure shows that over all tasks all eight classifiers perform well. Depending on the data set, different methods perform better or worse, which largely explains the rather small differences in the overall error rates. Nevertheless, SVM is the method with the minimum error rate (0.077) over the 10 tasks. LMNN and GMLVQ perform only a little worse, reaching an average error rate of 0.079. k-NN is the worst performing classifier, reaching an average error rate of 0.113. CLAM performs comparably to the top three methods (<1% higher overall error) and very similarly to DCA. What is of conceptual value is that CLAM outperforms all other methods on occasion. Compared to the best scoring classifier (SVM), the proposed method performs better for 3 out of


10 data sets, whereas compared to the worst (k-NN), CLAM has lower error rates for 7 out of 10 sets.

3.2. Processing times

We also present the processing time each method needed to complete each classification task. In those cases where no separate test set was provided, one has been formed by choosing a random 30% of the data of each class of each data set. In this manner we ensure that all methods are applied to the same training and testing data, thus rendering the comparison fair and valid. The results presented in Table 3 refer to one run of the training and testing phases of each method. All experiments have been conducted under the same hardware and software setup.

The fact that CLAM is consistently faster than SVM constitutes the most interesting remark at this point of the analysis. For small training sets in particular, such as the Image Segmentation data set, the increase in efficiency reaches a factor of 600. This factor is, however, decreasing for larger training sets. Among the studied data sets the smallest speed-up occurs for the Handwritten Digits Recognition data set, with a factor of around 2.5. For some data sets CLAM also provides a faster solution when compared to LMNN, GMLVQ and ITML.

3.3. Method ranking

Since the error rates observed for different methods in many cases do not differ significantly, we introduce an additional measure of evaluation, this time of a comparative nature. In Table 4 we present the ranking of each method for each data set among the total of 8 methods, together with the average ranking of each method over all data sets. The classifiers that most often rank first are SVM and GMLVQ, which also achieve the highest average rankings of 2.8 and 3, respectively. The results regarding classification accuracy also confirm that CLAM, LMNN, GMM and DCA are expected to perform very similarly. What is interesting, however, is the fact that although DCA reaches a lower overall error rate than GMM, it achieves a worse ranking. With regard to CLAM, it is worth mentioning that it achieves rankings above its average for a wide variety of data sets, whereas its worst ranking is obtained for the Digits Recognition data set. This fact, alongside the respective processing times, indicates that CLAM is mostly suitable for data sets with fewer training points, but it is not strongly affected by the dimensionality of the data or the number of classes.

Further interesting observations emerge from studying how these methods perform in relation to each other. Table 5 shows how well the compared methods perform in relation to the proposed one, when the latter performs better or worse than its average (rankings 1–3 and 4–8, respectively). It is clear that for those tasks that CLAM cannot handle very well the nearest

Table 2. Average error rates.

Dataset | kNN | LMNN | GMLVQ | SVM | GMM | ITML | DCA | CLAM
Iris | 0.033 | 0.047 | 0.040 | 0.027 | 0.027 | 0.027 | 0.020 | 0.027
Balance scale | 0.088 | 0.050 | 0.088 | 0.006 | 0.044 | 0.050 | 0.075 | 0.044
Glass identification | 0.033 | 0.051 | 0.089 | 0.084 | 0.070 | 0.094 | 0.056 | 0.061
Wine | 0.230 | 0.067 | 0.006 | 0.045 | 0.006 | 0.056 | 0.011 | 0.006
Image segmentation | 0.123 | 0.011 | 0.084 | 0.084 | 0.243 | 0.075 | 0.125 | 0.121
Digits | 0.018 | 0.018 | 0.025 | 0.012 | 0.027 | 0.021 | 0.037 | 0.046
Parkinsons | 0.139 | 0.128 | 0.113 | 0.139 | 0.185 | 0.154 | 0.108 | 0.164
Wisc. breast cancer | 0.067 | 0.055 | 0.021 | 0.038 | 0.046 | 0.063 | 0.050 | 0.045
Ionosphere | 0.134 | 0.111 | 0.080 | 0.066 | 0.080 | 0.125 | 0.111 | 0.100
Haberman | 0.261 | 0.252 | 0.245 | 0.265 | 0.245 | 0.255 | 0.278 | 0.245
Mean error | 0.113 | 0.079 | 0.079 | 0.077 | 0.097 | 0.092 | 0.087 | 0.086


[Fig. 3 shows, per data set (Iris, Balance, Glass, Wine, Segmentation, Digits, Parkinsons, WDBC, Ionosphere, Haberman) and for the overall mean, the classification error in % (0–30) of kNN, LMNN, GMLVQ, SVM, GMM, ITML, DCA and CLAM.]

Fig. 3. Comparative classification error rate results.

Table 3. List of processing times (in seconds unless otherwise indicated).

Dataset | kNN | LMNN | GMLVQ | SVM | GMM | ITML | DCA | CLAM
Iris | 0.25 | 6.1 | 15.3 | 74 | 0.2 | 30 | 0.3 | 0.5
Balance scale | 0.5 | 44 | 97 | ≈1.5 h | >1.5 h | 53 | 0.8 | 103
Glass identification | 0.1 | 34 | 19 | 105 | 0.1 | 7.8 | 0.2 | 9.3
Wine | 0.1 | 35 | 17.5 | 228 | 0.1 | 41 | 0.1 | 0.8
Image segmentation | 3.8 | 262 | 63.4 | 1127 | >1.5 h | 110 | 1.2 | 1.8
Digits | 8.5 | ≈17 h | ≈1 h | >30 h | 2.5 | 346 | 637 | ≈12 h
Parkinsons | 0.1 | 50 | 29.2 | 100 | 0.5 | 7.8 | 0.1 | 4.9
Wisc. breast cancer | 1 | 210 | 83 | 1615 | 0.3 | 9.6 | 0.7 | 102
Ionosphere | 0.4 | 115 | 53 | 353 | 0.1 | 7.2 | 0.3 | 26.8
Haberman | 0.3 | 6.8 | 22.4 | 253 | 0.1 | 6 | 0.2 | 19

neighbor techniques and GMLVQ exceed their average performance, while GMM performs similarly to CLAM and, naturally, worse than its own average both in terms of ranking and error rate. GMLVQ also follows a pattern of behavior similar to CLAM. SVM, ITML and DCA only slightly deviate from their average rankings, yet they reach a higher overall error rate and a worse ranking. With regard to the tasks that CLAM seems to handle best, k-NN and LMNN perform worse than average and GMLVQ


Table 4. Classification method rankings.

Dataset | kNN | LMNN | GMLVQ | SVM | GMM | ITML | DCA | CLAM
Iris | 3 | 5 | 4 | 2 | 2 | 2 | 1 | 2
Balance scale | 5 | 3 | 5 | 1 | 2 | 3 | 4 | 2
Glass identification | 1 | 2 | 7 | 6 | 5 | 8 | 3 | 4
Wine | 6 | 5 | 1 | 3 | 1 | 4 | 2 | 1
Image segmentation | 5 | 1 | 3 | 3 | 7 | 2 | 6 | 4
Digits | 2 | 2 | 4 | 1 | 5 | 3 | 6 | 7
Parkinsons | 4 | 3 | 2 | 4 | 7 | 5 | 1 | 6
Wisc. breast cancer | 8 | 6 | 1 | 2 | 4 | 7 | 5 | 3
Ionosphere | 6 | 4 | 2 | 1 | 2 | 5 | 4 | 3
Haberman | 4 | 2 | 1 | 5 | 1 | 3 | 6 | 1
Average rank | 4.4 | 3.3 | 3 | 2.8 | 3.6 | 4.2 | 3.8 | 3.3

Table 5. Method ranking and average error rates in relation to CLAM performance.

Method | CLAM better than average: Rank | CLAM better than average: Error rate | CLAM worse than average: Rank | CLAM worse than average: Error rate
k-NN | 5.3 | 0.136 | 3 | 0.078
LMNN | 4.2 | 0.097 | 2 | 0.052
GMLVQ | 2.3 | 0.080 | 4 | 0.078
SVM | 2.3 | 0.075 | 3.5 | 0.080
GMM | 2 | 0.075 | 6 | 0.131
ITML | 4 | 0.096 | 4.5 | 0.086
DCA | 3.7 | 0.091 | 4 | 0.082
CLAM | 2 | 0.078 | 5.3 | 0.098

Table 6. Method ranking and average error rates in relation to GMM performance.

Method | GMM better than average: Rank | GMM better than average: Error rate | GMM worse than average: Rank | GMM worse than average: Error rate
k-NN | 4.8 | 0.149 | 4 | 0.076
LMNN | 3.8 | 0.105 | 2.8 | 0.053
GMLVQ | 2.6 | 0.092 | 3.4 | 0.066
SVM | 2.4 | 0.082 | 3.2 | 0.071
GMM | 1.6 | 0.080 | 5.6 | 0.114
ITML | 3.4 | 0.103 | 5 | 0.081
DCA | 3.4 | 0.099 | 4.2 | 0.075
CLAM | 1.8 | 0.084 | 4.8 | 0.087

improves as expected. Finally, it is worth noting that among the data sets that CLAM does not handle very well, GMM is still the method with the worst ranking and error rate. GMM and CLAM establish a pattern of similar behavior towards different types of data. This might be due to the similar way in which they model the data, although CLAM generally performs better. To examine this further, we also compared how the eight methods perform on the tasks for which GMM performs better or worse than its average. Among the subset of tasks that GMM handles best, the difference from the CLAM results is relatively small. In addition, the results in Table 6 further confirm the behavior of the nearest neighbor and LVQ techniques in comparison to both CLAM and GMM, but it is noteworthy that the variations are in this case much smaller.

4. Discussion

A remark is due on the performance of SVMs. Here the whole family of SVM variants is treated as one, although during experimentation we used many different types of SVMs and different approaches for multiclass problems. Note that the SVM is also the method with the most free parameters that need to be tuned, hence the extensive cross-validation over a wide range of such parameters. The interpretation of the SVM results should therefore be that for most data sets there is one SVM variant under a particular configuration of free parameters that can outperform the other methods. However, the procedure required for obtaining such results is computationally very expensive, as shown by the processing times reported. In contrast, CLAM performs comparably for many data sets with only a minimum of input parameterization required and much shorter processing times. With regard to the faster acquisition of comparable results, the same holds for LMNN and GMLVQ as well.

It is also worthy of note that only DCA achieves consistently shorter processing times than the proposed technique with comparable classification accuracy rates. However, DCA needs at least two input parameters to be tuned: the number of nearest neighbors used in the classification phase and the amount of constraints used in the learning phase. In our experiments, though, only the number of nearest neighbors is cross-validated. Fine tuning of the amount of constraints, and thus the number of chunklets, can by all means affect the classification accuracy and the total processing times. This fact stresses once more the advantage of the minimal input parameterization of CLAM. In addition, DCA only provides a global transformation, whereas CLAM yields a set of localized transformations for each detected sub-cluster, which has proven favorable for many data sets.

A remark is also due on the parametric nature of the proposed method. Indeed, the cluster means and covariance matrices are used as parameters of normal distributions. Note, however, that other methods, generally seen as non-parametric, also make implicit assumptions about the type of underlying distribution and compute parameters. For instance, the k-NN algorithm makes the implicit assumption of uniform density distributions in the hyper-sphere in which the k nearest neighbors are found and uses a parameter k. The LMNN algorithm is a generalization in which the hyper-sphere is replaced by hyper-ellipses, which, next to k, constitute the parameters of that algorithm.

The proposed method relies on explicit probability estimation, while other algorithms use distances to data points (k-NN, LMNN) or prototypes (GMLVQ) for classification. This aspect is not of conceptual value: the proposed method can easily be adapted by using Mahalanobis distances to the clusters, corrected with a term that takes into account the number of points in a cluster. Finally, one can compute distances to all points (similar to k-NN and LMNN) instead of distances to cluster prototypes, but this leads to increased computational complexity.

5. Summary and conclusions

In this paper we introduced a novel technique named Cluster-Based Adaptive Metric (CLAM) classification. Using data sets of labeled examples, the proposed method separates, if needed, the data in every class into clusters using statistical reasoning. For each cluster a local PCA transformation of the space is computed and a Mahalanobis distance is derived in the transformed space. The cluster-membership likelihood for a new data vector is evaluated using a Bayesian scheme and the cluster-specific Mahalanobis distances. The class-membership probability for a given class is computed as the sum of all cluster-membership probabilities for the clusters in that class. Experimental results on multiple data sets show that CLAM can reach high accuracy levels and that it scales well with regard to the number of classes and the dimensionality of the data.


However, the step of creating the B reference data sets for the gap statistic proves to be a noteworthy bottleneck for classes with very many data points. The computational demand of CLAM can nonetheless be reduced further by using the estimate log(W_k^{R_i}) = log(W_1^{R_i}) − (2/p) log(k), rather than evaluating the dispersion of random uniformly distributed sets R_i for every value of k.

CLAM consistently outperforms k-NN and performs comparably with modern fast classification methods such as ITML and DCA. With regard to SVM, LMNN and GMLVQ, the proposed method reaches slightly lower accuracy levels, but in many cases it offers a significant increase in computational efficiency, since it applies only a single pass of batch processing to the data. We conducted an extensive comparison between GMM and CLAM, because of their similarities, and concluded that the latter is more likely to perform better both in terms of accuracy and speed. The experimental results demonstrate efficiency and robustness and suggest that CLAM can be applied to a wide variety of classification problems.

The minimal input parameterization requirement is also an important characteristic of the proposed method. In practice CLAM requires only one parameter to be set, namely the maximum possible number of clusters for a given collection of data. This parameter can be configured by the user, but it can also be set automatically in relation to the number of data points. The absence of any parameter optimization process gives the proposed technique an advantage over other successfully applied classifiers in terms of both usability and computational efficiency.

Our main conclusion is that the proposed CLAM algorithm can be as effective as other adaptive metric or prototype-based classification algorithms and yet can offer a considerable advantage in terms of usability and efficiency.

References

[1] R. Duda, P. Hart, D. Stork, Pattern Classification, second ed., Wiley-Interscience, 2000.
[2] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf. Theory 13 (1) (1967) 21–27.
[3] N. Shental, T. Hertz, D. Weinshall, M. Pavel, Adjustment learning and relevant component analysis, in: ECCV '02: Proceedings of the 7th European Conference on Computer Vision, Part IV, Springer-Verlag, London, UK, 2002, pp. 776–792.
[4] J. Goldberger, S. Roweis, G. Hinton, R. Salakhutdinov, Neighborhood component analysis, in: NIPS, 2004.
[5] A. Bar-Hillel, T. Hertz, N. Shental, D. Weinshall, Learning a Mahalanobis metric from equivalence constraints, J. Mach. Learn. Res. 6 (2005) 937–965.
[6] K. Weinberger, L. Saul, Distance metric learning for large margin nearest neighbor classification, J. Mach. Learn. Res. 10 (2009) 207–244.
[7] T. Hastie, R. Tibshirani, Discriminant adaptive nearest neighbor classification, IEEE Trans. Pattern Anal. Mach. Intell. 18 (6) (1996) 607–615.
[8] C. Domeniconi, J. Peng, D. Gunopulos, Locally adaptive metric nearest-neighbor classification, IEEE Trans. Pattern Anal. Mach. Intell. 24 (9) (2002) 1281–1285. doi:10.1109/TPAMI.2002.1033219.
[9] P. Vincent, Y. Bengio, K-local hyperplane and convex distance nearest neighbor algorithms, in: T.G. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems, vol. 14, MIT Press, 2002.
[10] A. Bar-Hillel, T. Hertz, N. Shental, D. Weinshall, Learning distance functions using equivalence relations, in: Proceedings of the Twentieth International Conference on Machine Learning, 2003, pp. 11–18.
[11] S.C.H. Hoi, W. Liu, M.R. Lyu, W.-Y. Ma, Learning distance metrics with contextual constraints for image retrieval, in: Proceedings of Computer Vision and Pattern Recognition, 2006, pp. 2072–2078.
[12] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen. 7 (1936) 179–188.
[13] B. Boser, I. Guyon, V. Vapnik, A training algorithm for optimal margin classifiers, in: COLT '92: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM, 1992, pp. 144–152.
[14] J. Weston, C. Watkins, Support vector machines for multi-class pattern recognition, in: Proceedings of the Seventh European Symposium on Artificial Neural Networks, 1999.
[15] J. Peng, D. Heisterkamp, H. Dai, LDA/SVM driven nearest neighbor classification, IEEE Trans. Neural Networks 14 (4) (2003) 940–942. doi:10.1109/TNN.2003.813835.
[16] J. Peng, D.R. Heisterkamp, H.K. Dai, Adaptive quasiconformal kernel nearest neighbor classification, IEEE Trans. Pattern Anal. Mach. Intell. 26 (5) (2004) 656–661.
[17] H. Zhang, A.C. Berg, M. Maire, J. Malik, SVM-KNN: discriminative nearest neighbor classification for visual category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 2006, pp. 2126–2136. doi:10.1109/CVPR.2006.301.
[18] T. Kohonen, Self-Organizing Maps, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2001.
[19] B. Hammer, T. Villmann, Generalized relevance learning vector quantization, Neural Networks 15 (2002) 1059–1068.
[20] P. Schneider, M. Biehl, B. Hammer, Adaptive relevance matrices in learning vector quantization, Neural Comput. 21 (2009) 3532–3561.
[21] J.V. Davis, B. Kulis, P. Jain, S. Sra, I.S. Dhillon, Information-theoretic metric learning, in: ICML, Corvallis, Oregon, USA, 2007, pp. 209–216.
[22] R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a dataset via the gap statistic, J. R. Stat. Soc. Ser. B 63 (2000) 411–423.
[23] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B 39 (1) (1977) 1–38.
[24] M.A.T. Figueiredo, A.K. Jain, Unsupervised learning of finite mixture models, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2000) 381–396.
[25] S. Dudoit, J. Fridlyand, A prediction-based resampling method for estimating the number of clusters in a dataset, Genome Biol. 3 (7) (2002) 1–21.
[26] I. Jolliffe, Principal Components Analysis, Springer-Verlag, New York, 1986.
[27] B. Efron, Maximum likelihood and decision theory, Ann. Stat. 10 (2) (1982) 340–356.
[28] A. Asuncion, D. Newman, UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.
[29] M. Little, P. McSharry, S. Roberts, D. Costello, I. Moroz, Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection, Biomed. Eng. Online 6 (1) (2007) 23.
[30] P. Paalanen, J.-K. Kämäräinen, J. Ilonen, H. Kälviäinen, Feature representation and discrimination based on Gaussian mixture model probability densities - practices and algorithms, Pattern Recognition 39 (2006) 1346–1358.
[31] J.V. Davis, B. Kulis, P. Jain, S. Sra, I.S. Dhillon, Information Theoretic Metric Learning, UT Austin, http://www.cs.utexas.edu/users/pjain/itml/.
[32] S. Canu, Y. Grandvalet, V. Guigue, A. Rakotomamonjy, SVM and Kernel Methods Matlab Toolbox, Perception Systèmes et Information, INSA de Rouen, Rouen, France, 2005.
[33] C.-W. Hsu, C.-J. Lin, A comparison of methods for multi-class support vector machines, IEEE Trans. Neural Networks 13 (2002) 415–425.

Ioannis Giotis received the Diploma (2006) in Informatics and Telecommunications from the University of Athens, Greece and the M.Sc. degree with distinction (2008) in Technology Education and Digital Systems from the University of Piraeus, Greece. He has been an associate of the Institute of Informatics and Telecommunications of the National Center for Scientific Research, Athens, Greece. Currently, he is a PhD candidate at the Johann Bernoulli Institute for Mathematics and Computer Science of the University of Groningen, The Netherlands. His current research is in the field of pattern recognition focusing on the extraction and learning of visual patterns and their applications in the health care domain.

Nicolai Petkov received the Dr.sc.techn. degree in Computer Engineering (Informationstechnik) from Dresden University of Technology, Dresden, Germany. He is professor of Computer Science and head of the Intelligent Systems group of the Johann Bernoulli Institute for Mathematics and Computer Science of the University of Groningen, The Netherlands. He is the author of two monographs and co-author of another book on parallel computing, holds four patents and has authored over 100 scientific papers. His current research is in image processing, computer vision and pattern recognition, and includes computer simulations of the visual system of the brain, computer applications in health care and life sciences, and creating computer programs for artistic expression. Dr. Petkov is a member of the editorial boards of several journals.