Expert Systems with Applications 38 (2011) 11820–11828


Prototype reduction techniques: A comparison among different approaches

Loris Nanni a,*, Alessandra Lumini b

a Department of Information Engineering, University of Padua, Via Gradenigo 6/B, 35131 Padova, Italy
b DEIS, Università di Bologna, Via Venezia 52, 47521 Cesena, Italy


Keywords: Prototype reduction; Nearest neighbor based classifiers; Learning prototypes and distances; Particle swarm optimization; Genetic algorithm

Abstract

The two main drawbacks of nearest neighbor based classifiers are their high CPU cost when the number of samples in the training set is large and their strong sensitivity to outliers. Several attempts at overcoming these drawbacks have been proposed in the pattern recognition field, aimed at selecting/generating an adequate subset of prototypes from the training set. The problem addressed in this paper concerns the comparison of methods for prototype reduction; several methods for finding a good set of prototypes are evaluated: particle swarm optimization, a clustering algorithm, a genetic algorithm, and learning prototypes and distances. Experiments are carried out on several classification problems in order to evaluate the considered approaches in conjunction with different nearest neighbor based classifiers: the 1-nearest-neighbor classifier, the 5-nearest-neighbor classifier, the nearest feature plane based classifier, and the nearest feature line based classifier. Moreover, we propose a method for creating an ensemble of classifiers, where each classifier is trained with a different reduced set of prototypes. Since these prototypes are generated using a supervised optimization function, we have called our ensemble "supervised bagging". The training phase consists of repeating the prototype generation N times; the scores resulting from classifying a test pattern using each set of prototypes are then combined by the "vote rule". The reported results show the superiority of this method with respect to the well known bagging approach for building ensembles of classifiers. Our best results are obtained when the 1-nearest-neighbor classifier is coupled with a "supervised" bagging ensemble of learning prototypes and distances. As expected, the approaches for prototype reduction proposed for the 1-nearest-neighbor classifier do not work as well when other classifiers are tested. In our experiments the best method for prototype reduction when different classifiers are used is the genetic algorithm.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Currently, several machine learning applications require managing extremely large data sets with the aim of data mining or classification. In many problems a general purpose classifier based on the distance from a set of prototypes, i.e. the nearest neighbor (NN) classification rule, has been successfully used. The good behavior of nearest neighbor based classifiers is related to the number of prototypes, but in many practical pattern recognition applications only a small number of prototypes is usually available and, typically, this limitation causes a strong degradation of the ideal asymptotic behavior of nearest neighbor based classifiers (Bezdek & Kuncheva, 2001; Dasarathy, 1991). Unfortunately, another strong limitation exists: the computational cost of a nearest neighbor based classifier increases with the number of prototypes. In fact, nearest neighbor based classifiers require

the storage of the whole training set, which, in some cases, may be very large and may require a high computation time in the classification stage. One possible solution to this computational problem is to reduce the number of prototypes, while simultaneously insisting that the performance on the reduced set is nearly as good as the performance obtained with the original whole set of prototypes. The idea of prototype reduction for classification purposes has been explored by many researchers and has resulted in the development of many algorithms (Bezdek & Kuncheva, 2001; Dasarathy, 1991), which are usually divided into two groups:

- Selective, i.e. methods for prototype selection (or extraction), which concern the identification of an optimal subset of representative objects from the original data.
- Creative, i.e. approaches for prototype generation, which involve the creation of an entirely new set of objects.

At a second level the selective approaches can be further classified as:

- Editing: concerning the processing of the training set with the aim of increasing generalization capabilities, i.e. removing "outlier" prototypes that contribute to the misclassification rate, or patterns that are surrounded mostly by others of different classes (e.g. Devijver & Kittler, 1980).
- Condensing: concerning the selection of a subset of the training set without changing the nearest neighbor decision boundary substantially, i.e. leaving unchanged the patterns near the decision boundary and removing the ones far from the boundary (e.g. Huang & Chow, 2006; Sánchez, 2004).

From another perspective, the existing approaches can be classified as being either deterministic or non-deterministic, depending on whether or not the number of prototypes generated by the algorithm can be fixed a priori. An excellent survey of the field is reported in Bezdek and Kuncheva (2001), where several methods for finding prototypes are discussed: in Section 2 a brief review of some methods from both classes is presented.

Since prototype reduction techniques can be used both for reducing the computational time and for improving the performance of a nearest-neighbor based classifier, in this work an exhaustive evaluation of different methods for creating a good set of prototypes is performed by coupling them with different classification approaches: the 1-nearest-neighbor classifier; the 5-nearest-neighbor classifier; the nearest feature line based classifier (NFL); the nearest feature plane based classifier (NFP). The prototype reduction techniques considered are: a creative approach based on particle swarm optimization (with two different initializations: random selection or clustering); a creative approach based on a genetic algorithm; and the well-known method named learning prototypes and distances (LPD) (again with two different initializations: random selection or clustering). Moreover, we suggest a new method for the generation of ensembles based on multiple prototype generation: an easy way to improve the classification performance of a nearest-neighbor based classifier is to repeat, during the training phase, the prototype generation N times. Each of the resulting N sets of prototypes is used to separately classify each test pattern; finally the N scores are combined by the "vote rule".

The main findings of this work are:

- The creation of different edited training sets is an effective way for obtaining an ensemble of classifiers.
- LPD is well suited for standard nearest-neighbor classifiers, while the best method for prototype reduction using the other nearest-neighbor based classifiers (NFP, NFL) is the genetic algorithm.

The paper is organized as follows: in Section 2 a review of existing works about prototype reduction is reported; in Section 3 the systems tested in this paper are detailed, discussing the different techniques for prototype generation and the classification systems; in Section 4 the experimental results are presented; finally, in Section 5 some concluding remarks are given.

2. Related works on prototype reduction

2.1. Creative methods

A recent approach for prototype reduction, called learning prototypes and distances (LPD) (Parades & Vidal, 2006), is based on the search of a reduced set of prototypes and a suitable local metric for these prototypes. Starting with an initial random selection of a small number of prototypes, LPD iteratively adjusts their position and their local metric according to a rule that minimizes a suitable estimation of the classification error probability. Parades and Vidal (2006) show that LPD outperforms several state-of-the-art approaches based on learning vector quantization (Kohonen, 2001). In Nanni and Lumini (2009b) a relatively novel evolutionary computation algorithm, particle swarm optimization (PSO), is applied to generate an optimal set of prototypes. Starting from an initial random selection of a small number of training patterns, a new set of prototypes is generated using PSO with the aim of minimizing the classification error rate on the training set. The reported experiments demonstrate that the PSO approach obtains very good performance and discovers optimal solutions very quickly. A similar approach is proposed in Li et al. (2005), where the information initially contained in the samples of each class is exploited to edit the training set, and then a condensing process is applied to retain only the prototypes in the boundaries. The combination of different techniques has been investigated in Dasarathy, Sanchez, and Townsend (2000), where prototype reduction techniques such as minimal consistent set selection are combined with editing techniques such as proximity graphs.

2.2. Selective methods

One of the first selective approaches based on editing is proposed by Wilson (1972) and consists in the elimination of the patterns that are not correctly classified by the k-NN rule applied to the training set. In Tomek (1976) an iterative application of Wilson's algorithm is proposed, repeated until the training set is no longer modified or a maximum number of iterations is reached. Other variants of Wilson's algorithm are proposed in Vazquez, Sanchez, and Pla (2005). A genetic algorithm is detailed in Kuncheva (1995) as a way of editing the training set for NN classification. Another work based on an evolutionary algorithm for prototype selection is García, Cano, and Herrera (2008), where a model of memetic algorithm is presented that incorporates an ad hoc local search specifically designed for optimizing the prototype selection step. The reported results show that the memetic approach outperforms several previously studied methods, especially when the dataset scales up. An empirical comparison of four representative evolutionary algorithms for prototype selection is presented in Ramón Cano, Herrera, and Lozano (2003); moreover, the same authors included in Ramón Cano, Herrera, and Lozano (2006) a comparison between evolutionary algorithms and non-evolutionary pattern selection algorithms, reporting that the evolutionary methods outperform the non-evolutionary ones. In Riquelme, Aguilar-Ruiz, and Toro (2003) a condensing approach for finding representative patterns is proposed, based on the idea of removing the patterns far from the classification boundaries and retaining only the patterns close to the decision boundaries. Another condensing approach, named "depuration", is proposed in Barandela and Gasca (2000): from the analysis of the k nearest neighbors of each pattern, some patterns are re-labeled as belonging to a class c (independently of their original one) if at least k1 patterns among the k considered belong to the same class c; the main limitation of this method is that it is based on the variation of the labels of some patterns in the training set.

3. Proposed system

In this section the proposed ensemble based on the perturbation of the training patterns is described: each classifier of the ensemble is trained with a different set of prototypes obtained from the original training set by iterating a prototype reduction technique. The schema of the whole approach is shown in Figs. 1 and 2, where the training and the test phases are outlined. The components of the system are detailed in the following subsections.

Fig. 1. Training and test phases of the proposed system: during training, seed generation and prototype optimization are repeated to obtain N sets of prototypes, each used to train one classifier; during testing, a query pattern is classified by each of the N classifiers and the outputs are combined by the vote rule.

3.1. Seed generation

Many approaches for prototype generation are based on the optimization of an initial population of prototypes (seeds), which are "moved" in order to maximize a fitness function. In most cases the initial solution is based on a random selection of a set of patterns from the available training set. In this work we use a clustering method for the generation of seeds based on the well-known fuzzy c-means clustering algorithm (Bezdek, 1981). In order to balance the number of prototypes per class, the training set is first divided into c subsets according to the labels of the patterns (where c is the total number of classes), then each subset is separately processed by fuzzy c-means (MATLAB code available in the "Clustering Toolbox", http://www.cs.ucl.ac.uk/staff/D.Corney/ClusteringMatlab.html). The fuzzy c-means clustering is based on the minimization of the following objective function:

J = Σ_{i=1..Q} Σ_{j=1..K} u_ij^α ||x_i − c_j||^2,   (1)

where α is the degree of "fuzzification" (a real number greater than 1, fixed to 1.25 in our experiments), u_ij is the degree of membership of the sample x_i to the cluster j, c_j is the centre of the cluster j, Q is the number of training samples, K is the number of clusters, and ||·|| is the Euclidean distance expressing the similarity between any measured data and the centre of a cluster. The number of clusters K is a parameter of the fuzzy c-means algorithm, which in this work is automatically calculated by means of the minimum message length criterion (Figueiredo & Jain, 2002). The final seed set is obtained by selecting the centers of the final set of clusters.
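As a concrete illustration of this seed-generation step, the following is a minimal sketch (in Python/NumPy, not the authors' MATLAB implementation): fuzzy c-means is run separately on each class and the resulting centres are collected as seeds. For simplicity the number of clusters per class is passed as a parameter K_per_class rather than being selected by the minimum message length criterion; the names fuzzy_cmeans and generate_seeds are illustrative.

import numpy as np

def fuzzy_cmeans(X, K, alpha=1.25, n_iter=100, tol=1e-5, seed=0):
    # Minimal fuzzy c-means minimizing Eq. (1); returns the K cluster centres.
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], K))
    U /= U.sum(axis=1, keepdims=True)                 # membership degrees u_ij
    for _ in range(n_iter):
        Um = U ** alpha
        C = (Um.T @ X) / Um.sum(axis=0)[:, None]      # centres c_j
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        U_new = d ** (-2.0 / (alpha - 1.0))
        U_new /= U_new.sum(axis=1, keepdims=True)     # standard FCM membership update
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return C

def generate_seeds(X, y, K_per_class):
    # Class-balanced seeds: one fuzzy c-means run per class (Section 3.1).
    seeds, labels = [], []
    for c in np.unique(y):
        C = fuzzy_cmeans(X[y == c], K_per_class)
        seeds.append(C)
        labels.append(np.full(len(C), c))
    return np.vstack(seeds), np.concatenate(labels)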

Fig. 2. The distance dL between a given test object y and the feature line that links x_i and p_i.

3.2. Prototypes optimization

In this section the three prototype optimization procedures tested in this work are detailed. The first two require an initial population of seeds, which can be obtained by random generation or using the clustering approach described above (denoted by CLU in the following); the third, which is a genetic algorithm, performs an ad hoc initialization.

3.2.1. Particle swarm optimization

Particle swarm optimization (PSO) is a population based stochastic optimization technique (Kennedy & Eberhart, 2001) inspired by the social behavior of bird flocking or fish schooling. PSO has been successfully applied to prototype reduction (Nanni & Lumini, 2009b) according to the following procedure: given an initial population of P "particles" (P is the swarm size), represented by their position in the N-dimensional space x_i = (x_i1, x_i2, ..., x_iN), each particle is moved in the solution space according to a velocity v_i = (v_i1, v_i2, ..., v_iN) in order to optimize a fitness function. The position x_i and the velocity v_i of each particle are updated at each iteration according to the following equations:

v_i = w · v_i + c_1 · Rand() · (o_i − x_i) + c_2 · Rand() · (o_g − x_i),   (2)

x_i = x_i + v_i,   i = 1, ..., P,   (3)

where o_i is the best previous position of each particle according to the PSO fitness rule, g is the index of the best particle position, c_1 and c_2 (named acceleration constants) represent the weighting of the stochastic acceleration terms that pull each particle toward the local and global best positions (in our experiments c_1 = 3 and


c_2 = 1); Rand() is a random function in the range [0, 1]; w (named inertia weight) is a parameter that provides a balance between global and local exploration. The value v_i is bounded by the parameter v_max = 0.1, which determines the maximum dimension of the steps through the solution space. The parameter w is computed at each iteration it = 1 ... MAXIT as a linear function of w_start = 2.5 and w_end = 1 according to the following equation:

w = (w_start − w_end) · (MAXIT − it) / MAXIT + w_end.   (4)

The rationale of Eq. (2) is that the first part provides each particle with a "memory" of its last velocity, which decreases at each iteration according to Eq. (4); the second part represents the optimization of a given particle according to its own flying experience; the third part (the "social" one) represents the optimization of a given particle according to the companions' flying experience. PSO is adapted for prototype reduction by using a particle to represent a set of K prototypes and iteratively updating the swarm positions according to Eqs. (2) and (3). The initialization is performed by randomly extracting seeds among the training patterns or taking the centroids of the fuzzy c-means clusters. The fitness function is the minimization of the classification error on the training set. In this work, we run PSO with P = 20 particles and MAXIT = 100.
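A minimal sketch of how Eqs. (2)-(4) can be applied to prototype reduction is given below (illustrative, not the authors' implementation). A particle encodes a complete, flattened set of prototypes and the fitness is the 1-NN error on the training set. The seeds argument is assumed to be the (prototypes, labels) pair produced by a seed-generation step such as the generate_seeds sketch above, and the swarm is initialized here with small random perturbations of the seeds, which is only one possible choice; the per-dimension random factors are likewise an assumption of the sketch.

import numpy as np

def pso_prototypes(X, y, seeds, P=20, MAXIT=100, c1=3.0, c2=1.0,
                   w_start=2.5, w_end=1.0, v_max=0.1, seed=0):
    rng = np.random.default_rng(seed)
    proto0, proto_labels = seeds              # initial prototypes and their class labels
    dim = proto0.size                         # K prototypes x d features, flattened

    def fitness(flat):
        # 1-NN classification error of the training set against this particle.
        prot = flat.reshape(proto0.shape)
        nn = np.argmin(((X[:, None, :] - prot[None, :, :]) ** 2).sum(-1), axis=1)
        return np.mean(proto_labels[nn] != y)

    pos = proto0.ravel() + 0.01 * rng.standard_normal((P, dim))   # initial swarm
    vel = np.zeros((P, dim))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    g = int(pbest_fit.argmin())                                   # global best index
    for it in range(1, MAXIT + 1):
        w = (w_start - w_end) * (MAXIT - it) / MAXIT + w_end      # Eq. (4)
        vel = (w * vel
               + c1 * rng.random((P, dim)) * (pbest - pos)
               + c2 * rng.random((P, dim)) * (pbest[g] - pos))    # Eq. (2)
        vel = np.clip(vel, -v_max, v_max)                         # bound steps by v_max
        pos = pos + vel                                           # Eq. (3)
        fit = np.array([fitness(p) for p in pos])
        better = fit < pbest_fit
        pbest[better], pbest_fit[better] = pos[better], fit[better]
        g = int(pbest_fit.argmin())
    return pbest[g].reshape(proto0.shape), proto_labels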

3.2.2. Learning prototypes and distances

The LPD method (Parades & Vidal, 2006) is a well known prototype reduction approach based on a weighted metric that, starting from an initial population of seeds, iteratively updates their positions and the weights associated with each of them. The metric used by LPD is a weighted Euclidean distance from the set of prototypes p_i ∈ R^N:

d_w(x, p_i) = sqrt( Σ_{j=1..N} w_i(j)^2 (x(j) − p_i(j))^2 ),

where w_i is a weight vector associated with the prototype p_i. The idea of LPD is based on the minimization of a fitness function J which measures an approximate classification error using the nearest-neighbor rule applied to the actual set of prototypes and weights. The updating procedure iteratively evaluates each sample of the training set T and updates the positions P and weights W associated with its nearest-neighbor prototypes of the same class and of a different class according to the fitness function J, as reported in the following pseudo-code:

LPD(T, P, W) {
  κ' = 1; κ = J(P, W); P' = P; W' = W;
  while (|κ' − κ| > ε)
    κ' = κ;
    for all x ∈ T
      y= = sameClass(P, W, x)
      y≠ = diffClass(P, W, x)
      i = index(y=)
      k = index(y≠)
      R1 = S'(d_w(x, y=) / d_w(x, y≠)) · 1 / (d_w(x, y≠) · d_w(x, p_i))
      R2 = S'(d_w(x, y=) / d_w(x, y≠)) · d_w(x, y=) / (d_w(x, y≠)^2 · d_w(x, p_k))
      for j = 1 ... N
        p'_i(j) = p'_i(j) − μ · w_i(j)^2 · (p_i(j) − x(j)) · R1
        p'_k(j) = p'_k(j) + μ · w_k(j)^2 · (p_k(j) − x(j)) · R2
        w'_i(j) = w'_i(j) − λ · w_i(j) · (p_i(j) − x(j))^2 · R1
        w'_k(j) = w'_k(j) + λ · w_k(j) · (p_k(j) − x(j))^2 · R2
      end
    end
    P = P'; W = W'; κ = J(P, W);
  end
}

where d_w(·,·) is the weighted Euclidean distance, the functions diffClass(P, W, x) and sameClass(P, W, x) find the nearest prototypes y≠ and y= of x that belong to a different class and to the same class of x, respectively, λ and μ are fixed parameters denoting the learning factors, and the objective function is

J(P, W) = (1/Q) Σ_{x ∈ T} S(r(x)),   with r(x) = d_w(x, y=) / d_w(x, y≠),

where S(z) = 1 / (1 + e^{β(1−z)}) and its derivative is S'(z) = β e^{β(1−z)} / (1 + e^{β(1−z)})^2.

In our experiments β and ε are set to 10 and 0.001, respectively, and the number of patterns selected as starting prototypes is fixed to 5% of the cardinality of the training set.

3.2.3. Genetic algorithm

A genetic algorithm (GA), implemented here as in the GAOT MATLAB toolbox (http://www.ie.ncsu.edu/mirage/GAToolBox/gaot/), is an optimization method inspired by the process of natural evolution, which iteratively evolves a population of chromosomes (the candidate solutions of the problem). In order to design a GA for prototype reduction, the GA is trained to cluster the training patterns according to a fitness function, and the centroid of each cluster is selected as a prototype. Having fixed the number of resulting clusters to K, we propose an encoding scheme where each chromosome is a string whose length is determined by the number of patterns in the training set and whose values specify a label l for each pattern denoting a particular cluster. A label l can range between 0 and K, where 0 denotes "no cluster". The initial population is a randomly generated set of chromosomes. Moreover, in our experiments, we added to the initial population a chromosome obtained as the result of a K-centers clustering algorithm on the training set (Hartigan, 1975). The number K of generated prototypes is set to 5% of the cardinality of the training set. Our implementation of the basic operators of selection, crossover and mutation of the GA is the following:

- Selection: a cross-generational selection strategy is used; assuming a population of size D (in this paper D = 50), the offspring double the size of the population and the best D individuals from the combined parent-offspring population are retained.
- Crossover: uniform crossover is used, with crossover probability fixed to 0.96.
- Mutation: the mutation probability is fixed to 0.02.

3.3. Classification systems

In this section the nearest-neighbor based classifiers tested in this work are detailed: the k-nearest-neighbor classifier, the nearest feature line based classifier, and the nearest feature plane based classifier.

3.3.1. k-Nearest neighbor classifier (KNN)

KNN is a simple classification method which does not require any prior knowledge about the distribution of the data: an object is classified based on the closest training examples in the feature space according to a majority vote rule among its k nearest neighbors. If k = 1, then the object is simply assigned to the class of its nearest neighbor. In this work the Euclidean distance is used.


3.3.2. Nearest feature line (NFL)

The nearest feature line (NFL) classifier is an extension of KNN useful when more than one sample per class is available but the sample size is small with respect to the number of features. The idea of NFL (Li & Lu, 1999) is to take advantage of multiple samples per class to infer intra-class variations from them and extend their capacity to represent pattern classes. A set of feature lines that link each pair of training objects belonging to the same class is used to represent the class (instead of the points associated with the objects). The classification of an unknown object is performed according to the nearest feature line distance, that is the minimal distance between the query object and its projection onto the feature lines generated between the training objects. Using a feature line instead of a single object for classification purposes gives the advantage of providing information about the possible linear variants of two sample points not covered by them. The main drawback is the computational complexity with large training sets; for that reason several methods have been proposed to reduce the computation cost of NFL (Zheng, Zhao, & Zou, 2004; Zhou, Zhang, & Wang, 2004). In Nanni and Lumini (2008) an efficient NFL-based classifier is proposed that reduces the number of feature lines by considering only the lines that link the training patterns to a set of prototypes. An efficient method based on clustering is used for finding subgroups of similar patterns whose centroid is used as prototype, and a learning method is used to iteratively adjust both the position and the local metric of the prototypes. Given a training set T containing Q objects and a test object y, dL denotes the distance between y and its projection on the feature line that links a training pattern x_i ∈ T and a prototype of the same class p_i (Fig. 2):

dL(y, x_i p_i) = || y − ( x_i + ((y − x_i)^T (p_i − x_i)) / ((p_i − x_i)^T (p_i − x_i)) · (p_i − x_i) ) ||_2.

The classification is performed according to the nearest neighbor rule (Gao & Wang, 2007), considering all the feature lines that link the training patterns in T and the prototypes of the same class.
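The point-to-line distance dL and the resulting classification rule can be sketched as follows (an illustrative Python/NumPy version; the function names nfl_distance and nfl_classify are not from the paper, and the reduced prototype set is simply passed in as arrays):

import numpy as np

def nfl_distance(y, x, p):
    # dL of Section 3.3.2: distance between query y and the feature line
    # through x and p, i.e. the norm of y minus its projection onto that line.
    d = p - x
    t = np.dot(y - x, d) / np.dot(d, d)
    return np.linalg.norm(y - (x + t * d))

def nfl_classify(y, X_train, y_train, prototypes, proto_labels):
    # Nearest feature line rule over the lines linking each training pattern
    # to each prototype of its own class (the reduced-line variant above).
    best, best_cls = np.inf, None
    for xi, ci in zip(X_train, y_train):
        for pj in prototypes[proto_labels == ci]:
            if np.allclose(xi, pj):
                continue                      # skip degenerate lines
            dist = nfl_distance(y, xi, pj)
            if dist < best:
                best, best_cls = dist, ci
    return best_cls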

3.3.3. Nearest feature plane (NFP)

The nearest feature plane (NFP) classifier (Chien & Wu, 2002) is another extension of KNN, based on an idea similar to NFL, where a set of feature planes (instead of feature lines) is used to represent each class. At least three linearly independent samples are needed for each class in order to define at least one feature plane. The classification of an object is performed according to the nearest feature plane distance, that is the minimal distance between the query object and its projection onto the feature planes. Unfortunately, the computational complexity with large training sets is higher than for NFL; for that reason a variant, named genetic feature plane classifier, is proposed in Nanni and Lumini (2009a), which considers the feature planes built using few prototypes per class, obtained by a genetic-based condensing algorithm. First, a reduced prototype set P for each class is obtained by selecting each prototype p_i as the centroid of the patterns belonging to the ith cluster of the class. Then the distance dP between a test pattern y and the feature plane built using three prototypes p_i, p_j and p_z belonging to the same class is calculated using the following equations:

dP = || y − p_y^{ijz} ||,

p_y^{ijz} = [p_i p_j p_z] ( [p_i p_j p_z]^T [p_i p_j p_z] )^{-1} [p_i p_j p_z]^T y.

The classification of y is performed according to the nearest neighbor rule, i.e. according to the class of the nearest feature plane.
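A corresponding sketch of the feature-plane distance dP (again illustrative, not the authors' code): the projection onto the span of the three prototypes is obtained here with a least-squares solve, which is equivalent to the closed-form expression above when the prototypes are linearly independent.

import numpy as np

def nfp_distance(y, p_i, p_j, p_z):
    # dP of Section 3.3.3: distance between query y and the feature plane
    # spanned by the three prototypes p_i, p_j, p_z.
    P = np.column_stack([p_i, p_j, p_z])            # d x 3 prototype matrix
    coeffs, *_ = np.linalg.lstsq(P, y, rcond=None)  # least-squares projection coefficients
    return np.linalg.norm(y - P @ coeffs)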

Fig. 3. Pseudo-code of the proposed method.

Table 1 Characteristics of the datasets used in the experimentation: number of attributes (M), number of samples (T), number of classes (C).

DATASET | M  | T   | C
IO      | 34 | 351 | 2
HEART   | 13 | 150 | 2
PIMA    |  8 | 768 | 2
BR      |  9 | 699 | 2
WINE    | 13 | 178 | 3
HIV     | 50 | 362 | 2
SPECTF  | 44 | 267 | 2
PROC    | 30 | 996 | 3
VEHICLE | 18 | 846 | 4

3.4. Ensemble creation

The method for ensemble creation proposed in this work, named "supervised bagging", is based on the generation of N different prototype sets from the same training set and the combination by the "vote rule" (Kittler, Hatef, Duin, & Matas, 1998) of the resulting N classifiers (here N = 9). Repeating the prototype generation step N times, the number of prototypes to be stored is K × N, which is anyway lower than the cardinality T of the original training set. The pseudo-code of the "supervised bagging" method, divided into training and test phases, is reported in Fig. 3. Given as input a training set TR containing T labeled samples x_i for the training phase and a set TE of E samples y_i for the test phase, during the training the T samples are processed N times by means of the function Proto() in order to obtain a set PR of K labeled prototypes. During the test, the unknown samples from the test set are assigned to a class according to a given set of prototypes by means of the function Classify(); the final decision is then obtained by applying the function VoteRule() to the set of labels predicted by the N classifiers.

4. Experiments

We perform experiments in order to: (i) compare the different classifiers; (ii) compare the different approaches for prototype reduction; (iii) compare the classification performance of our best method with other nearest neighbor based classifiers. The experiments have been conducted on nine benchmark datasets. Eight are from the UCI Repository (http://archive.ics.uci.edu/ml/datasets.html): Ionosphere (IO), Heart (HE), Pima Indians Diabetes (PI), Wisconsin Breast Cancer Databases (BR), Cardiac Single Proton Emission Computed Tomography (SPECTF), Wine (WI), Vehicle classification (VEI), and prokaryotic subcellular localization prediction (PROC); one is the HIV protease dataset (HIV) (Rögnvaldsson & You, 2004), available at http://idelnx81.hh.se/bioinf/data.html.
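Before presenting the results, the following is a minimal sketch of the "supervised bagging" training and voting procedure of Section 3.4 (illustrative, not the authors' code). Here proto_fn stands in for the Proto() function of Fig. 3 and can be any of the reduction methods sketched above, provided it is stochastic so that the N prototype sets differ; a 1-NN rule plays the role of the Classify() step.

import numpy as np
from collections import Counter

def supervised_bagging_train(X, y, proto_fn, N=9):
    # Training phase: repeat the supervised prototype generation N times.
    # proto_fn(X, y) must return a (prototypes, prototype_labels) pair.
    return [proto_fn(X, y) for _ in range(N)]

def supervised_bagging_classify(query, prototype_sets):
    # Test phase: 1-NN decision against each prototype set, combined by the vote rule.
    votes = []
    for prot, labels in prototype_sets:
        nn = np.argmin(((prot - query) ** 2).sum(axis=1))
        votes.append(labels[nn])
    return Counter(votes).most_common(1)[0][0]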

Table 2. Performance obtained by 1NN (error rate %; SA = stand-alone, EN = ensemble).

DATASET | ALL (SA) | CLU SA | CLU EN | LPD SA | LPD EN | CLU+LPD SA | CLU+LPD EN | PSO SA | PSO EN | CLU+PSO SA | CLU+PSO EN | GA SA | GA EN
IO      | 14.2 | 16.2 | 16.2 | 14.1 | 8.9  | 10.4 | 10.1 | 13.5 | 10.6 | 13.7 | 9.6  | 10.7 | 10.6
HEART   | 21.6 | 23.4 | 23.4 | 17.6 | 14.9 | 18.1 | 18.1 | 23.2 | 16.2 | 17.6 | 16.5 | 19.2 | 16.5
PIMA    | 29.4 | 27.9 | 27.3 | 27.8 | 25.8 | 27.3 | 27.2 | 26.6 | 26.0 | 26.3 | 24.4 | 28.8 | 26.4
BR      | 4.0  | 7.1  | 6.7  | 3.6  | 3.3  | 4.2  | 3.8  | 4.1  | 3.8  | 4.0  | 3.3  | 3.7  | 3.5
WINE    | 5.6  | 5.6  | 5.1  | 4.6  | 3.6  | 3.6  | 3.6  | 7.6  | 4.0  | 9.7  | 4.0  | 6.6  | 4.6
HIV     | 23.2 | 10.8 | 10.8 | 13.0 | 18.5 | 17.9 | 17.7 | 20.1 | 13.8 | 19.3 | 14.0 | 12.4 | 9.3
SPECTF  | 31.4 | 25.8 | 23.5 | 23.6 | 22.4 | 23.2 | 23.1 | 20.9 | 20.9 | 22.7 | 21.3 | 25.4 | 26.1
PROC    | 15.9 | 19.5 | 18.1 | 16.3 | 13.4 | 19.1 | 18.4 | 22.3 | 20.4 | 24.7 | 21.6 | 19.5 | 16.4
VEHICLE | 32.0 | 38.0 | 37.4 | 30.1 | 26.1 | 28.4 | 27.3 | 39.7 | 35.1 | 41.6 | 35.7 | 43.3 | 40.1
RANK    | 9.2  | 9.5  | 8.7  | 7.1  | 4.6  | 7.2  | 6.4  | 9.1  | 6.2  | 9.1  | 5.9  | 8.9  | 6.7

Table 3. Performance obtained by 5NN (error rate %; SA = stand-alone, EN = ensemble).

DATASET | ALL (SA) | CLU SA | CLU EN | LPD SA | LPD EN | CLU+LPD SA | CLU+LPD EN | PSO SA | PSO EN | CLU+PSO SA | CLU+PSO EN | GA SA | GA EN
IO      | 16.6 | 21.8 | 21.0 | 29.1 | 27.4 | 25.6 | 24.5 | 42.5 | 29.7 | 42.5 | 32.2 | 13.8 | 13.2
HEART   | 19.2 | 16.0 | 15.7 | 21.1 | 16.3 | 17.1 | 15.7 | 22.9 | 15.5 | 22.4 | 16.3 | 16.3 | 14.7
PIMA    | 28.5 | 31.6 | 31.6 | 30.1 | 24.3 | 26.3 | 25.5 | 29.0 | 25.9 | 30.0 | 26.6 | 30.1 | 30.1
BR      | 3.6  | 7.9  | 7.2  | 4.8  | 5.1  | 5.0  | 5.0  | 3.8  | 3.3  | 3.5  | 3.1  | 4.6  | 4.7
WINE    | 4.6  | 7.7  | 7.6  | 10.7 | 8.7  | 9.2  | 9.2  | 15.3 | 8.1  | 17.8 | 9.6  | 7.1  | 6.6
HIV     | 25.1 | 29.0 | 29.0 | 31.2 | 31.5 | 31.5 | 31.5 | 27.6 | 22.6 | 28.5 | 28.5 | 11.3 | 10.8
SPECTF  | 25.4 | 28.7 | 28.7 | 23.9 | 23.9 | 21.3 | 21.3 | 25.8 | 23.1 | 27.6 | 23.8 | 22.0 | 22.0
PROC    | 17.1 | 79.7 | 81.5 | 31.7 | 27.5 | 28.9 | 28.7 | 53.6 | 26.7 | 41.7 | 23.9 | 15.5 | 14.8
VEHICLE | 32.1 | 85.8 | 85.8 | 45.2 | 41.3 | 43.2 | 43.6 | 52.4 | 45.7 | 54.4 | 44.1 | 49.8 | 48.7
RANK    | 5.9  | 9.2  | 9.2  | 8.9  | 7.5  | 7.3  | 7.1  | 9.3  | 6.1  | 9.4  | 7.1  | 6.2  | 5.5

The HIV dataset contains octamer protein sequences, each of which needs to be classified as an HIV protease cleavable site or uncleavable site. An octamer protein sequence is a peptide (small protein) denoted by P = P4 P3 P2 P1 P1' P2' P3' P4', where each Pi is an amino acid belonging to the alphabet Σ = {A, C, D, ..., V, W, Y}. The scissile bond is located between positions P1 and P1'. The HIV dataset contains 362 octamer protein sequences (114 cleavable and 248 uncleavable). To extract the features, the standard orthonormal representation is used: each amino acid is represented by a 20-bit vector with 19 bits set to zero and one bit set to one, so that each amino acid vector is orthogonal to all other amino acid vectors. The features are then projected onto a lower 50-dimensional space by the Karhunen–Loève feature transform.

A summary of the characteristics of these datasets (number of attributes, number of samples, number of classes) is reported in Table 1. As suggested by many classification approaches, all the datasets have been normalized between 0 and 1. To minimize possible misleading effects caused by the training data, the results have been obtained using a two-fold cross validation on each dataset and averaged over ten experiments.

In Tables 2–5 the performance of the tested systems is reported; for each prototype reduction method two values are given: the error rate of the stand-alone method (SA) and the error rate of the ensemble combined by the vote rule (EN).


Table 4. Performance obtained by NFP (error rate %; SA = stand-alone, EN = ensemble).

DATASET | CLU SA | CLU EN | LPD SA | LPD EN | CLU+LPD SA | CLU+LPD EN | PSO SA | PSO EN | CLU+PSO SA | CLU+PSO EN | GA SA | GA EN
IO      | 14.3 | 14.7 | 16.0 | 24.4 | 40.1 | 16.5 | 21.4 | 40.2 | 29.6 | 33.0 | 15.9 | 11.5
HEART   | 20.0 | 20.0 | 20.5 | 21.1 | 20.9 | 20.5 | 22.2 | 20.3 | 23.5 | 18.7 | 23.2 | 21.3
PIMA    | 33.7 | 33.4 | 33.4 | 31.6 | 35.5 | 33.4 | 33.7 | 29.8 | 43.9 | 30.3 | 32.6 | 32.4
BR      | 7.4  | 6.5  | 15.4 | 9.7  | 5.3  | 13.7 | 11.9 | 5.0  | 5.5  | 5.1  | 13.1 | 10.4
WINE    | 7.1  | 5.6  | 7.7  | 7.6  | 32.7 | 7.6  | 8.2  | 11.2 | 20.4 | 16.0 | 9.1  | 7.1
HIV     | 11.3 | 11.6 | 11.1 | 15.2 | 23.2 | 10.7 | 15.5 | 22.1 | 21.3 | 22.6 | 13.5 | 10.7
SPECTF  | 26.5 | 26.5 | 25.4 | 26.1 | 23.9 | 25.0 | 26.0 | 23.9 | 23.9 | 23.8 | 28.3 | 23.1
PROC    | 13.3 | 12.4 | 16.3 | 15.8 | 44.8 | 15.1 | 20.1 | 24.7 | 30.3 | 26.9 | 23.1 | 16.1
VEHICLE | 27.9 | 27.8 | 30.9 | 27.0 | 46.8 | 29.6 | 33.5 | 38.1 | 52.1 | 36.7 | 36.9 | 30.1
RANK    | 5.5  | 5.2  | 6.9  | 6.4  | 9.0  | 6.3  | 8.2  | 7.1  | 9.2  | 6.8  | 8.2  | 5.3

Table 5. Performance obtained by NFL (error rate %; SA = stand-alone, EN = ensemble).

DATASET | CLU SA | CLU EN | GA SA | GA EN
IO      | 13.5 | 13.1 | 12.4 | 11.8
HEART   | 19.6 | 19.1 | 19.6 | 19.6
PIMA    | 32.9 | 32.9 | 28.9 | 28.1
BR      | 7.2  | 7.0  | 5.3  | 4.0
WINE    | 6.6  | 5.1  | 5.6  | 5.1
HIV     | 12.4 | 11.8 | 12.4 | 11.8
SPECTF  | 26.1 | 25.3 | 26.5 | 27.1
PROC    | 12.7 | 12.2 | 14.1 | 11.7
VEHICLE | 28.3 | 28.3 | 30.7 | 28.6
RANK    | 4.7  | 4.2  | 4.8  | 4.2

The tested systems are named as follows:

- ALL denotes the result of classification from the original training set, without prototype reduction.
- CLU denotes the result of classification performed on a reduced set of prototypes obtained as the centroids of the clusters found by the fuzzy clustering algorithm on the training set.
- LPD denotes the result of classification from the reduced prototype set obtained by LPD with random initialization.
- CLU+LPD denotes the result of classification from the reduced prototype set obtained by LPD initialized by clustering.
- PSO denotes the result of classification from the reduced prototype set obtained by PSO with random initialization.
- CLU+PSO denotes the result of classification from the reduced prototype set obtained by PSO initialized by clustering.
- GA denotes the result of classification from the reduced prototype set obtained by the GA.

The row RANK reports the average rank of the given classifiers on the tested datasets (e.g., if a classifier always obtains the best performance in each dataset, its rank is 1). All the ensemble methods work very well when the 1NN classifier is used; PSO works well on several datasets but works poorly on PROC and VEHICLE, so more studies should probably be performed on the parameters of the PSO optimizer, on more datasets, in order to build a more stable method.


Table 6. Comparison among different methods (error rate %).

DATASET | 1NN (SA) | 5NN (SA) | 1NN+LPD (SA) | NFP-EN | NFL-EN | LP (SA) | SB-EN | B-EN | SV (SA) | SV-EN
IO      | 14.2 | 16.6 | 14.1 | 14.7 | 11.8 | 10.7 | 8.9  | 14.6 | 4.9  | 4.8
HEART   | 21.6 | 19.2 | 17.6 | 20.0 | 19.6 | 14.9 | 14.9 | 20.8 | 17.3 | 15.2
PIMA    | 29.4 | 28.5 | 27.8 | 33.4 | 28.1 | 27.0 | 25.8 | 30.1 | 23.5 | 24.6
BR      | 4.0  | 3.6  | 3.6  | 6.5  | 4.0  | 3.4  | 3.3  | 3.6  | 4.9  | 3.7
WINE    | 5.6  | 4.6  | 4.6  | 5.6  | 5.1  | 3.5  | 3.6  | 4.1  | 3.8  | 2.9
HIV     | 23.2 | 25.1 | 13.0 | 11.6 | 11.8 | 26.7 | 18.5 | 21.7 | 10.8 | 10.6
SPECTF  | 31.4 | 25.4 | 23.6 | 26.5 | 27.1 | 23.1 | 22.4 | 28.5 | 22.2 | 21.4
PROC    | 15.9 | 17.1 | 16.3 | 12.4 | 11.7 | 13.7 | 13.4 | 14.9 | 14.8 | 13.7
VEHICLE | 32.0 | 32.1 | 30.1 | 27.8 | 28.6 | 28.9 | 26.1 | 30.5 | 25.8 | 20.2
RANK    | 8.6  | 8.1  | 6.7  | 7.6  | 6.7  | 5.6  | 4.8  | 7.9  | 5.2  | 4.4

Considering all the datasets, the best approach is LPD. It is interesting to note that in no dataset is ALL the best method. Using 5NN or NFP as classifiers, only GA obtains good performance with both; notice that the other methods were created for optimizing a 1-NN classifier with few prototypes. Due to computational issues, for NFL only the methods CLU and GA are tested, and for NFP the ALL configuration is not tested.

Table 6 reports a comparison among the best tested methods, other nearest neighbor based classifiers and the state of the art among classifiers, the support vector machine (SV). Table 6 reports the error rate obtained by the following approaches:

- kNN, the simple k-nearest neighbor classifier (Hart, 1968).
- LP, the nearest neighbor algorithm of local probability centers (Li & Chen, 2008); this method performs an editing of the training set by reducing the number of negative contributing training samples, i.e. the known samples falling on the wrong side of the ideal decision boundary, and by restricting their influence regions.
- 1NN+LPD, our best stand-alone method, i.e. a 1-NN classifier coupled with LPD.
- SB-EN, our best ensemble method, i.e. a "supervised bagging" ensemble of 1-NN classifiers coupled with LPD.
- B-EN, a bagging ensemble of 1-NN classifiers (Breiman, 1996).
- NFP-EN, the best configuration in Table 4, i.e. the NFP classifier coupled with CLU.
- NFL-EN, the best configuration in Table 5, i.e. an ensemble of NFL classifiers coupled with GA.
- SV, a support vector machine, where the kernel and the parameters are optimized on each dataset.
- SV-EN, a bagging ensemble of 50 support vector machines.

From the results reported in Table 6 the following conclusions can be drawn:

- As expected, 5NN outperforms 1NN, and 1NN+LPD outperforms both of them.
- The "supervised bagging" ensemble proposed in this work (SB-EN) outperforms the standard bagging ensemble (B-EN) on all the datasets.
- LP is a very good approach; in fact it outperforms all the stand-alone nearest neighbor based classifiers.
- SB-EN outperforms all the other nearest neighbor based classifiers and reduces the performance gap between SVM and nearest neighbor based classification systems. This is interesting since SVM and k-nearest-neighbor systems use completely different approaches for classifying patterns, so this performance improvement could be useful in order to study a fusion of k-NN and SVM.

5. Conclusion

The problem addressed in this paper is to compare several prototype reduction techniques for nearest-neighbor based classifiers. Several classification approaches and prototype reduction techniques are coupled and their behaviors are compared. Our tests have shown that prototype reduction techniques are a useful method for building an ensemble of classifiers; moreover, even if the number of retained patterns of the "supervised bagging" ensemble is evidently higher than for a stand-alone approach based on prototype reduction, the final number of patterns is, in any case, lower than in the original training set. The performance of the proposed algorithm has been evaluated on several UCI datasets and compared with some state-of-the-art approaches for prototype reduction and classification based on the nearest-neighbor rule. Our tests show that the proposed "supervised bagging" method is an effective way for building an ensemble of classifiers, and that it reduces the performance gap between support vector machines and k-nearest-neighbor based classifiers. As future work we want to design a k-nearest-neighbor based classifier which can exploit the kernel trick of SVM; in some works, such as Yu, Ji, and Zhang (2002), it is shown that coupling NN and the kernel trick can boost the performance of NN.

References

Barandela, R., & Gasca, E. (2000). Decontamination of training samples for supervised pattern recognition methods. In Proceedings of the joint IAPR international workshops SSPR and SPR 2000 (pp. 621–630).
Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms. New York: Plenum.


Bezdek, J. C., & Kuncheva, L. (2001). Nearest prototype classifier designs: An experimental study. International Journal of Intelligent Systems, 16, 1445–1473. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140. Chien, J. T., & Wu, C. C. (2002). Discriminant wavelet faces and nearest feature classifiers for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 1644–1649. Dasarathy, BV. (1991). Nearest neighbor (NN) norms: NN pattern classification techniques. Los Alamitos, CA: IEEE Press. Dasarathy, B. V., Sanchez, J. S., & Townsend, S. (2000). Nearest neighbour editing and condensing tools – Synergy exploitation. Pattern Analysis and Applications, 3, 19–30. Devijver, P. A., & Kittler, J. (1980). On the edited nearest neighbor rule. In Proceedings of the 5th international conference on pattern recognition, Miami, FL (pp. 72–80). Figueiredo, M., & Jain, A. K. (2002). Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 381–396. Gao, Q-B., & Wang, Z-Z. (2007). Center-based nearest neighbour classifier. Pattern Recognition, 40(1), 346–349. García, S., Cano, JR., & Herrera, F. (2008). A Memetic algorithm for evolutionary prototype selection: A scaling up approach. Pattern Recognition, 41(8), 2693–2709. Hart, P. (1968). The condensed NN rule. IEEE Transactions on Information Theory, 14(3), 515–516. Hartigan, J. (1975). Clustering algorithms. New York: Wiley. Huang, D., & Chow, T. W. S. (2006). Enhancing density-based data reduction using entropy. Neural Computation, 18(2), 470–495. Kennedy, J., & Eberhart, RC. (2001). Swarm intelligence. Morgan Kaufmann. Kittler, J., Hatef, M., Duin, R., & Matas, J. (1998). On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 226–239. Kohonen, T. (2001). Self-organizing maps (3rd ed.). Springer. Kuncheva, L. I. (1995). Editing for the k-nearest neighbors rule by a genetic algorithm. Pattern Recognition Letters, 16, 809–814. Li, Y., Li, J., Huang, W., Zhang, X., Zhang, Y., & Li, J. (2005). New prototype selection rule integrated condensing with editing process for the nearest neighbor rules. In Proceedings of the IEEE international conference on industrial technology (pp. 950–954). Li, B., & Chen, Y. (2008). The nearest neighbor algorithm of local probability centers. IEEE Transactions on Systems, Man, and Cybernetics Part B – Cybernetics, 39(1).

Li, S. Z., & Lu, J. (1999). Face recognition using the nearest feature line method. IEEE Transactions on Neural Networks, 10, 439–443. Nanni, L., & Lumini, A. (2008). Cluster-based nearest neighbour classifier and its application on the lightning classification. Journal of Computer Science and Technology, 23(4), 573–581. Nanni, L., & Lumini, A. (2009a). Genetic nearest feature plane. Expert Systems with Applications, 36(1), 838–843. Nanni, L., & Lumini, A. (2009b). Particle swarm optimization for prototype reduction. NeuroComputing, 72(4–6), 1092–1097. Parades, R., & Vidal, E. (2006). Learning prototypes and distances: A prototype reduction technique based on nearest neighbor error minimization. Pattern Recognition, 39, 180–188. Ramón Cano, J., Herrera, F., & Lozano, M. (2003). Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study. IEEE Transactions on Evolutionary Computation, 7(6), 561–575. Ramón Cano, J., Herrera, F., & Lozano, M. (2006). On the combination of evolutionary algorithms and stratified strategies for training set selection in data mining. Applied Soft Computing, 6, 323–332. Riquelme, J. C., Aguilar-Ruiz, J. S., & Toro, M. (2003). Finding representative patterns with ordered projections. Pattern Recognition, 36, 1009–1018. Rögnvaldsson, T., & You, L. (2004). Why neural networks should not be used for HIV-1 protease cleavage site prediction. Bioinformatics, 20(11), 1702–1709. Sánchez, J. S. (2004). High training set size reduction by space partitioning and prototype abstraction. Pattern Recognition, 37(7), 1561–1564. Tomek, I. (1976). An experiment with the edited nearest neighbor. IEEE Transactions on Systems, Man, and Cybernetics, 6(2), 121–126. Vazquez, F., Sanchez, J. S., & Pla, F. (2005). A stochastic approach to Wilson’s editing algorithm. In Proceedings of the Iberian conference on pattern recognition and image analysis (pp. 35–42). Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on System, Man and Cybernetics, 408–421. Yu, K., Ji, L., & Zhang, X. G. (2002). Kernel nearest-neighbor algorithm. Neural Processing Letters, 15, 147–156. Zheng, W., Zhao, L., & Zou, C. (2004). Locally nearest neighbor classifiers for pattern classification. Pattern Recognition, 37, 1307–1309. Zhou, Y., Zhang, C., & Wang, J. (2004). Tunable nearest neighbour classifier. Lecture Notes in Computer Science, 3175, 204–211.