Differentiating Between Individual Class Performance in Genetic Programming Fitness for Classification with Unbalanced Data

Urvesh Bhowan, Mengjie Zhang and Mark Johnston

Abstract— This paper investigates improvements to the fitness function in Genetic Programming to better solve binary classification problems with unbalanced data. Data sets are unbalanced when there is a majority of examples for one particular class over the other class(es). We show that using overall classification accuracy as the fitness function evolves classifiers with a performance bias toward the majority class at the expense of minority class performance. We develop four new fitness functions which consider the accuracies of the majority and minority classes separately to address this learning bias. Results using these fitness functions show that evolved classifiers can achieve good accuracy on both the minority and majority classes while keeping overall performance high and balanced across the two classes.

I. INTRODUCTION

Classification is the act of placing an object into a set of classes or categories based on the object’s properties or features [1]. Classification tasks arise in a wide range of real-world applications; medical diagnosis [2] [3] [4], fraud detection [5], and image recognition [6] [7] are just a few examples. Given the abundance of information now being captured and stored in electronic form, systems that can automatically search for and identify valid and useful patterns in data, with as little human intervention as possible, are fast becoming highly desirable.

Genetic Programming (GP) is a promising machine learning and search technique which has been successful in building reliable classifiers for classification problems [8] [9] [10]. It evolves computer programs to solve a particular problem, based on the principles of Darwinian evolution and natural selection [11]. In GP, programs are selected for recombination based on their ability to solve a particular problem (their fitness). In most current GP approaches to classification, a program’s fitness is its overall accuracy on the learning data (training set) [3] [9] [12].

For most problems, using a standard measure such as classification accuracy or error rate on the training set to evaluate fitness is reasonable. However, for classification problems with an uneven distribution of class examples in the training set, such measures have been shown to be unsuitable [13] [14]. Data sets have an uneven distribution of class examples when at least one class has only a small number of examples (the minority class) while the other class(es) make up the rest (the majority class).

All the authors are with the School of Mathematics, Statistics and Computer Science, Victoria University of Wellington, P.O. Box 600, Wellington, New Zealand. {urvesh.bhowan,mengjie.zhang,mark.johnston}@mcs.vuw.ac.nz

Various machine learning
approaches have shown that using an uneven distribution of class examples in the learning process can leave the learning algorithm with a performance bias: high accuracy on the majority class(es) but poor performance on the minority class(es) [15] [16]. In classification problems where minority class examples are rare but important to detect, such as medical diagnosis or fraud detection, correcting this learning bias to accurately classify both majority and minority class examples is an important area of research [13] [14].

Recent approaches have tended to involve two aspects. The first involves sampling the original data set to select a more equal representation of examples to be used in the learning algorithm, and the second involves various forms of cost adjustment within the learning algorithm to factor in the uneven representation of examples in each class. While sampling techniques are effective [17], they can add a computationally expensive overhead, can lead to over-fitting, and can require a priori task-specific knowledge about the data [18]. Due to these limitations, researchers, particularly in the GP community, have focused on algorithm-level approaches [15] [16], which are the focus of this paper.

Algorithm-level approaches have included assigning weights to class examples in training [19], using cost or penalty functions when evaluating algorithm performance [18], and considering the classification accuracies of both minority and majority classes when evaluating algorithm performance using a mixture of metrics, such as the area under the receiver operating characteristic curve (AUC) [20]. While these approaches have substantially improved minority class performance, overall accuracy can still be improved, as in many cases gains in minority class accuracy have come at the expense of majority class performance.
This paper aims to adapt the fitness functions in GP to evolve classifiers with high and reasonably balanced minority and majority class accuracy, while still trying to keep overall classification performance as high as possible. We first examine a GP approach using the standard classification accuracy as the fitness function on three data sets, each with varying levels of class imbalance, to demonstrate the effect which learning with unbalanced data can have on minority class performance. Four new fitness functions based on differentiating between the performance of the minority class and majority class together with other classification objectives are developed to improve the accuracy of both classes. Specifically, the four fitness functions aim to investigate: the effect of a different averaging function, the geometric mean, when evolving GP with two objectives; combining the overall accuracy into program fitness along
with the accuracy of both classes; using classifier output in fitness as a means to evaluate a program’s classification model; and the application of a well-known approximation to the AUC, the Wilcoxon-Mann-Whitney statistic, to the class imbalance problem.

The rest of this paper is organised as follows. Section 2 discusses some related work. Section 3 describes the three data sets used in the experiments. Section 4 describes the GP approach with standard classification accuracy as the fitness function. Section 5 examines the learning bias in the class imbalance problem more closely and discusses a strategy to adapt the fitness function to address it. Section 6 describes the new fitness functions and their results on the three data sets. Section 7 concludes the paper and gives directions for future work.

II. RELATED WORK ON CLASS IMBALANCE PROBLEMS

Much of the related work on classification problems with unbalanced data sets involves two aspects [13]. The first focuses on different forms of sampling or editing the original data set to select a more balanced representation of examples from the minority and majority classes to be used in the learning algorithm. The second involves various forms of cost adjustment within the learning algorithm to carefully consider the uneven representation of examples in each class of the original data set. This paper focuses on the second aspect, for reasons discussed below.

Sampling or data-level approaches have included bagging and boosting algorithms which use multiple individual classifiers trained on subsets of the original data, selected either by static re-sampling in bagging algorithms or active sampling in boosting algorithms [21].
Static re-sampling repeatedly selects random examples (with replacement) from the original data set to build a balanced subset of training examples [22], while active sampling favours the selection of examples not already accurately learned (usually from the minority class in class imbalance problems), using weights to influence selection probability [23]. Other effective sampling techniques to boost minority class representation in unbalanced data sets have been to over-sample the minority class by creating duplicate (minority class) training examples [24], to under-sample the majority class [17], and various combinations of the two. As over-sampling does not introduce any new information into the learning process, and under-sampling discards potentially useful learning examples from the majority class, much work has focused on discarding or editing specially selected majority class examples to downsize majority class representation in the training set [17] [25] [26]. Techniques include using Euclidean distance measures between majority class examples in feature space, eliminating noisy or atypical examples [25], and discarding majority class examples along the Tomek-link line, the borderline between opposite-class examples nearest to each other [26]. Other sampling approaches have included Random Subset Selection (RSS) and Dynamic Subset Selection (DSS) [13], or a combination of the two as seen in [18]. In [18], the
authors used a two-level hierarchical RSS and DSS method which involved first sampling “blocks” of training examples using RSS and then using DSS to sample specific majority and minority class examples from those “blocks”. While these sampling techniques are effective, they can add a computationally expensive overhead to the learning process, as in many cases sampling or editing must be applied repeatedly [16] [17]; they can lead to over-fitting and poor generalisation by a classifier, as potentially useful learning examples can be excluded from the learning process if not sampled; and they can require a priori task-specific knowledge about the data [18]. Due to these limitations, researchers have focused on “internal” or algorithm-level approaches, that is, adjusting the learning algorithm to carefully consider the uneven representation of class examples in the original data set. Examples of adjustments to the learning algorithm include Bradley’s area under the receiver operating characteristic (ROC) curve [20], weights assigned to certain examples of the minority class [17], and Holmes’ penalty factors to control the rate at which False Negative predictions are penalised over False Positive predictions in medical classification problems with unbalanced data [27]. Recent approaches have also involved differentiating between the performance of the minority and majority classes when evaluating classifier accuracy, as a means of keeping the accuracy balanced for both classes [24] [25] [26]. [24] used the geometric mean of the minority and majority class accuracies for training a UCS (rule-based supervised classification system) classifier, as did [25] for training a Multilayer Perceptron network. [26] instead used the mean of the Precision and Recall rates for successfully training Nearest Neighbour and C4.5 classifiers.
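As a concrete illustration of the data-level techniques surveyed above, random over-sampling and under-sampling can be sketched in a few lines of Python. This is a toy example; the function names and data are ours, not from the cited works:

```python
import random

def oversample_minority(minority, majority, rng=random.Random(0)):
    # Duplicate randomly chosen minority examples (with replacement)
    # until both classes have the same number of examples.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def undersample_majority(minority, majority, rng=random.Random(0)):
    # Discard randomly chosen majority examples until both classes match.
    return minority, rng.sample(majority, len(minority))

# Hypothetical 1:10 unbalanced data set of (example, label) pairs.
minority = [("x%d" % i, "min") for i in range(5)]
majority = [("y%d" % i, "maj") for i in range(50)]

mn, mj = oversample_minority(minority, majority)
assert len(mn) == len(mj) == 50   # duplicates added: no new information introduced

mn, mj = undersample_majority(minority, majority)
assert len(mn) == len(mj) == 5    # potentially useful majority examples discarded
```

The two assertions highlight the trade-off discussed above: over-sampling balances the classes without adding information, while under-sampling balances them by throwing information away.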
In GP approaches to the class imbalance problem specifically, Eggermont [19] developed an adaptive cost function for assigning and periodically re-weighting the error associated with certain hard-to-classify examples. Heywood et al. [18] developed three fitness functions for a multiple minority class classification problem, where four out of the five classes had a very small representation of training examples. These three fitness functions focused on, respectively, the number of correct classifications for each minority class; dynamically adjusting weights assigned to certain examples as a reward for correctly classifying infrequent (minority class) examples; and a hierarchical two-tier evaluation process called “tie breaker” fitness, in which a second criterion is employed to resolve classifier performance when the first criterion yields equally performing classifiers. To counter the class imbalance in two benchmark binary classification problems, Winkler et al. [10] used the weighted sum of three distinct performance criteria as a fitness measure: the root mean squared error (RMS) in training (weighted the highest), a measure of the level of agreement between predicted and expected classifier values using genetic program ranges, and a separability measure similar to the AUC. Doucette and Heywood [16] constructed a similar fitness function in their approach, combining two performance metrics, the geometric
mean of the minority class and majority class accuracy and an estimate of the AUC, the Wilcoxon-Mann-Whitney statistic, as a fitness measure. Zhang and Patterson [15] developed three new fitness functions based on the average performance of the minority class and majority class, using increasing penalties for poor performance by one class over the other.

III. CLASSIFICATION DATA SETS

The experiments in this paper used three benchmark data sets, all chosen for their uneven distribution of class examples. Two, SPECT and YEAST, were from the UCI Repository of Machine Learning Databases [28], and the third, an image data set comprising a collection of face and non-face images, from the Center for Biological and Computational Learning at MIT [29].

SPECT Heart data. This data set contains 267 data instances (patients) derived from cardiac Single Proton Emission Computed Tomography (SPECT) images. This is a binary classification task, where patient heart images are classified as normal or abnormal. The class distribution has 55 instances of the “abnormal” class (20.6%) and 212 instances of the “normal” class (79.4%), a class ratio of approximately 1:4. Each SPECT image was processed to extract 44 continuous features. These were further preprocessed to form 22 binary features (F1–F22) that make up the attributes for each instance. There are no missing attribute values. Details on each of the 22 features can be found in [28].

YEAST data. This data set contains 1482 instances, considerably more than the SPECT data set, generated for the automatic prediction of protein localisation sites in yeast cells. Patterns consist of 8 numeric features calculated from properties of amino acid sequences (F1–F8). Details on each of the 8 features can be found in [28]. There are 9 distinct classes of interest in this data set, making this a multiple class classification problem.
The distribution of examples across the 9 classes (NUC, MIT, ME3, ME2, ME1, EXC, VAC, ERL, and POX) varies from 463 examples (32%) in the largest class to 20 examples (2%) in the smallest class. As the scope of this paper is the effect of class imbalance on binary classification problems with a minority and a majority class, we can decompose this multiple class data set into many binary classification problems by considering only one of the above classes of interest as the “main” class. Each pattern can then be classified as either belonging to the main class (the minority class) or not, where everything else forms the majority class. We selected three different “main” classes, based on their decreasing representation of class examples in the data set: MIT, ME1 and POX. Each is aimed at showing the effects of class imbalance learning at a different level of imbalance, from MIT, which is reasonably unbalanced with 244 examples (16%), to ME1 and POX with 44 (3%) and 20 (1.5%) examples respectively, which are heavily unbalanced. Table I summarises the number of examples in each “main” class.

TABLE I
“MAIN” CLASS IMBALANCE LEVELS FOR YEAST DATA

Main Class | Examples in “main” class | Examples in non-main class | Imbalance Ratio
yeastMIT   | 244 (16%)                | 1238 (84%)                 | 1:5
yeastME1   | 44 (3%)                  | 1438 (97%)                 | 1:33
yeastPOX   | 20 (1.5%)                | 1462 (98.5%)               | 1:70

FACE image data: The CBCL Face data set was developed at the Center for Biological and Computational Learning at MIT. The largest of the three data sets, it contains 30,821 gray-scale PGM-format images of faces and non-faces (background), each 19×19 pixels in size. There are 2,901 face images (9.5%) and 28,121 non-face images (90.5%), an imbalance ratio of approximately 1:10.

Fig. 1. Example face images (left and center) and non-face image (right).

As image features we used 14 low-level pixel statistics F1–F14, which correspond to the mean and variance of the pixel values extracted at certain local regions within the image. These features represent overall pixel brightness/intensity and the contrast of a given region. Figure 2 illustrates the 14 regions where pixel statistics were used, where n is the width of the image.

Region                        | Mean | SD
Upper left square A-B-E-D     | F1   | F2
Upper right square B-C-F-E    | F3   | F4
Lower left square D-E-H-G     | F5   | F6
Lower right square E-F-I-H    | F7   | F8
“eye” region A-B-D-C          | F9   | F10
“nose” region E-F-H-G         | F11  | F12
“mouth” region I-K-L-J        | F13  | F14

Fig. 2. Features F1–F14 for the FACE data set. [The original figure shows two annotated image grids of side n and n/2 with labelled corner points A–L defining the regions above.]

For all these data sets, one half was used as the training set for learning a program classifier, and the other half was used as the test set for measuring the classification performance of the evolved program classifiers. The training and test sets were uniformly randomly chosen from these data sets so that they have a similar number of examples for each class, preserving the class imbalance ratios between training and test sets.

IV. GP APPROACH USING A STANDARD FITNESS FUNCTION

In this section we describe the GP system using standard classification accuracy as the fitness function.

We used the tree-based structure to represent genetic programs [11]. The ramped half-and-half method was used for generating programs in the initial population and for the mutation operator. Standard tournament selection, and reproduction, crossover and mutation operators were used in the evolutionary process.

A. Terminals and Functions

The terminal set forms the input to the GP system. In this approach we used two kinds of terminals: feature terminals and constant terminals. Feature terminals correspond to the features or attributes of the examples in each of the problem sets. For the SPECT, YEAST and FACE data sets, 22, 8 and 14 attributes, respectively, were used as terminals. As in other example-based learning algorithms, these terminals remain unchanged throughout the learning (evolutionary) process. Constant terminals are floating-point numbers randomly generated using a uniform distribution at the beginning of the evolutionary process.

The function set forms the non-terminal nodes of programs. In this GP system we used the four standard arithmetic operators and a conditional operator as the function set:

FunctionSet = {+, −, %, ×, if}

The +, − and × operators have their usual meanings (addition, subtraction and multiplication), while % means protected division, that is, usual division except that a divide by zero gives a result of zero. Each of these four operators takes two arguments and returns one. The conditional if function takes three arguments: if the first is negative, the second argument is returned; otherwise the third argument is returned. The if function allows a program to contain different expressions in different regions of the feature space, allowing discontinuous programs rather than insisting on smooth functions.

B. Program Output Translation and Evolutionary Parameters

The output of a genetic program in the GP system is a floating-point number. For classification problems, this single output value needs to be translated into a set of class labels.
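To make the function-set semantics concrete, here is a minimal sketch of how such programs might be represented and evaluated. The nested-tuple representation and the function names are ours, for illustration only, not the authors' implementation:

```python
def protected_div(a, b):
    # "%" in the function set: ordinary division, except divide-by-zero yields 0.
    return a / b if b != 0 else 0.0

def if_op(cond, then_val, else_val):
    # Conditional operator: returns the second argument when the first is
    # negative, and the third argument otherwise.
    return then_val if cond < 0 else else_val

def evaluate(node, features):
    # A program is a nested tuple (operator, arg, ...); an int leaf is a
    # feature terminal (index into the feature vector), a float leaf is a
    # constant terminal.
    if isinstance(node, int):
        return features[node]
    if isinstance(node, float):
        return node
    op, *args = node
    vals = [evaluate(a, features) for a in args]
    if op == '+':  return vals[0] + vals[1]
    if op == '-':  return vals[0] - vals[1]
    if op == '*':  return vals[0] * vals[1]
    if op == '%':  return protected_div(vals[0], vals[1])
    if op == 'if': return if_op(vals[0], vals[1], vals[2])
    raise ValueError("unknown operator: %r" % op)

# (if (F0 - 0.5) then F1 else F1 * 2.0)
program = ('if', ('-', 0, 0.5), 1, ('*', 1, 2.0))
print(evaluate(program, [0.2, 3.0]))   # 0.2 - 0.5 < 0, so returns F1 = 3.0
```

Note that this sketch evaluates both branches of `if` before choosing one; a production GP interpreter would typically evaluate the chosen branch only.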
As all three classification tasks have only two classes, a majority class and a minority class, we used the division between positive and negative numbers of the genetic program output as the separation between the two classes.

In the GP system, the population size was 500; the rates used for crossover, mutation and reproduction were 60%, 30% and 10%, respectively; the tournament size for selection was 7; and the maximum program depth was 6. The evolution ran for 50 generations unless an optimal solution was found earlier, in which case the evolution was terminated early.

C. Standard Fitness Function: Classification Accuracy

A standard fitness function is classification accuracy on the training set [3] [9] [12]. The classification accuracy of a genetic program classifier refers to the number of examples
for all classes in the training set that are classified correctly by the genetic program classifier, as a proportion of the total number of training examples for that classification problem. Assuming hits is the number of examples a genetic program p correctly classifies, and that the total numbers of training examples for the minority and majority classes are Nmin and Nmaj, then the overall classification accuracy of the genetic program p is:

Accuracy = hits / (Nmin + Nmaj)    (1)

V. FITNESS FUNCTIONS ON CLASS IMBALANCE DATA

In this section we demonstrate the negative effects of using a standard fitness function such as overall classification accuracy on class imbalance problems, describe a strategy to address this learning bias in the fitness function, discuss its impact on overall performance, and finally discuss the limitations of this new approach and why it can be improved.

A. Individual Class Performance in Fitness

Clearly the fitness function described in equation (1), i.e., overall accuracy, considers all the fitness cases (training examples) in both classes equally important, and does not take into account the fact that the number of examples for the minority class is much smaller than that for the majority class. To address this in GP, recent developments have involved differentiating between the classification performance of the majority and minority classes in the fitness function [15].

AveAcc = (hitsmin/Nmin + hitsmaj/Nmaj) / 2    (2)

In equation (2), the first component, hitsmin/Nmin, is the accuracy of program p on the minority class only (the proportion of minority class examples classified correctly), and the second component, hitsmaj/Nmaj, is the accuracy of program p on the majority class. The overall fitness of program p is therefore a linear combination of the two parts.
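The difference between the two fitness measures is easy to see in code. A minimal sketch of equations (1) and (2), with hypothetical hit counts of our own:

```python
def accuracy(hits_min, hits_maj, n_min, n_maj):
    # Equation (1): overall accuracy, dominated by the majority class
    # when n_maj >> n_min.
    return (hits_min + hits_maj) / (n_min + n_maj)

def ave_acc(hits_min, hits_maj, n_min, n_maj):
    # Equation (2): arithmetic mean of the per-class accuracies,
    # weighting both classes equally regardless of their sizes.
    return (hits_min / n_min + hits_maj / n_maj) / 2

# A degenerate classifier that labels everything "majority" on a 1:9 data set:
print(accuracy(0, 90, 10, 90))  # 0.9  -- looks good overall
print(ave_acc(0, 90, 10, 90))   # 0.5  -- exposes the zero minority accuracy
```

The example shows why equation (1) rewards the one-sided behaviour reported below: a classifier that ignores the minority class entirely still scores 90% overall accuracy, but only 50% under AveAcc.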

B. Experiment Results

The results using the standard fitness function, equation (1), and the fitness function designed to factor in the performance of both classes, equation (2), on the three data sets are shown in Table II. All the experiments were repeated 50 times, with a different random seed used for every run. The means and standard deviations of the 50 runs on the test set are reported. The first line shows that on the SPECT data set, the GP system using overall classification accuracy as the fitness function (Accuracy) achieved an average overall accuracy of 80.3% on all examples in the test set with a standard deviation of 2.9%, an average accuracy of 40.4% with standard deviation 21.3% on the minority class, and an average accuracy of 90.9% with standard deviation 4.0% on the majority class.

TABLE II
RESULTS (MEAN) OF GP USING STANDARD OVERALL ACCURACY ON THE FIVE CLASSIFICATION TASKS.

Data Set | Fitness Function | Overall (%) | Minority (%) | Majority (%)
SPECT    | Accuracy | 80.3 ± 2.9 | 40.4 ± 21.3 | 90.9 ± 4.0
         | AveAcc   | 75.3 ± 3.3 | 59.6 ± 9.1  | 79.5 ± 4.9
yeastMIT | Accuracy | 87.2 ± 0.9 | 45.3 ± 6.4  | 95.5 ± 1.2
         | AveAcc   | 81.3 ± 2.4 | 70.1 ± 2.8  | 83.4 ± 3.1
yeastME1 | Accuracy | 97.3 ± 0.4 | 35.1 ± 19.3 | 99.2 ± 0.4
         | AveAcc   | 96.6 ± 1.0 | 94.3 ± 6.7  | 96.6 ± 1.0
yeastPOX | Accuracy | 99.2 ± 0.1 | 62.4 ± 10.1 | 99.7 ± 0.1
         | AveAcc   | 80.0 ± 9.3 | 68.4 ± 15.0 | 80.1 ± 9.4
FACE     | Accuracy | 91.4 ± 0.2 | 15.5 ± 2.3  | 99.3 ± 0.2
         | AveAcc   | 71.0 ± 3.7 | 81.6 ± 3.7  | 69.9 ± 4.4

According to Table II, the overall accuracy on each of the five tasks is higher using the fitness function described in equation (1) than using the fitness function from equation (2), as expected. If we examine these results more closely, however, it is clear that although they appear reasonably “good looking”, they are in fact very one-sided: majority class accuracy is high but minority class accuracy is poor. This is because nearly all the examples were classified as belonging to the majority class, implying that the GP system using overall accuracy in the fitness function, equation (1), did not learn to classify the minority class examples very well for each task. The picture from the results using the fitness function from equation (2), however, is quite different. It is immediately clear that the minority class accuracy is significantly higher and that the minority and majority class performances are much more balanced for all of the tasks. This is especially noticeable for the yeastME1 problem and the FACE data set, where the minority class accuracy was improved from 35.1% and 15.5% to 94.3% and 81.6% respectively using the newer fitness function.

C. Further Analysis

The results from Table II clearly illustrate the learning bias which can occur when overall accuracy is used as the fitness function in GP and the data set is unbalanced, and how, by considering the accuracies of both classes in the fitness function, more balanced performance can be achieved with reasonable results for both minority and majority classes. This improvement is not ideal, however, as the gain in minority class accuracy has come at the expense of overall accuracy on the classification tasks for each of the data sets. This is clearly noticeable for the yeastPOX problem and the FACE data set, where the overall accuracy dropped by approximately 20%. This is because a larger number of examples belonging to the majority class are now misclassified, driving the overall performance down.
A better fitness function should be able to evolve classifiers with high and reasonably balanced minority and majority class accuracies, while still keeping overall classification performance high. We address this in the next section.

VI. NEW FITNESS FUNCTIONS FOR CLASSIFICATION WITH UNBALANCED DATA

This section describes a number of new fitness functions and presents results and analysis of their effectiveness in our GP system on the three data sets.

A. New Fitness Functions

To address the limitation discussed in section V-C, we developed four new fitness functions in the GP system to improve both the minority class performance and overall performance. These new fitness functions combine different objectives into a single fitness value.

New Fit 1. The first new fitness function is similar to the fitness function described in equation (2), except that it uses the geometric mean of the two objectives. This fitness function investigates the use of a different kind of averaging function when combining the minority and majority class accuracies into a standard scalar fitness value. The geometric mean, unlike the arithmetic mean used in equation (2), has the property that if a single factor in the mean is zero, then the geometric mean itself will also be zero. As a fitness function, the geometric mean will therefore penalise a classifier with zero percent accuracy on either class with a fitness of zero.

NewFit1 = sqrt((hitsmin/Nmin) × (hitsmaj/Nmaj))    (3)

As in equation (2), hitsmin and hitsmaj above are the numbers of correct classifications of minority class and majority class examples, respectively, and Nmin and Nmaj are the total numbers of examples in the minority and majority classes. Equation (3) follows the general definition of the geometric mean, that is, the tth root of the product of t numbers [30].

New Fit 2. The second new fitness function builds on the fitness function described in equation (2) by adding a third objective: improving the overall performance. The idea is that overall performance should be considered alongside improving the accuracy of both classes, not severely compromised by an improvement in only one class.
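A minimal sketch of equation (3), contrasted with the arithmetic mean of equation (2); the function names and hit counts are illustrative only:

```python
import math

def new_fit1(hits_min, hits_maj, n_min, n_maj):
    # Equation (3): geometric mean of the two per-class accuracies.
    # A zero accuracy on either class zeroes the whole fitness.
    return math.sqrt((hits_min / n_min) * (hits_maj / n_maj))

def ave_acc(hits_min, hits_maj, n_min, n_maj):
    # Equation (2), for comparison.
    return (hits_min / n_min + hits_maj / n_maj) / 2

# "Everything is majority" classifier on a 1:9 data set:
print(ave_acc(0, 90, 10, 90))   # 0.5 -- arithmetic mean still rewards it
print(new_fit1(0, 90, 10, 90))  # 0.0 -- zero minority accuracy zeroes the fitness
```

This illustrates the selective pressure the geometric mean adds: a degenerate classifier that ignores one class entirely receives the worst possible fitness rather than a middling one.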
This fitness function, shown in equation (4), should drive the evolutionary process towards a solution that performs well on the whole data set while keeping the accuracy of both classes high. N ewF it2 =

hitsmin hitsmaj hits + + Nmin Nmaj N

(4)

In equation (4), the first two components are the same as in equation (2), the minority class and majority class accuracies, respectively, while the third component is the overall accuracy from equation (1), that is, the number of correct classifications over the total number of examples in the training set N, where hits = hitsmin + hitsmaj and N = Nmin + Nmaj. Unlike the fitness functions from equations (1) and (2), this fitness is not “standardised” to 1 (best); rather, the optimal fitness of 3 is obtained by scoring 1 (best) on all three objectives.

New Fit 3. The third new fitness function has two more factors in addition to improving minority class and majority class accuracy, which are designed to favour program classifiers with smaller “ranges” of misclassifications (as per the Program Output Translation described in section IV-B). The idea here is that the more concise the classification model for program p is, the better its fitness [10].

NewFit3 = hitsmin/Nmin + hitsmaj/Nmaj + (1 − Rngmin) + (1 − Rngmaj)    (5)

where Rngc = (Poutmax_c + Poutmin_c) / 2

In equation (5), the third and fourth components are the scaled (0 to 1) average ranges Rngc of classifier outputs over all incorrectly classified examples for the minority and majority classes c, where Poutmax_c is the largest classifier output over all incorrectly classified examples from class c, and Poutmin_c is the smallest classifier output over all incorrectly classified examples from class c. The ranges objective is intended to differentiate fitness between two programs p1 and p2 that achieve the same classification accuracy for both the minority and majority classes (the same numbers of correct and incorrect classifications) but use different classification models. In this case, NewFit3 will rank program p1 fitter than p2 if its misclassifications generally fall closer to the required boundary region than those of p2.

New Fit 4. The fourth fitness function is based on the Wilcoxon-Mann-Whitney (WMW) statistic, which is known to be an equivalent estimator of the area under a ROC curve, without having to construct the ROC curve itself [16] [31]. The area under the ROC curve (AUC) is a useful metric for measuring classifier performance, but unlike classification accuracy, generating the AUC requires multiple performance points (performance thresholds), which is computationally costly to produce [20].

NewFit4 = ( Σi Σj I(xi, yj) ) / (Nmin × Nmaj)    (6)

where the sum runs over all minority class examples xi and all majority class examples yj, and

I(x, y) = 1 if x > 0 and x > y; 0 otherwise
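Equation (6) can be computed directly from the raw genetic program outputs, without constructing a ROC curve. A minimal sketch; the function name and the output lists are hypothetical:

```python
def new_fit4(min_outputs, maj_outputs):
    # Equation (6): Wilcoxon-Mann-Whitney statistic over all
    # (minority, majority) pairs of genetic program outputs.
    rewards = 0
    for x in min_outputs:          # outputs for minority class examples
        for y in maj_outputs:      # outputs for majority class examples
            if x > 0 and x > y:    # x classified correctly AND ranked above y
                rewards += 1
    return rewards / (len(min_outputs) * len(maj_outputs))

# Minority outputs should be positive, majority outputs negative:
print(new_fit4([2.0, 0.5], [-1.0, -0.2]))   # 1.0 -- perfect separation
print(new_fit4([2.0, -0.5], [-1.0, 1.5]))   # 0.5 -- half the pairs rewarded
```

Note the O(Nmin × Nmaj) cost of the pairwise loop; in practice the statistic can also be computed from ranks in O(N log N), but the double loop matches the definition most directly.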
Equation (6) conducts a series of pairwise comparisons, on an example-by-example basis, between minority class examples x and majority class examples y, collecting “rewards” (1 point) for those cases in which the indicator function I(x, y) is satisfied. I(x, y) enforces two constraints.

The first constraint, x > 0, requires that the minority class example x is classified correctly. Recall from the Program Output Translation described in section IV-B that a minority class example is correctly classified if the genetic program output is a positive number. The second constraint, x > y, requires that the genetic program output for minority class example x is larger than the genetic program output for majority class example y. This constraint ensures that even if majority class example y is not classified correctly (recall again that a majority class example is correctly classified if the genetic program output is a negative number), its output is still a lesser value than that of x, that is to say, y falls closer to the classification boundary than x.

The second constraint, x > y, is designed to reward minority and majority class separability in evolved genetic programs. For example, if programs p1 and p2 both classified minority class example x correctly and majority class example y incorrectly, and the second constraint holds for p1 but not for p2, then p1 will be considered fitter, as its majority class example y occurs closer to the classification boundary than minority class example x. Similar to equation (5), this constraint makes the fitness function sensitive to different program classification models, and not just program accuracy in terms of the number of correct classifications for a class.

B. Results and Analysis of New Fitness Functions

In this set of experiments, the same parameter values and termination criteria were used as in the previous experiments. Similarly, all experiments were repeated 50 times, with the means and standard deviations of those runs on the test set reported. Table III also includes the average run time (in seconds) for a single GP run using each fitness function.
For each data set, the results using fitness function AveAcc (equation (2)) from table II are included for comparison, followed by the results using the four new fitness functions. According to table III, all four new fitness functions succeeded in improving the classification performance of either the minority or majority class, and in some cases both, while still keeping performance reasonably balanced across the two classes. This suggests that these improvements to the fitness function have been beneficial on these unbalanced data sets.

The effects of the new fitness functions on performance are more noticeable on some tasks than on others. For example, overall classification performance on yeastPOX varied from 76% to 94%, with minority class accuracy between 66% and 72% and majority class accuracy between 76% and 94%. This is a good deal more variation than the same fitness functions produced on yeastME1, where improvements were on a much smaller scale (up to 4% for the minority class and only 1% for the majority class). The size of the effect therefore appears to be task dependent.

In terms of the four new fitness functions, each had a different effect on the minority and majority class performance

TABLE III
FULL RESULTS OF THE GP APPROACH USING NEW FITNESS FUNCTIONS ON THE FIVE CLASSIFICATION TASKS.

Fitness Function   Overall (%)    Minority (%)   Majority (%)   Run Time
SPECT
  AveAcc           75.3 ± 3.3     59.6 ± 9.1     79.5 ± 4.9     1.9s
  NewFit 1         75.1 ± 2.5     60.4 ± 9.1     79.0 ± 4.2     2.0s
  NewFit 2         76.4 ± 2.8     53.4 ± 6.0     82.5 ± 3.5     2.1s
  NewFit 3         72.9 ± 3.6     68.5 ± 8.7     74.1 ± 5.6     2.5s
  NewFit 4         74.6 ± 2.6     61.4 ± 8.4     78.1 ± 3.9     2.9s
yeastMIT
  AveAcc           81.3 ± 2.4     70.1 ± 2.8     83.4 ± 3.1     4.0s
  NewFit 1         82.1 ± 2.1     71.5 ± 3.3     83.9 ± 2.9     4.6s
  NewFit 2         84.5 ± 1.9     65.4 ± 3.0     88.3 ± 2.7     4.2s
  NewFit 3         80.2 ± 2.8     71.9 ± 4.3     81.8 ± 4.0     9.0s
  NewFit 4         79.9 ± 2.0     71.9 ± 2.4     81.4 ± 2.6     37.11s
yeastME1
  AveAcc           96.6 ± 1.0     94.3 ± 6.7     96.6 ± 1.0     3.9s
  NewFit 1         96.9 ± 0.8     96.3 ± 4.6     96.5 ± 1.1     6.9s
  NewFit 2         97.0 ± 0.8     95.4 ± 4.9     97.1 ± 0.8     4.0s
  NewFit 3         96.1 ± 1.2     98.1 ± 4.1     96.0 ± 1.2     9.2s
  NewFit 4         96.9 ± 0.7     94.8 ± 5.1     97.0 ± 0.7     28.29s
yeastPOX
  AveAcc           80.0 ± 9.3     68.4 ± 15.0    80.1 ± 9.4     7.3s
  NewFit 1         79.1 ± 6.4     69.8 ± 12.7    79.6 ± 6.5     7.1s
  NewFit 2         94.2 ± 4.3     66.4 ± 11.3    94.6 ± 4.4     6.7s
  NewFit 3         76.5 ± 8.6     72.6 ± 10.9    76.5 ± 8.7     11.4s
  NewFit 4         79.8 ± 5.6     65.6 ± 13.1    80.0 ± 5.7     30.28s
FACE
  AveAcc           71.0 ± 3.7     81.6 ± 3.7     69.9 ± 4.4     94.2s
  NewFit 1         70.1 ± 3.4     81.3 ± 3.8     68.8 ± 5.0     93.4s
  NewFit 2         76.5 ± 3.1     74.2 ± 2.9     76.8 ± 3.0     97.9s
  NewFit 3         68.2 ± 4.1     83.1 ± 2.6     66.6 ± 4.6     168.5s
  NewFit 4         77.3 ± 4.6     86.0 ± 6.1     76.4 ± 4.9     489.9s

TABLE IV
SUMMARY OF RESULTS ON THE FIVE CLASSIFICATION TASKS FOR EACH FITNESS FUNCTION.

Fitness Function   Overall (%)   Minority (%)   Majority (%)
AveAcc             80.8          74.8           81.9
NewFit 1           80.6          75.9           81.6
NewFit 2           85.7          70.7           87.9
NewFit 3           78.8          78.8           79.1
NewFit 4           81.7          75.9           82.6

based on their different objectives. For convenience, table IV provides a summary of the performance results presented in table III, showing the average classification performance of each fitness function across the five tasks. From tables III and IV, some general trends across the five problem sets can be observed:

1) The best overall accuracy on all five tasks was achieved by NewFit 2 or equation (4). This was attributed to high majority class performance; in all cases, this fitness function scored the best majority class accuracy. Although still reasonably good, this fitness function produced some of the lowest minority class accuracies of the four, suggesting it may be biased toward majority class performance. This is not surprising given that the classification accuracy

factor in the fitness function considers all the training examples equally important.

2) NewFit 3 or equation (5) achieved the highest minority class accuracy on four of the five tasks, and the most balanced performance across both classes, that is, the smallest difference between minority and majority class performance, on most tasks (all except yeastMIT and FACE). This suggests that the additional program range measure in the fitness function had a positive effect on classifier performance, especially for the minority class.

3) The fitness function using the geometric mean of the minority and majority class accuracy, NewFit 1 or equation (3), produced very similar results for both classes compared to equation (2), which uses the arithmetic mean. This suggests that the choice of averaging function for program fitness does not have a significant impact on performance.

4) The fitness function based on the Wilcoxon-Mann-Whitney (WMW) statistic, NewFit 4 or equation (6), achieved reasonably good and balanced performance across both the minority and majority classes; the accuracy for each class was usually higher than, or at least as high as, that of the two mean-based fitness functions, equations (2) and (3), suggesting the WMW statistic is an effective classifier measure for class imbalance problems. In all cases, however, the run times using this fitness function were significantly higher than for the other fitness functions. This was due to the overhead of calculating the WMW statistic, where pairwise comparisons between all examples are made for each individual fitness evaluation.

VII. CONCLUSIONS

The goal of this paper was to investigate the problems and recent developments in improving the fitness function in GP for classification tasks with unbalanced data sets, and to develop new fitness functions that improve minority class performance while maintaining overall classification accuracy.
The goal was successfully achieved by examining GP with both overall classification accuracy as the fitness function and an improved fitness function using individual class accuracy on three unbalanced data sets, and by constructing four new fitness functions to improve on this classification performance. The experimental results showed that minority class performance was poor when overall classification accuracy was used as the fitness function, and that the gain in minority class performance when individual class accuracy was used in the fitness function tended to come at the expense of overall classification accuracy on a given task. At least one of the four new fitness functions was shown to further increase the performance of both the minority and majority classes on some tasks, thus also increasing the overall accuracy. The new fitness function with the highest overall classification performance on all five tasks was the fitness

function using a third factor to maximise the overall accuracy while still keeping minority and majority class performance balanced. This improvement, however, came mainly from higher majority class accuracies rather than minority class accuracies. The fitness function using classifier ranges to differentiate between different classifier models scored the best minority class performances and produced the closest performances between the two classes of any of the fitness functions, suggesting it is an effective program measure. The fitness functions based on the geometric and arithmetic means of the minority and majority class accuracy produced programs with similar classification results. The fitness function using the WMW statistic achieved good and balanced performances across both the minority and majority classes, suggesting the WMW statistic is an effective classifier measure for class imbalance problems.

Different fitness functions have their own advantages in certain respects, and these new developments provide choices for different classification problems. Depending on the goal of a particular task, one can choose a suitable fitness function from these new developments to carry out GP experiments. This work only examined binary classification tasks with class imbalance. In the future, we will investigate these new fitness functions on multiple-class classification problems, as well as new methods of improving the performance of the minority and majority classes together for class imbalance problems.

ACKNOWLEDGEMENT

This work was supported in part by the Marsden Fund (08-VUW-014), administered by the Royal Society of New Zealand, and by the University Research Fund (URF09-2399/85608) at Victoria University of Wellington for 2008–2009.

REFERENCES

[1] U. M. Fayyad, G. Piatetsky-Shapiro, and P.
Smyth, “From data mining to knowledge discovery: an overview,” Advances in Knowledge Discovery and Data Mining, pp. 1–34, 1996.
[2] A. Innes, V. Ciesielski, J. Mamutil, S. John, and A. Harvey, “Landmark detection for cephalometric radiology images using genetic programming,” Proceedings of the 6th Australasian-Japan Joint Workshop on Intelligent and Evolutionary Systems, Canberra, pp. 125–132, 2002.
[3] H. Gray and R. Maxwell, “Genetic programming for classification of brain tumours from nuclear magnetic resonance biopsy spectra,” Genetic Programming 1996: Proceedings of the First Annual Conference, pp. 424–430, 1996.
[4] A. Akyol, Y. Yaslan, and O. Erol, “A genetic programming classifier design approach for cell images,” ECSQARU 2007, Lecture Notes in AI, vol. 4724, pp. 878–888, 2007.
[5] D. Song, M. I. Heywood, and A. N. Zincir-Heywood, “A linear genetic programming approach to intrusion detection,” GECCO 2003, LNCS, vol. 2724, pp. 2325–2336, 2003.
[6] D. Howard, S. Roberts, and R. Brankin, “Target detection of SAR imagery by genetic programming,” Advances in Engineering Software, vol. 30, pp. 303–311, 1999.
[7] W. A. Tackett, “Genetic programming for feature discovery and image discrimination,” Proceedings of the 5th International Conference on Genetic Algorithms, pp. 303–311, 1993.
[8] J. Eggermont, J. N. Kok, and W. A. Kosters, “Genetic programming for data classification: Partitioning the search space,” in Proceedings of the 2004 Symposium on Applied Computing (ACM SAC ’04), pp. 1001–1005, 2004.

[9] T. Loveard and V. Ciesielski, “Representing classification problems in genetic programming,” Proceedings of the 2001 Congress on Evolutionary Computation, Seoul, vol. 12, pp. 1070–1077, 2001.
[10] S. Winkler, M. Affenzeller, and S. Wagner, “Advanced genetic programming based machine learning,” Journal of Mathematical Modelling and Algorithms, vol. 6, no. 3, pp. 455–480, September 2007.
[11] J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
[12] R. Caruana, “Data mining in metric space: An empirical analysis of supervised learning performance criteria,” in Proceedings of the ROC Analysis in AI Workshop (ECAI), pp. 69–78, ACM Press, 2004.
[13] N. V. Chawla, N. Japkowicz, and A. Kolcz, “Editorial: Special issue on learning from imbalanced data sets,” ACM SIGKDD Explorations Newsletter, vol. 6, pp. 1–6, June 2004.
[14] G. M. Weiss and F. Provost, “Learning when training data are costly: The effect of class distribution on tree induction,” Journal of Artificial Intelligence Research, vol. 19, pp. 315–354, 2003.
[15] G. Patterson and M. Zhang, “Fitness functions in genetic programming for classification with unbalanced data,” Proceedings of the 20th Australian Joint Conference on Artificial Intelligence, vol. 4830, pp. 769–775, December 2007.
[16] J. Doucette and M. I. Heywood, “GP classification under imbalanced data sets: Active sub-sampling and AUC approximation,” in Proceedings of EuroGP 08, pp. 266–277, 2008.
[17] R. Barandela, J. S. Sanchez, V. Garcia, and E. Rangel, “Strategies for learning in class imbalance problems,” Pattern Recognition, vol. 36, issue 3, pp. 849–851, 2003.
[18] D. Song, M. Heywood, and A. Zincir-Heywood, “Training genetic programming on half a million patterns: an example from anomaly detection,” IEEE Transactions on Evolutionary Computation, vol. 9, pp. 225–239, June 2005.
[19] J. Eggermont, A. Eiben, and J.
van Hemert, “Adapting the fitness function in GP for data mining,” EuroGP’99, LNCS, vol. 1598, pp. 193–202, 1999.
[20] A. P. Bradley, “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern Recognition, vol. 30, pp. 1145–1159, 1997.
[21] A. McIntyre and M. Heywood, “Toward co-evolutionary training of a multi-class classifier,” in Proceedings of the 2005 IEEE Congress on Evolutionary Computation, vol. 3, pp. 2130–2137, 2005.
[22] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, issue 2, pp. 123–140, 1996.
[23] T. G. Dietterich, “Ensemble methods in machine learning,” in Multiple Classifier Systems, LNCS, vol. 1857, pp. 1–15, Springer-Verlag, 2000.
[24] A. Orriols and E. Bernado-Mansilla, “Class imbalance problem in UCS classifier system: Fitness adaptation,” The 2005 IEEE Congress on Evolutionary Computation, vol. 1, pp. 604–611, September 2005.
[25] R. Alejo, V. Garcia, J. M. Sotoca, R. Mollineda, and J. S. Sanchez, “Improving the classification accuracy of RBF and MLP neural networks trained with imbalanced samples,” in Proceedings: Intelligent Data Engineering and Automated Learning (IDEAL 2006): 7th International Conference, pp. 467–471, Springer-Verlag, September 2006.
[26] M. Kubat and S. Matwin, “Addressing the curse of imbalanced training sets: one-sided selection,” in Proceedings of the Fourteenth International Conference on Machine Learning, pp. 179–186, Morgan Kaufmann, 1997.
[27] J. H. Holmes, “Differential negative reinforcement improves classifier system learning rate in two-class problems with unequal base rates,” in Koza et al., pp. 635–642, ICSC Academic Press, 1990.
[28] C. Blake and C. Merz, “UCI repository of machine learning databases,” 1998. http://archive.ics.uci.edu/ml
[29] K.-K. Sung, Learning and Example Selection for Object and Pattern Recognition. PhD thesis, MIT, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Cambridge, MA, 1996.
[30] S. D.
Ravana and A. Moffat, “Score aggregation techniques in retrieval experimentation,” in Proceedings of the Australasian Database Conference (ACSW2009), pp. 57–65, 2009.
[31] L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz, “Optimizing classifier performance via the Wilcoxon-Mann-Whitney statistic,” in Proceedings of the Twentieth International Conference on Machine Learning (ICML’03), pp. 848–855, 2003.
