
Evolving Classifier Ensemble with Gene Expression Programming

Qu Li, Weihong Wang, Shanshan Han
College of Software Engineering, Zhejiang University of Technology
Hangzhou, China, 310032
{liqu,wwh,hanss}@zjut.edu.cn

Jianhong Li
Dept. of Maths and Physics, Hebei Normal Univ. of Sci. and Tech.
Qinhuangdao, China, 066004
[email protected]

Abstract

Gene expression programming (GEP) is a kind of genotype/phenotype based Evolutionary Computation (EC) algorithm. GEP has been successfully applied in Data Mining (DM) fields such as regression, classification and association rule mining. Although GEP has been used as a raw DM tool in these fields, its potential to be combined with DM techniques has not been well studied in either the DM or the EC field. In this paper, two ensemble methods, namely bagging and boosting, together with other DM tools available on the Weka platform, are applied to improve the learning ability of GEP classifiers. Results show that the two popular ensemble methods can improve the classification accuracy of raw GEP classifiers. Moreover, bagging outperforms boosting in GEP classifier learning.

1 Introduction

Gene Expression Programming (GEP) is a kind of evolutionary algorithm [3]. GEP combines the fixed-length linear chromosomes of GAs [11] with the ramified structures of different sizes and shapes used to represent computer programs in GP [8]. This successful combination of two methods based on different entities enables GEP to enjoy both the simplicity of GA and the flexibility of GP. GEP uses fixed-length strings as its genotypes. The genotypes are later expressed as phenotypes, i.e., Expression Trees (ETs) of different sizes and shapes. In GEP, a chromosome is composed of a predefined number of genes. Each gene is divided into a head part and a tail part. When multiple genes in a chromosome are expressed, they are connected by linking functions such as arithmetical or boolean functions. This separation of genotype and phenotype, together with the structural organization of the chromosomes, allows unconstrained genetic modification while always producing valid expression trees. GP has long been criticized for its code bloat problem [16], and much work has been done to reduce the side effects of bloat. GEP does not have this problem because it uses fixed-length chromosomes.

GEP has been used to solve many kinds of hard problems such as regression, optimization, logic synthesis and cellular automata. It has also been successfully applied in DM fields [6] such as time series analysis [3][18], classification [3, 9, 10, 17], and association rule mining; results show that GEP's performance is competitive with, or even superior to, that of some traditional DM methods. The goal of ensemble learning methods is to construct a collection of individual classifiers and obtain an improved composite model by voting the decisions of the individual classifiers in the ensemble. Ensemble learning operates by taking a base learning algorithm and running it many times with different training sets. The classifiers in the ensemble are applied to a classification task and their individual outputs are combined to create a single classification from the ensemble as a whole. Bagging [1] and boosting [5] are two well-known methods for constructing ensembles by resampling. Bagging builds bags of data of the same size as the original data set by applying random sampling with replacement. Boosting also resamples the original data set with replacement, but weights are assigned to each training sample. After a classifier is learned, the weights are updated so that the subsequent classifier pays more attention to the training samples that were misclassified by the previous classifier. The final classifier combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy. Both boosting and bagging are generic techniques that can be employed with any base classification technique. In general, current studies show that bagging is more consistent, increasing the error of the base learner less frequently than boosting does. However, boosting appears to have a greater average effect, leading to substantially larger error reductions than bagging on average. The performance of boosting is more dependent on the data set being examined, whereas bagging shows much less correlation. The strong correlations for boosting may be partially explained by its sensitivity to noise.

GEP has been proved to be effective and powerful in classification rule mining. However, the GEP classifier is not a stable one, because of its randomization. In order to reduce this side effect (as with most genetic classifiers), several independent runs of the algorithm are usually made to obtain its average accuracy. Since an ensemble is a method to enhance a base classifier by running an algorithm several times, this motivates us to introduce ensembles into the evolution of genetic classifiers. The only difference is that the runs are independent in the traditional EC learning process, while in ensemble learning these runs are inherently correlated. The remaining sections of this paper are organized as follows. In Section 2, we give a brief description of GEP classifiers, explaining the main differences between several existing GEP classifiers. In Section 3, we give a brief description of the two most popular ensemble methods, i.e., bagging and boosting. In Section 4, we describe our way of constructing GEP classifier ensembles, and Section 5 presents the experiments and results on GEP classifier ensembles. The final section draws conclusions and outlines some directions for future work.

2 Basic Gene Expression Programming Classifiers

2.1 Gene Expression Programming

Gene Expression Programming is a kind of evolutionary algorithm [3]. It uses fixed-length strings as its genotypes. The genotypes are later expressed as phenotypes, i.e., Expression Trees (ETs) of different sizes and shapes. In GEP, a chromosome is composed of a predefined number of genes. Each gene is divided into a head part and a tail part. The size of the head (h) is chosen by the user, and the size of the tail (t) is computed as t = h(n − 1) + 1, where n is the maximum arity in the function set for the particular problem. A gene can be mapped into an expression tree level by level (breadth-first) and can be further written in mathematical form. When multiple genes in a chromosome are expressed, they are connected by linking functions such as arithmetical or boolean functions. In this way, several ETs can be combined into a bigger ET. For example, a GEP chromosome consisting of two genes could be as follows:

+.-./.a.a.c.b.d.a | *.-.+.d.a.c.b.a.a

The two ETs are connected by plus, and thus a bigger ET is formed. The unused part of each gene (called the non-coding region) provides room for genetic variation. GEP uses several genetic operators: replication, mutation, recombination and transposition. These operators act on the chromosomes to create new populations.
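To make the genotype-to-phenotype mapping concrete, the following is a minimal sketch (ours, not the authors' implementation) of how a single Karva-encoded gene is decoded into an expression tree. It assumes a function set of binary arithmetical operators only, so the tail-length rule t = h(n − 1) + 1 applies with n = 2; the class name KarvaDecoder and the single-character encoding are illustrative choices. The example in main decodes gene 1 of the chromosome above, with the separating dots dropped.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

public class KarvaDecoder {

    // Arities of the function symbols; terminals (a, b, c, d, ...) have arity 0.
    static final Map<Character, Integer> ARITY =
            Map.of('+', 2, '-', 2, '*', 2, '/', 2);

    static class Node {
        char symbol;
        Node left, right;               // at most two children in this sketch
        Node(char symbol) { this.symbol = symbol; }
    }

    // Read the Karva string level by level: the first symbol is the root, and each
    // function symbol consumes as many of the following symbols as its arity.
    static Node decode(String karva) {
        Node root = new Node(karva.charAt(0));
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(root);
        int next = 1;
        while (!queue.isEmpty()) {
            Node current = queue.poll();
            int arity = ARITY.getOrDefault(current.symbol, 0);
            if (arity >= 1) { current.left  = new Node(karva.charAt(next++)); queue.add(current.left); }
            if (arity >= 2) { current.right = new Node(karva.charAt(next++)); queue.add(current.right); }
        }
        return root;                    // symbols beyond 'next' form the non-coding region
    }

    // Render the tree as an infix expression, e.g. "((a-a)+(c/b))".
    static String toInfix(Node n) {
        if (n.left == null) return String.valueOf(n.symbol);
        return "(" + toInfix(n.left) + n.symbol + toInfix(n.right) + ")";
    }

    public static void main(String[] args) {
        // Gene 1 of the example chromosome; the last two symbols "da" are non-coding.
        System.out.println(toInfix(decode("+-/aacbda")));   // prints ((a-a)+(c/b))
    }
}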


Figure 1. Expression trees encoded by the two genes of the example chromosome, linked by plus: (a) ET1 = (a − a) + (c/b); (b) ET2 = (d − a) ∗ (c + b); (c) ET = ET1 + ET2.

From the above example, we can see that in the genotype space, when genetic operators modify a gene, what varies is not the length of the gene but the length of the Karva expression. In the phenotype space, the sizes and shapes of the ETs vary according to the change of the corresponding Karva expression. We can also see that GEP's fixed-length string representation spares it from the problem of bloat. The genotype-phenotype mapping enables GEP to be more efficient than GA and GP in global search.

2.2 GEP Classifiers

Currently, there are several kinds of GEP classifiers. We give a brief description of these classifiers below. Ferreira [3] used a multigene system and the head-and-tail method in her classifier. The fitness of an individual program corresponds to the number of instances it can correctly classify. Ferreira's method uses a multigene system to ensure the efficiency of the system and the head-and-tail method to ensure the validity of chromosomes, making it more efficient and powerful than GP classifiers. The classifier system decomposes a multiclass classification problem into multiple two-class problems by using the one-against-all learning method. The basic GEP classifier can only handle numerical attributes. In Ferreira's new book [4], decision trees evolved by GEP also perform well on problems with both numeric and nominal attributes. Zhou [17] used a single-gene system and a validity test to ensure the validity of chromosomes. Zhou used completeness and consistency gain to evaluate the fitness of rules. The multiclass classification problem is also formulated as multiple two-class problems by using the one-against-all learning method.

The covering strategy is applied to learn multiple rules for each class where applicable. Compact rule sets are subsequently evolved using a two-phase pruning method based on the minimum description length (MDL) principle and the integration theory. Their approach is also noise tolerant and able to deal with both numeric and nominal attributes. Weinert [14] used a different individual encoding to meet the closure requirement of the logical and rational functions in their GEPCLASS system. GEPCLASS uses logical and rational functions instead of arithmetical functions to solve classification problems. It also uses variable-length chromosomes that can have one or more genes with heads and tails of different lengths. Another difference between GEPCLASS and other GEP classifiers is that it uses the product of sensitivity and specificity as its fitness function (although this fitness function is also available in [3]). Marghny [9, 10] enhanced the original GEP technique by using logical operators instead of mathematical ones in the chromosome representation and validity evaluation, which allows unconstrained search of the genome space while still ensuring the validity of the program's output. The authors also applied the enhanced model to microarray data to extract logical classification rules. They further proposed a new GEP algorithm for discovering logical fuzzy classification rules, and the rules extracted by GEP are much more accurate than the rules discovered by C4.5 and C4.5Rule.
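To make two of the fitness measures mentioned above concrete, here is a small sketch (our illustration, not taken from any of the cited systems) for a two-class problem: Ferreira's number of correctly classified instances ("hits") and the GEPCLASS-style product of sensitivity and specificity, both computed from the predictions of a candidate rule on the training set (labels: 1 = positive class, 0 = negative class).

public class FitnessSketch {

    // Ferreira-style fitness: the number of instances the rule classifies correctly.
    static int hits(int[] predicted, int[] actual) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) if (predicted[i] == actual[i]) correct++;
        return correct;
    }

    // GEPCLASS-style fitness: sensitivity * specificity from the confusion counts.
    static double sensitivityTimesSpecificity(int[] predicted, int[] actual) {
        int tp = 0, tn = 0, fp = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == 1) { if (predicted[i] == 1) tp++; else fn++; }
            else                { if (predicted[i] == 1) fp++; else tn++; }
        }
        double sensitivity = (tp + fn == 0) ? 0.0 : (double) tp / (tp + fn);
        double specificity = (tn + fp == 0) ? 0.0 : (double) tn / (tn + fp);
        return sensitivity * specificity;
    }
}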

3 Bagging and Boosting

Following [13], this section explains the basic ideas of Bagging and Boosting. The data for learning systems are supposed to consist of attribute-value vectors or instances. Both Boosting and Bagging manipulate the training data in order to generate different classifiers.

3.1 Bagging

Bagging produces replicate training sets by sampling with replacement from the training instances. For each trial t = 1, 2, ..., T, a training set of size N is sampled with replacement from the original instances. This training set is the same size as the original data, but some instances may not appear in it while others appear more than once. The learning system generates a classifier C^t from the sample, and the final classifier C* is formed by aggregating the T classifiers from these trials. To classify an instance x, a vote for class k is recorded by every classifier for which C^t(x) = k, and C* is then the class with the most votes (ties being resolved arbitrarily).
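A minimal from-scratch sketch of this procedure follows (our illustration, not the paper's code); the Model and BaseLearner interfaces are hypothetical placeholders for whatever base classifier is bagged.

import java.util.Random;

public class BaggingSketch {

    interface Model { int classify(double[] instance); }
    interface BaseLearner { Model train(double[][] data, int[] labels); }

    // Train T models, each on a bootstrap sample of size N drawn with replacement.
    static Model[] bag(double[][] data, int[] labels, BaseLearner learner, int T, long seed) {
        Random rng = new Random(seed);
        int n = labels.length;
        Model[] committee = new Model[T];
        for (int t = 0; t < T; t++) {
            double[][] sampleData = new double[n][];
            int[] sampleLabels = new int[n];
            for (int i = 0; i < n; i++) {
                int j = rng.nextInt(n);            // sampling with replacement
                sampleData[i] = data[j];
                sampleLabels[i] = labels[j];
            }
            committee[t] = learner.train(sampleData, sampleLabels);
        }
        return committee;
    }

    // The composite classifier C*: each member casts one equal vote
    // (ties are broken here by the lowest class index).
    static int vote(Model[] committee, double[] instance, int numClasses) {
        int[] counts = new int[numClasses];
        for (Model m : committee) counts[m.classify(instance)]++;
        int best = 0;
        for (int k = 1; k < numClasses; k++) if (counts[k] > counts[best]) best = k;
        return best;
    }
}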

3.2 Boosting

Boosting uses all instances at each repetition, but maintains a weight for each instance in the training set that reflects its importance. Adjusting the weights causes the learner to focus on different instances and thus leads to different classifiers. When voting to form a composite classifier, each component classifier has the same vote in Bagging, whereas Boosting assigns different voting strengths to component classifiers on the basis of their accuracy. Let w_x^t denote the weight of instance x at trial t. Initially, w_x^1 = 1/N for every x. At each trial t = 1, 2, ..., T, a classifier C^t is constructed from the given instances under the distribution w^t, as if the weight w_x^t of instance x reflects its probability of occurrence. The error ε^t of this classifier is also measured with respect to the weights and consists of the sum of the weights of the instances that it misclassifies. If ε^t is greater than 0.5, the trials are terminated and T is altered to t − 1. Conversely, if C^t correctly classifies all instances, so that ε^t is zero, the trials terminate and T becomes t. Otherwise, the weight vector w^{t+1} for the next trial is generated by (1) multiplying the weights of the instances that C^t classifies correctly by the factor β^t = ε^t / (1 − ε^t) and (2) renormalizing so that Σ_x w_x^{t+1} = 1. The boosted classifier C* is obtained by summing the votes of the classifiers C^1, C^2, ..., C^T, where the vote of classifier C^t is worth log(1/β^t) units. Freund and Schapire [5] introduced a boosting algorithm called AdaBoost, which can theoretically reduce the error of any learning algorithm significantly, provided its performance is a little better than random guessing. In many studies, both bagging and boosting have been shown to be very effective. Bagging generates diverse classifiers only if the base learning algorithm is unstable; bagging can be viewed as a way of exploiting this instability to improve classification accuracy. AdaBoost requires less instability than bagging, because AdaBoost can make much larger changes in the training set by placing large weights on only a few of the examples.
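The weight-update rule above can be summarized in a short sketch (ours, not from the paper); the BaseLearner interface is a hypothetical stand-in for whatever base classifier is boosted, and the sketch returns only the voting weights log(1/β^t), a full implementation would also keep the trained classifiers.

import java.util.Arrays;

public class BoostingSketch {

    interface BaseLearner {
        // Train on the data using the given instance weights and return the
        // predicted class label for each instance.
        int[] trainAndPredict(double[][] data, int[] labels, double[] weights);
    }

    // Returns the voting weight log(1/beta_t) of each committee member;
    // members after early termination keep weight 0.
    static double[] boost(double[][] data, int[] labels, BaseLearner learner, int T) {
        int n = labels.length;
        double[] w = new double[n];
        Arrays.fill(w, 1.0 / n);                        // w_x^1 = 1/N
        double[] votes = new double[T];
        for (int t = 0; t < T; t++) {
            int[] pred = learner.trainAndPredict(data, labels, w);
            double err = 0.0;                           // eps^t: weighted error
            for (int i = 0; i < n; i++) if (pred[i] != labels[i]) err += w[i];
            if (err > 0.5 || err == 0.0) break;         // terminate the trials
            double beta = err / (1.0 - err);            // beta^t = eps^t / (1 - eps^t)
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                if (pred[i] == labels[i]) w[i] *= beta; // shrink correctly classified instances
                sum += w[i];
            }
            for (int i = 0; i < n; i++) w[i] /= sum;    // renormalize so the weights sum to 1
            votes[t] = Math.log(1.0 / beta);            // vote strength of classifier C^t
        }
        return votes;
    }
}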

4 BagGEP and BoostGEP


Although weka [15] has been widely accepted and utilized by the DM community, it seems that it has not been well exploited by EC learning researchers. Because weka provides a user-friendly interface and standard I/O packages, it is easy and convenient to embed genetic classifiers in this open-source platform. In this paper, we implemented a GEP classifier in weka. Since the basic GEP classifier is able to solve binary classification problems without much change, we use it directly for binary-class problems. For multi-class problems, we use the MultiClassClassifier in the weka.classifiers.meta package as a meta classifier; MultiClassClassifier also solves multi-class problems with the one-against-all strategy. In order to build GEP classifier ensembles, we use the GEP classifier as the base classifier and Bagging and AdaBoostM1 as meta classifiers, i.e., we get BagGEP from the combination of Bagging and GEP, and BoostGEP from AdaBoostM1 and GEP. For missing values and the binarization of nominal and boolean attributes, we use the default strategies in the weka.core package, so the algorithm is able to handle problems with missing values and nominal, boolean and numerical attributes.
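As an illustration of this wiring, the sketch below builds the two meta classifiers with the Weka API. Since our GEP classifier is not part of the standard Weka distribution, J48 is used here as a stand-in base classifier; the exact nesting of MultiClassClassifier inside Bagging/AdaBoostM1 and the file name iris.arff are assumptions made for the example, not details taken from the paper.

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.Bagging;
import weka.classifiers.meta.MultiClassClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GEPEnsembleSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();   // illustrative path
        data.setClassIndex(data.numAttributes() - 1);

        // One-against-all wrapper for multi-class problems.
        MultiClassClassifier base = new MultiClassClassifier();
        base.setClassifier(new J48());          // stand-in for the GEP base classifier

        // BagGEP analogue: bagging with 10 iterations.
        Bagging bag = new Bagging();
        bag.setClassifier(base);
        bag.setNumIterations(10);

        // BoostGEP analogue: AdaBoost.M1 with 10 iterations.
        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(base);
        boost.setNumIterations(10);

        // Five-fold cross-validation, as in the experiments.
        for (Classifier c : new Classifier[] { bag, boost }) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 5, new Random(1));
            System.out.println(c.getClass().getSimpleName() + ": "
                    + String.format("%.2f%%", eval.pctCorrect()));
        }
    }
}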

5 Experiments and Results

In order to test the effectiveness and features of the BagGEP and BoostGEP classifiers, we ran experiments on five UCI datasets [2]. These datasets represent different kinds of learning tasks, from binary-class problems (breast-w, heart-c, diabetes) to multi-class problems (3 classes for iris, 6 classes for zoo), and from problems without missing values to problems with missing values (breast-w, heart-c). Datasets with too many missing values, or extremely unbalanced problems on which BagGEP and BoostGEP perform almost as well as Basic GEP, are not considered in this paper. For GEP as the base classifier, we used a population size of 100, a maximum of 300 generations, 3 genes per chromosome, a head size of 8, and five-fold cross-validation for all problems. For each problem, we chose a function set consisting of mathematical and logical operators, e.g., +, −, ∗, /, sqrt and if, where if is a logical comparison function with three arguments (x, y, z) that returns y if x > 0 and z otherwise. For other settings of the genetic operators please refer to [3]. For BagGEP and BoostGEP, the number of iterations was set to 10; other settings were left at their defaults in weka version 3.5.3. Due to the randomization of GEP, five independent runs were performed for all experiments to obtain the average as well as the best and worst results. Although it is not customary to report the worst result in a paper, we do so here to show the stability of genetic learning systems from another perspective. Table 1 shows the detailed results of the algorithms on each data set. From Table 1, we can easily see that BagGEP and BoostGEP perform better than Basic GEP on all datasets (1% to 6% average accuracy improvement on each dataset). For all datasets, the worst and best accuracy as well as the average accuracy of BagGEP and BoostGEP are higher than (or at least equal to) the corresponding values of Basic GEP. More specifically, BagGEP gained an average improvement of more than 1% on breast-w, diabetes and iris, and more than 3% on heart-c. Among the five datasets, BagGEP led to the highest accuracy improvement on the zoo dataset (more than 7%). BoostGEP gained a slight improvement on breast-w, diabetes and iris (less than 1%), and more than 3% on heart-c. Among the five datasets, BoostGEP also led to the highest accuracy improvement on the zoo dataset (more than 6%).

6 Conclusions and Future Work

In this paper, we implemented a GEP classifier on the weka platform. We then constructed GEP classifier ensembles using our GEP classifier as the base classifier. Results show that both kinds of classifier ensembles, BagGEP and BoostGEP, perform consistently better than the base GEP classifier. This means that bagging and boosting, together with other DM tools available on the weka platform, can be used to enhance genetic classifiers. To give a general picture of the detailed experimental results, we summarize them as follows:

1. All methods presented in this paper were able to solve a variety of binary and multi-class problems with missing values and boolean, nominal and numerical attributes.

2. The GEP classifier ensembles, i.e., both BagGEP and BoostGEP, outperformed the base GEP classifier on all datasets.

3. BagGEP and BoostGEP performed similarly on all datasets, i.e., the gap between BagGEP, BoostGEP and Basic GEP on each dataset is similar.

4. BagGEP performed better than BoostGEP on all datasets.

Although BagGEP and BoostGEP performed better than Basic GEP on all datasets, BoostGEP did not achieve higher accuracy than BagGEP as we had expected. Part of the reason may be the unstable nature of the GEP classifier, and bagging is more powerful than boosting in exploiting this instability to improve classification accuracy. However, this hypothesis needs further study for GEP classifiers. In order to understand the working mechanism of BagGEP and BoostGEP, we will perform a bias-variance decomposition of the error to analyze their performance. Future work also includes further study of GEP on other problems such as regression, association rule mining and cost-sensitive learning, which are supported by the weka platform.

Table 1. Experimental results of GEP classifier ensembles (classification accuracy in %).

           Basic GEP                     BagGEP                        BoostGEP
Dataset    Worst  Best   Average        Worst  Best   Average        Worst  Best   Average
breast-w   94.28  95.14  94.59±0.38     95.42  95.99  95.74±0.21     94.28  95.71  95.19±0.61
diabetes   67.45  69.27  68.62±0.80     69.27  70.18  69.71±0.33     67.45  71.48  69.32±1.72
heart-c    72.94  79.87  77.63±2.75     78.88  81.85  80.66±1.11     79.87  82.51  81.12±1.05
iris       94.00  95.33  94.93±0.60     96.00  96.67  96.53±0.30     94.67  98.67  95.87±1.66
zoo        83.17  86.14  85.15±1.40     91.09  94.06  92.48±1.13     91.09  92.08  91.49±0.54

Acknowledgment

This research was supported by the National Natural Science Foundation of China under Grant No. 60473024, the Natural Science Foundation of Zhejiang Province under Grants No. Y105394 and Z105391, and the Research Foundation of the Zhejiang Education Administration under Grant No. 20060815. We would like to thank Mr. Yang Yu from Nanjing University for his kind help with programming on the weka platform and his comments on the manuscript.

References

[1] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

[2] C. Blake, E. Keogh, C. J. Merz. UCI repository of machine learning databases. University of California, Department of Information and Computer Science, Irvine, CA.

[3] C. Ferreira. Gene expression programming: a new adaptive algorithm for solving problems. Complex Systems, 13(2):87-129, 2001.

[4] C. Ferreira. Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. 2nd edition, Springer-Verlag, Germany, 2006.

[5] Y. Freund, R. Schapire. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, 148-156, 1996.

[6] J. Han, M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, 2001.

[7] H. Iba. Bagging, boosting, and bloating in genetic programming. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-99), 1053-1060, Orlando, Florida, July 1999. Morgan Kaufmann.

[8] J. R. Koza. Genetic Programming. MIT Press, Cambridge, MA, 1992.

[9] M. H. Marghny, I. E. El-Semman. Extracting fuzzy classification rules with gene expression programming. In Proceedings of the International Conference on Artificial Intelligence and Machine Learning (AIML 2005), Cairo, Egypt, 2005.

[10] M. H. Marghny, I. E. El-Semman. Extracting logical classification rules with gene expression programming: microarray case study. In Proceedings of the International Conference on Artificial Intelligence and Machine Learning (AIML 2005), Cairo, Egypt, 2005.

[11] M. Mitchell. An Introduction to Genetic Algorithms. MIT Press, 1996.

[12] T. M. Mitchell. Machine Learning. McGraw Hill, USA, 1997.


[13] J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the 13th National Conference on Artificial Intelligence, 725-730, 1996.

[14] W. R. Weinert, H. S. Lopes. GEPCLASS: a classification rule discovery tool using gene expression programming. In Proceedings of the 2nd International Conference on Advanced Data Mining and Applications, 871-880, 2006.

[15] I. H. Witten, E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, USA, 2000.

[16] B.-T. Zhang, H. Muhlenbein. Balancing accuracy and parsimony in genetic programming. Evolutionary Computation, 3(1):17-38, 1995.

[17] C. Zhou, W. Xiao, P. C. Nelson, T. M. Tirpak. Evolving accurate and compact classification rules with gene expression programming. IEEE Transactions on Evolutionary Computation, 7(6):519-531, 2003.

[18] J. Zuo, C. Tang, C. Li, A. Chen, C. Yuan. Time series prediction based on gene expression programming. In Proceedings of the Fifth International Conference on Web-Age Information Management, Dalian, China, 2004.
