CoXCS: a coevolutionary learning classifier based on feature space partitioning

Mani Abedini and Michael Kirley
Department of Computer Science and Software Engineering
The University of Melbourne, Australia
{mabedini, mkirley}@csse.unimelb.edu.au
Abstract. Learning classifier systems (LCSs) are a machine learning technique that combines reinforcement learning and evolutionary algorithms to evolve a set of classifiers (or rules) for pattern classification tasks. Despite promising performance across a variety of data sets, the performance of LCSs is often degraded when data sets of high dimensionality and relatively few instances are encountered, a common occurrence with gene expression data. In this paper, we propose a number of extensions to XCS, a widely used accuracy-based LCS, to tackle such problems. Our model, CoXCS, is a coevolutionary multi-population XCS. Isolated sub-populations evolve a set of classifiers based on a partitioning of the feature space in the data. Modifications to the base XCS framework are introduced, including an algorithm to create the match set and a specialized crossover operator. Experimental results show that the accuracy of the proposed model is significantly better than that of other well-known classifiers when the ratio of data features to samples is extremely large.
1 Introduction
Learning Classifier Systems (LCSs) are a genetic-based machine learning technique used to solve pattern classification problems [11, 13, 2, 8]. XCS, a well-known Michigan-style model, evolves problem solutions represented by a population of classifiers [19, 20]. Each classifier consists of a condition-action-prediction rule, with a fitness value proportional to the accuracy of the prediction of the reward. Evolutionary operators are used to discover better rules that may improve the current population of classifiers. Consequently, XCS is generally able to cover the state space more efficiently than other LCSs. Although there are many papers reporting high accuracy results across a wide spectrum of classification tasks, large state spaces and relatively small sample sizes (a common occurrence with gene expression data [22]) often lead to the evolution of partly-overlapping rules, resulting in lower XCS accuracy [15]. Butz and co-workers [3] have shown that by introducing techniques that can efficiently detect building blocks in the condition part of the classifier, it may be possible to improve the performance of XCS. Specific evolutionary operators designed to help avoid the over-generalization phenomenon inherent in XCS have also been demonstrated to be useful [14]. However, there is room to further extend these
ideas and introduce modifications/enhancements enabling XCS-based models to solve gene expression classification problems.

A natural way to tackle high dimensional search problems is to adopt a "divide-and-conquer" strategy. To the best of our knowledge, decomposition approaches for XCS have been limited to the models proposed by Gershoff [7] and Richter [17]. Significantly, both of these papers report improved performance when the decomposition approach was used. A cooperative coevolutionary framework [16] may also provide a suitable approach for classification tasks. Zhu and Guan [22] report competitive performance results using a cooperative coevolution LCS. However, if all features are used in the classification process, the excessive computational cost reduces the efficiency/effectiveness of the model. Feature selection provides an alternative approach to help deal with high dimensional data. For gene expression data, techniques that rank genes according to their differential expressions among phenotypes, or techniques based on information gain ranking and principal component analysis, can be used [21].

In this paper, we propose a number of extensions to XCS to solve complex classification tasks. Our model, CoXCS, is fundamentally a coevolutionary model. Here, a number of isolated sub-populations are used to evolve classifiers based on a partitioning of the feature (or attribute) space. A modified version of XCS is used in each of the sub-populations. We introduce a specialization technique for reducing the number of attributes active during the learning phase, and a specificity crossover operator. Detailed computational experiments using a suite of benchmark data sets clearly show that the proposed model is comparable with other classification techniques. Significantly, the performance of the proposed model is better than that of other models when the ratio of data features to samples is extremely large.

The remainder of this paper is organized as follows: In Section 2 we present background material related to XCS and multi-population implementations. In Section 3 our model is described in detail. This is followed by a description of the experiments and results. We conclude the paper in Section 5 with a discussion of the results and identify future research directions.
2 Background and related work

2.1 XCS overview
XCS is widely accepted as one of the most reliable learning classifier systems for data mining. We provide a brief overview of XCS functionality in this subsection. Space constraints preclude us from providing a detailed discussion of XCS; further details can be found in Wilson's original paper [19] and related papers (e.g., [15, 20, 4]). XCS maintains a population of classifiers (see Fig. 1). Each classifier consists of a condition-action-prediction rule, which maps input features to the output signal (or class). A ternary representation of the form {0, 1, #} (where # is "don't care") for the condition and {0, 1} for the action can be used. In addition, real encoding can also be used to accurately describe the environment states [20].
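As a minimal illustration of ternary matching, consider the following Python sketch. It is our illustration rather than the authors' code: a condition matches an input state if every non-# position agrees.

    def matches(condition: str, state: str) -> bool:
        """A ternary condition matches a binary input if every
        non-'#' position agrees with the input bit."""
        return all(c == '#' or c == s for c, s in zip(condition, state))

    assert matches("01#1", "0111")       # '#' matches either bit value
    assert not matches("01#1", "1111")   # first bit disagrees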
Fig. 1. XCS model overview. The condition segment of the classifier consists of a vector of features, each encoded using real or binary values. The output signal (prediction class) is a binary value in this case. The classifier’s fitness value is proportional to the accuracy of the prediction of the reward. See text for further explanation.
Input, in the form of data instances (a vector of features), is passed to the XCS. A match set [M] is created consisting of rules (classifiers) that can be "triggered" by the given data instance. A covering operator is used to create new matching classifiers when [M] is empty. A prediction array is calculated for [M] that contains an estimation of the corresponding rewards for each of the possible actions. Based on the values in the prediction array, an action, a (the output signal), is selected. In response to a, the reinforcement mechanism is invoked and the prediction, p, prediction error, $\varepsilon$, accuracy, $\kappa$, and fitness, F, of the classifier are updated via the following equations:

$$p \leftarrow p + \beta(R - p)$$

$$\varepsilon \leftarrow \varepsilon + \beta(|R - p| - \varepsilon)$$

where $\beta$ is the learning rate ($0 < \beta < 1$). The classifier accuracy is calculated from the following equations:

$$\kappa = \begin{cases} 1 & \text{if } \varepsilon < \varepsilon_0 \\ \alpha \left( \varepsilon / \varepsilon_0 \right)^{-\nu} & \text{otherwise} \end{cases} \qquad \text{and} \qquad \kappa' = \frac{\kappa}{\sum_{x \in [A]} \kappa_x}$$

Finally, the classifier fitness, F, is updated using the relative accuracy value $\kappa'$:

$$F \leftarrow F + \beta(\kappa' - F)$$

It is important to note that the classifier fitness is updated based on the accuracy of the actual reward prediction. This accuracy-based fitness provides a mechanism for XCS to build a complete action map of the input space.
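A hedged Python sketch of these updates, using the paper's symbols, is given below. The parameter values and the Classifier fields are illustrative defaults, not the paper's settings.

    from dataclasses import dataclass

    # Illustrative parameter values, not the paper's settings.
    BETA, EPS_0, ALPHA, NU = 0.2, 10.0, 0.1, 5.0

    @dataclass
    class Classifier:
        p: float = 10.0   # reward prediction
        eps: float = 0.0  # prediction error (epsilon)
        k: float = 0.0    # accuracy (kappa)
        F: float = 0.01   # fitness

    def update_action_set(action_set, R):
        """Apply the reinforcement updates to every classifier in [A]."""
        for cl in action_set:
            delta = R - cl.p
            cl.p += BETA * delta                    # p   <- p + beta (R - p)
            cl.eps += BETA * (abs(delta) - cl.eps)  # eps <- eps + beta (|R - p| - eps)
            cl.k = 1.0 if cl.eps < EPS_0 else ALPHA * (cl.eps / EPS_0) ** (-NU)
        k_sum = sum(cl.k for cl in action_set)      # denominator of relative accuracy k'
        for cl in action_set:
            cl.F += BETA * (cl.k / k_sum - cl.F)    # F <- F + beta (k' - F)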
A key component of XCS is the evolutionary computation module. During the evolutionary process, fitness-proportionate selection is used to guide the selection of parents (classifiers in the population), who generate new offspring via crossover and mutation. A bounded population size is typically used. Consequently, a form of niching is used to determine if the offspring is added to the population and/or which of the old members of the population are deleted to make room for the new classifier (offspring). The deletion of classifiers is biased towards those with larger action set sizes and lower fitness.
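A simplified sketch of this deletion bias follows: classifiers occupying large action sets receive more deletion votes, and classifiers with below-average fitness have their votes inflated. The exact vote used by XCS differs in detail (see Butz and Wilson [5]); the attribute names here are illustrative only.

    import random

    def select_for_deletion(population, delta=0.1):
        """Roulette-wheel deletion biased towards large action sets
        and low fitness (a simplified, illustrative vote)."""
        mean_fitness = sum(cl.F for cl in population) / len(population)

        def vote(cl):
            v = cl.action_set_size              # bias towards large action sets
            if cl.F < delta * mean_fitness:     # bias towards low fitness
                v *= mean_fitness / max(cl.F, 1e-9)
            return v

        weights = [vote(cl) for cl in population]
        return random.choices(population, weights=weights)[0]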
2.2 Multi-population XCS models
Dam et al. [6] proposed an XCS-based client/server distributed data mining system. Each client had its own XCS, which evolved a set of classifiers using a local data repository. The server then combined the models with its own XCS and attempted to find a set of classifiers to help explain patterns incorrectly classified locally by the clients. The performance of the model was evaluated using benchmark problems, focussing on network load and communication costs. The results suggested that the distributed XCS model was competitive as a distributed data mining system, particularly when the epoch size increased.

In a similar study, a multi-population parallel XCS for classification of electroencephalographic signals was introduced by Skinner et al. [18]. The specific focus of that study was to investigate the effectiveness of migration strategies between sub-populations mapped to ring topologies. They reported that the parameter setting of the multi-population model had a significant effect on the resulting classifier accuracy.

An alternative approach for solving a classification task is to incorporate a decomposition strategy into the model. For example, Gershoff et al. [7] attempted to improve global XCS performance via a hierarchical partitioning scheme. An agent in the model was assigned to each partition, which contained a collection of homogeneous XCS classifiers. The predicted output signal (class) was then estimated using a voting mechanism. This output signal, with a confidence score, was then passed up the hierarchy to a controlling agent. This agent then decided the final output of the system based on the combined output from each of the sub-populations it was responsible for. Gershoff et al. report improved performance in the limited domain tested. Richter [17] introduced an extended XCS model, where a series of lower level problems were solved. These results were then combined into a global result for the given problem. Improved performance was noted on the limited range of test problems used in the study. In such an approach, different sub-problem formulations will have a significant impact on the performance of the distributed system.

A recent model employing a cooperative coevolutionary classifier system was introduced by Zhu and Guan [22]. In this fine-grained approach, individuals in isolated sub-populations encoded if-then rules for each feature in the data set. As such, the decomposition was taken to the extreme. Individuals were used to classify the partially masked training data corresponding to the feature in focus.
However, this particular approach required a two-step process (a concurrent global and local evolutionary process) in order to generate satisfactory accuracy levels. For data sets with a large number of features (attributes), such fine-grained modelling is computationally expensive.
3 Model
CoXCS is fundamentally a coevolutionary parallel learning classifier based on feature space partitioning. Fig. 2 provides a high-level schematic overview of the system. Within the CoXCS model, the features contained in the data set are partitioned across a set of n sub-populations. The condition segment of each classifier in a given sub-population is initialized using this fixed subset of the features (of size λ) from the data set being processed. Importantly, the sub-populations evolve separately, but with a common objective. As such, each isolated sub-population accumulates and specializes its expertise across a subset of the input space. Bounded sub-population sizes are used as per the standard XCS model. When a new classifier is added to a sub-population and the size limit is reached, a randomly selected classifier (based on a niching technique) is deleted from the sub-population.

Migration episodes are also used to exchange classifiers between sub-populations. After a fixed number of iterations, randomly selected classifiers migrate to a different sub-population based on a random migration topology. It is important to note that the mutation operator does not destroy the inherent building blocks within the immigrant classifiers.

Two important modifications are proposed for the XCS model running in each of the sub-populations. Firstly, a new covering operator is used to create the match set [M] (see Algorithm 1). This operator builds single-feature classifiers (the remaining features are set to #) for each of the features present in the given partition.
Fig. 2. High level overview of the CoXCS model. Each isolated sub-population evolves solutions based on a partitioning of the feature space (of size λ) using a separate XCS. Randomly selected classifiers migrate between sub-populations.
Algorithm 1 CreateMatchSet()
Require: input: a vector of features X ∈ {x0, x1, . . . , xnλ}
         action: a value for the expected output/class (e.g. 0 or 1)
         s, e: array index addresses – start (e.g. 0) and end (e.g. λ)
 1: initialize match set [M]
 2: for i = s to e do
 3:   create new classifier rule
 4:   rule.setCondition(i, input[i])
 5:   rule.setAction(action)
 6:   [M].add(rule)
 7: end for
 8: return [M]
Algorithm 2 SpecCrossover()
Require: p1, p2: two randomly selected parents (classifiers)
         len: the length of the condition segment in the parents (len = λ)
 1: create new classifier child
 2: for i = 0 to len do
 3:   if p1.hasCondition(i) AND ! p2.hasCondition(i) then
 4:     child.setCondition(i, p1.getCondition[i])
 5:   else if ! p1.hasCondition(i) AND p2.hasCondition(i) then
 6:     child.setCondition(i, p2.getCondition[i])
 7:   else if p1.hasCondition(i) AND p2.hasCondition(i) then
 8:     ∆ ← p1.getCondition[i] ∩ p2.getCondition[i]
 9:     if ∆ ≠ null then
10:       child.setCondition(i, ∆)
11:     end if
12:   end if
13: end for
14: return child (the new classifier)
An important distinction between our model and XCS is the fact that we create λ classifiers, which are added to the bounded population of each partition. In contrast, XCS would create only one classifier. This approach allows the evolutionary search to slowly build up more specialized classifiers via the genetic operators and the reinforcement learning mechanism.

Secondly, we introduce a specialized crossover operator, which generates valid offspring (classifiers) across the range of feature encodings used (see Algorithm 2). In the case of nominal and binary features, if the feature appears in either parent, the feature is copied to the child. For real-valued features, the center-spread ranges are examined; the corresponding common range of the feature in both parents is then copied to the child. A standard mutation operator is then applied to the child.

In our implementation, we use a variable length hybrid real-integer encoding for each classifier. A sparse vector representation, indexed to the feature value,
is used. The # values are not stored. This approach is used both to speed up computations and to minimize memory use. In addition, this approach provides a flexible means to concatenate classifiers generated from different populations (initial partitions of the feature space) after migration episodes and crossover operations. The predicted output class for the classification task is based on majority voting among the predictions of all CoXCS sub-populations.
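To make the sparse representation and the crossover of Algorithm 2 concrete, the following minimal Python sketch (our notation, not the authors' code) assumes real-valued features stored as a dict mapping feature index to a (lower, upper) interval, with absent indices standing for #:

    def spec_crossover(p1: dict, p2: dict) -> dict:
        """Combine two sparse conditions; features present in only one
        parent are copied, overlapping intervals are intersected."""
        child = {}
        for i in set(p1) | set(p2):
            if i in p1 and i not in p2:
                child[i] = p1[i]              # feature only in parent 1
            elif i in p2 and i not in p1:
                child[i] = p2[i]              # feature only in parent 2
            else:                             # feature in both: intersect
                lo = max(p1[i][0], p2[i][0])
                hi = min(p1[i][1], p2[i][1])
                if lo <= hi:                  # keep only a non-empty common range
                    child[i] = (lo, hi)
        return child

    # Parents constrain feature 3 to [0.2, 0.6] and [0.4, 0.9];
    # the child keeps the common range [0.4, 0.6].
    child = spec_crossover({3: (0.2, 0.6), 7: (0.1, 0.5)}, {3: (0.4, 0.9)})
    assert child == {3: (0.4, 0.6), 7: (0.1, 0.5)}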
4 Experiments
A series of experiments was conducted to validate our approach. The underlying hypothesis tested was that classification using the CoXCS model would lead to improved accuracy, particularly for high dimensional data sets.

4.1 Data sets
A range of data sets displaying different characteristics was used for evaluation. wbc (Breast Cancer Wisconsin - Original), wpbc (Breast Cancer Wisconsin - Prognostic), wdbc (Breast Cancer Wisconsin - Diagnostic) and hepatitis were taken from the UC Irvine Machine Learning Repository [1]. Two gene expression data sets were also included: BRCA (sporadic breast cancer gene profiles) [10, 12] and Prostate (prostate cancer gene profiles) [12].

4.2 Methodology
Model parameters For all experiments, a hybrid feature encoding scheme was used. The parameter settings for our modified XCS were based on the default XCS settings recommended in [5]. The parameter values that differed were: population sizes of 3000 (UCI data sets) or 5000 (gene expression data sets); an exploration/exploitation rate of 0.3; and a reward value of 1000. The partitioning scheme used was a simple equal linear division of the feature space. In this study, we have employed a simple rule: the number of partitions (and thus sub-populations) is n = ⌊0.1 × #Features⌋ for a given data set. The migration ratio was set to 10% of the population size. Five separate migration stages were used, where the number of iterations between migration episodes was fixed at 100.

Validation and performance measures Ten-fold cross validation was used for the data sets taken from the UCI Repository. The small number of instances (samples) in the gene expression data sets restricted evaluation to two-fold cross validation. In order to compare the performance of our model with other classifier systems, we report results based on the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC), a widely used measure in machine learning [9]. AUC values vary between 0 and 1, where 0.5 represents random classification and 1 represents the highest accuracy.
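The equal linear partitioning rule above can be sketched in a few lines of Python; the guard ensuring at least one partition is our assumption for data sets with fewer than ten features.

    def partition_features(num_features: int):
        """Split feature indices 0..num_features-1 into n contiguous,
        roughly equal-sized partitions, with n = floor(0.1 * #Features)."""
        n = max(1, num_features // 10)        # assumed guard: at least 1 partition
        base, extra = divmod(num_features, n)
        parts, start = [], 0
        for i in range(n):
            size = base + (1 if i < extra else 0)
            parts.append(list(range(start, start + size)))
            start += size
        return parts

    # Example: the wpbc data set (33 features) yields n = 3 partitions of size 11.
    assert [len(p) for p in partition_features(33)] == [11, 11, 11]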
Table 1. Data set details. The gene expression data sets are characterized by a small number of instances and a very large number of features. All data sets have two output classes.

Data Set    #Instances  #Features  %Majority  %Missing
wbc                699          9       0.65      0.23
wpbc               198         33       0.76      0.06
wdbc               569         30       0.62         –
hepatitis          155         19       0.85      5.30
BRCA                22       3226       0.68         –
Prostate            21      12600       0.61         –

4.3 Results
Table 2 lists the AUC results for each of the data sets considered, for a variety of different classifiers. The non-XCS classifier results were generated using the Weka package. The relative performance of the base-line XCS and the other classifiers was very similar. It is interesting to note that accuracy levels were generally very low for the gene expression data sets (BRCA and Prostate). The accuracy of CoXCS was generally better than that of the other classifiers across the data sets. CoXCS performance was significantly better (p < 0.01, 15 trials) for problems where the ratio of the number of features to instances was extremely large.
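For completeness, the AUC measure reported in Table 2 can be computed as follows. This is illustrative only: the paper's baselines were run in Weka, and scikit-learn is used here purely as a convenient stand-in for the same measure.

    from sklearn.metrics import roc_auc_score

    y_true = [0, 0, 0, 1, 1, 1]                # ground-truth classes
    y_score = [0.1, 0.6, 0.3, 0.8, 0.4, 0.9]   # classifier confidence for class 1
    print(roc_auc_score(y_true, y_score))      # ~0.89; 0.5 = random, 1.0 = perfect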
5 Discussion and conclusion
There are many examples reported in the literature illustrating the effectiveness of LCSs, and the accuracy-based XCS in particular, for data mining tasks. However, there are still many open questions related to improving classification accuracy when confronted with problems of high dimensionality, a small number of data instances, noisy data and multiple classes.

In this paper, we have proposed enhancements to XCS to improve classification accuracy on binary classification tasks where the ratio of data features to samples is extremely large. In CoXCS, isolated sub-populations were used to evolve classifiers based on an initialization mechanism using a subset of features. Two modifications were made to the base XCS model running in each island: a new algorithm was used to create the match set, and a specialized crossover operator was used. This "divide-and-conquer" strategy encourages the evolution of specialized classifiers and allows us to maximize the advantages of the embedded reinforcement learning mechanism in XCS.

Detailed experimental studies show that CoXCS is comparable with, and in many cases outperforms, other well-known classifiers across the suite of benchmark data sets used for evaluation. The results suggest that the decomposition strategy plays an important role in guiding the trajectory of the evolving populations. Here, we have limited the decomposition to a naive approach. In future work,
Table 2. AUC results. Bold values indicate that the CoXCS model was significantly better when compared to all of the other classifiers.

Classifier              Mode    wbc   wdbc   wpbc  hepatitis  BRCA  Prostate
j48                     Train  0.98   0.99   0.93       0.91  0.92      1.00
                        Test   0.95   0.93   0.59       0.70  0.35      0.42
NBTree                  Train  0.99   0.99   0.79       0.97  1.00      1.00
                        Test   0.98   0.95   0.55       0.81  0.45      0.46
Random Forest           Train  1.00   1.00   1.00       1.00  1.00      1.00
                        Test   0.98   0.98   0.63       0.84  0.29      0.33
Neural Networks         Train  0.99   0.99   0.98       0.94  0.50      0.50
                        Test   0.98   0.99   0.68       0.81  0.50      0.50
Logistic Regression     Train  0.99   1.00   0.94       0.94  1.00      0.50
                        Test   0.99   0.97   0.77       0.80  0.56      0.50
Naive Bayes Classifier  Train  0.98   0.98   0.72       0.91  0.99      1.00
                        Test   0.98   0.98   0.64       0.83  0.50      0.35
SVM                     Train  0.97   0.93   0.50       0.54  1.00      1.00
                        Test   0.96   0.93   0.50       0.51  0.53      0.38
XCS                     Train  0.99   0.99   0.97       1.00  0.50      0.50
                        Test   0.97   0.93   0.70       0.72  0.50      0.50
CoXCS                   Train  0.99   1.00   1.00       0.97  1.00      1.00
                        Test   1.00   0.99   0.98       0.96  0.80      0.75
it would be interesting to examine alternative techniques to detect variable interactions that exist in a problem, and subsequently make use of this “expert knowledge” when partitioning the feature space. There is also scope to examine the effectiveness of distributed deployment and alternative migration policies using a suite of micro-array data classification problems.
References

1. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/.
2. L. Bull and T. Kovacs, editors. Foundations of Learning Classifier Systems, volume 183 of Studies in Fuzziness and Soft Computing. Springer, 2005.
3. M. Butz, M. Pelikan, X. Llorà, and D. E. Goldberg. Automated global structure extraction for effective local building block processing in XCS. Evolutionary Computation, 14(3):345–380, 2006.
4. M. V. Butz, T. Kovacs, P. L. Lanzi, and S. W. Wilson. Toward a theory of generalization and learning in XCS. IEEE Transactions on Evolutionary Computation, 8(1):28–46, 2004.
5. M. V. Butz and S. W. Wilson. An algorithmic description of XCS. In P. L. Lanzi, W. Stolzmann, and S. W. Wilson, editors, Advances in Learning Classifier Systems, volume 1996 of Lecture Notes in Computer Science, pages 267–274. Springer, Berlin, 2001.
6. H. H. Dam, H. A. Abbass, and C. Lokan. DXCS: an XCS system for distributed data mining. In Proceedings of the 2005 Conference on Genetic and Evolutionary Computation (GECCO-05), pages 1883–1890. ACM Press, 2005.
7. M. Gershoff and S. Schulenburg. Collective behavior based hierarchical XCS. In Proceedings of the 2007 Genetic and Evolutionary Computation Conference (GECCO-07), pages 2695–2700. ACM Press, 2007.
8. D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, Mass., 1989.
9. J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, April 1982.
10. I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, O. P. Kallioniemi, B. Wilfond, A. Borg, and J. Trent. Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine, 344(8):539–548, February 2001.
11. J. H. Holland, L. B. Booker, M. Colombetti, M. Dorigo, D. E. Goldberg, S. Forrest, R. L. Riolo, R. E. Smith, P. L. Lanzi, W. Stolzmann, and S. W. Wilson. What is a learning classifier system? In P. L. Lanzi, W. Stolzmann, and S. W. Wilson, editors, Learning Classifier Systems: From Foundations to Applications, volume 1813 of LNAI, pages 3–32. Springer-Verlag, Berlin, 2000.
12. M. M. Hossain, M. R. Hassan, and J. Bailey. ROC-tree: A novel decision tree induction algorithm based on receiver operating characteristics to classify gene expression data. In Proceedings of the SIAM International Conference on Data Mining, pages 455–465, Atlanta, Georgia, USA, April 2008.
13. T. Kovacs. Two views of classifier systems. In P. L. Lanzi, W. Stolzmann, and S. W. Wilson, editors, Advances in Learning Classifier Systems, volume 2321 of LNAI, pages 74–87. Springer-Verlag, Berlin, 2002.
14. P. L. Lanzi. A study of the generalization capabilities of XCS. In T. Bäck, editor, Proceedings of the 7th International Conference on Genetic Algorithms, pages 418–425. Morgan Kaufmann, 1997.
15. P. L. Lanzi, W. Stolzmann, and S. W. Wilson, editors. Learning Classifier Systems: From Foundations to Applications, volume 1813 of Lecture Notes in Artificial Intelligence. Springer-Verlag, Berlin, 1st edition, 2000.
16. M. A. Potter and K. A. De Jong. Cooperative coevolution: An architecture for evolving coadapted subcomponents. Evolutionary Computation, 8(1):1–29, 2000.
17. U. Richter, H. Prothmann, and H. Schmeck. Improving XCS performance by distribution. In Simulated Evolution and Learning, 7th International Conference, volume 5361 of Lecture Notes in Computer Science, pages 111–120. Springer, 2008.
18. B. Skinner, H. Nguyen, and D. Liu. Distributed classifier migration in XCS for classification of electroencephalographic signals. In 2007 IEEE Congress on Evolutionary Computation, pages 2829–2836. IEEE Press, 2007.
19. S. W. Wilson. Classifier fitness based on accuracy. Evolutionary Computation, 3(2):149–175, 1995. http://prediction-dynamics.com/.
20. S. W. Wilson. Get real! XCS with continuous-valued inputs. In P. L. Lanzi, W. Stolzmann, and S. W. Wilson, editors, Learning Classifier Systems: From Foundations to Applications, volume 1813 of Lecture Notes in Computer Science, pages 209–222. Springer, 1999.
21. Y. Zhang and J. C. Rajapakse, editors. Machine Learning in Bioinformatics. Wiley Series in Bioinformatics. Wiley, 1st edition, 2008.
22. F. Zhu and S. Guan. Cooperative co-evolution of GA-based classifiers based on input decomposition. Engineering Applications of Artificial Intelligence, 21:1360–1369, 2008.