Condition Matrix Based Genetic Programming for Rule Learning

Jin Feng Wang, Kin Hong Lee, Kwong Sak Leung
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong SAR, China
Email: {jfwang, ksleung, khlee}@cse.cuhk.edu.hk

Abstract

Most genetic programming paradigms are population-based and require a huge amount of memory. In this paper, we review Instruction Matrix based Genetic Programming, which maintains all program components in an instruction matrix (IM) instead of manipulating a population of programs. A genetic program is extracted from the matrix just before it is evaluated, and after each evaluation the fitness of the genetic program is propagated to its corresponding cells in the matrix. We then extend the instruction matrix to the condition matrix (CM) for generating a rule base from datasets. CM keeps some of the characteristics of IM and incorporates information about rule learning. In the evolving process, we adopt an elitist strategy to keep the better rules alive to the end. Since genetic selection may lead to a very large rule set, the reduct concept borrowed from Rough Sets is used to cut the volume of rules while keeping the same fitness as the original rule set. In the experiments, we compare the performance of Condition Matrix for Rule Learning (CMRL) with other traditional algorithms. Results are presented in detail, and the competitive advantages and drawbacks of CMRL are discussed.
1. Introduction

Genetic Programming (GP) evolves programs to solve problems as a branch of Evolutionary Computation (EC) [1]. The schema theory [2] behind Genetic Algorithms is well known in EC, and it has also been applied to GP [3][4]. A new GP architecture, instruction matrix based GP (IMGP), was proposed in [5]. It extracts individuals from an instruction matrix, executes the traditional genetic operations, and updates the instruction matrix accordingly. The population is thus concealed in the instruction matrix, which keeps the schema information. In this paper, after reviewing IMGP, we extend the instruction matrix to a condition matrix for evolving rules, which are used to describe the existing data and
classify the unknown data.
Rule learning systems share common characteristics: they operate on attribute-value databases, search for rules with the highest accuracy, and have a similar structure. Rule accuracy is the probability that a rule's consequent is true given that its antecedents hold, and it is the main fitness measure in our Condition Matrix based Genetic Programming. The general operations, such as crossover and mutation, are executed to update the matrix in each generation of the new algorithm. In addition to these operations, we apply an elitist algorithm to this architecture to keep the better rules.
The remaining sections are organized as follows. An introduction to rule learning and the application of GP to classification is presented in Section 2. IMGP is described in Section 3, including the initialization of the matrix and the genetic operators. A new algorithm, CMRL, is proposed in Section 4 as an extension of IMGP. Section 5 presents an improvement of CMRL. Section 6 presents the experimental results and a brief analysis. Finally, we conclude the paper with a summary of results and a discussion of further research.
2. Background knowledge

2.1. Rule learning

Rule learning is a well-developed area in machine learning for classification, which is a major task in data mining. Many rule learning algorithms have been developed and refined; they learn rules from an observed data set in order to classify unknown data, and they share a similar structure. A rule learning system has an input and an output: the input is the training dataset, and the output is a rule set that classifies the training data with high accuracy. A rule consists of two components, the antecedent and the consequence, as follows:

If A1 ∧ A2 ∧ … ∧ An Then Class is Ci
where the rule includes n antecedents (A1, A2, …, An) and Ci is the class to which the object is assigned. Each antecedent states that one attribute takes one value. We need to obtain a whole rule set that classifies the data set with high accuracy, where accuracy is the ratio of objects classified correctly in the dataset.
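To make this representation concrete, the following minimal C++ sketch (illustrative names only, not the paper's implementation) stores a rule as a list of attribute-value antecedents plus a class label, together with the matching test used throughout the paper:

#include <string>
#include <vector>

// One antecedent "attribute = value", e.g. Outlook = Sunny.
struct Antecedent {
    std::string attribute;
    std::string value;
};

// A rule "If A1 and A2 and ... and An Then Class is Ci".
struct Rule {
    std::vector<Antecedent> antecedents;
    std::string consequent;   // the class Ci
};

// A rule covers an object when every antecedent holds for it.
bool covers(const Rule& r, const std::vector<Antecedent>& object) {
    for (const auto& a : r.antecedents) {
        bool holds = false;
        for (const auto& o : object)
            if (o.attribute == a.attribute && o.value == a.value) { holds = true; break; }
        if (!holds) return false;
    }
    return true;
}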
2.2. Application of Genetic Programming to classification

Genetic Programming has been applied to rule learning with some success. One main advantage of Genetic Programming is its ability to construct functional trees of variable length, which enables the search for very complex solutions. Complex intelligent structures, such as fuzzy rule-based systems or decision trees, have therefore already become outputs of genetic programming [6]. Genetic programming does not rely on local search, so it can produce solutions of high quality, but incorporating complex intelligent structures into genetic programming function sets is not straightforward. A genetic parallel programming architecture has been proposed to construct classification programs for some binary-class data sets [7]. A hybrid GP-decision tree algorithm was designed to generate the individual nodes in a decision tree, and it obtained better results than canonical GP [8].
3. IMGP: Instruction matrix based Genetic Programming

IMGP uses a fixed-length expression to represent a program tree, employs a matrix instead of a population, and extracts programs from the matrix. A program uses a new representation, the hs-expression, instead of the s-expression of canonical GP [1]. All possible instructions of a program constitute the instruction matrix (IM), which encodes the population in IMGP. The matrix height is equal to the length of the hs-expression, and its width is the number of functions and terminals in a row. The related definitions can be found in [5].
3.1. Algorithm architecture

First, an individual is extracted from the matrix and stored in an hs-expression. Individuals extracted from the instruction matrix are new and need to be evaluated. The fitness evaluation follows the standard method of traversing the tree recursively in post-order. The new fitness is then fed back to the individual's instructions rather than to the individual itself, so that each instruction remembers its own fitness, which is used in later extractions. After extracting and evaluating new individuals, two new individuals are
generated by performing crossover on two individuals. These offspring are also evaluated, and their fitness is fed back to their instructions as their parents’ are. After evaluating a certain number of individuals, we shuffle each row in the matrix in turn. Finally, all new individuals are destroyed and the next generation begins.
3.2. Operators of IMGP for rule learning

The operators used in IMGP are similar to those of canonical tree-based GP, such as crossover and mutation. Three new operators are designed to handle the new representations of the individual and the population: individual extraction, instruction evaluation and matrix shuffle.

3.2.1. Individual extraction. Before selecting the first node, an empty hs-expression filled with -1 is constructed and aligned with the matrix vertically. Extraction starts from the first row to determine the root, which is also the first element of the hs-expression. Each node in the tree is extracted from the corresponding row in the matrix and placed at the corresponding position in the hs-expression. Two cells are randomly taken from the same row, and one column label is put at the corresponding location in the hs-expression according to fitness. A smaller fitness represents a better instruction; therefore, the smaller the instruction's fitness is, the more likely it is to be selected.

3.2.2. Instruction evaluation. As a new individual is destroyed before the next round, its fitness cannot be assigned to the individual. Instead, it is fed back to its instructions so that each instruction remembers its historical fitness, which is used in future extractions. During the feedback process, the system performs two more functions.
1). Average the new fitness with the instruction's old fitness according to the following equation, in which the evaluation number is then incremented by one:

fitness = (old_fitness × evaluation_number + new_fitness) / (evaluation_number + 1)
2). If the new fitness is better than the instruction’s best fitness, the best fitness is updated. At the same time, the instruction’s left pointer and right pointer are exchanged in the current individual. This actually maintains good building sub-trees in the matrix, and they can be used to extract new individuals. 3.2.3. Matrix shuffle. In IMGP, new individuals are randomly extracted from the matrix, so the chance of combining good building blocks is rather slim. Therefore, we use matrix shuffle to spread good instructions so that they have sufficient opportunities to meet each other.
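As a minimal sketch of this instruction-evaluation feedback (the struct and function names are illustrative assumptions, not taken from the IMGP code), each matrix cell can keep a running-average fitness, its evaluation count and the best fitness seen so far, with lower values meaning better fitness:

// Hypothetical cell of the instruction matrix.
struct Cell {
    int    instruction;           // function or terminal id
    double fitness     = 0.0;     // running average of the fitness fed back so far
    double bestFitness = 1e100;   // best (lowest) fitness seen so far
    long   evaluations = 0;       // number of evaluations averaged into 'fitness'
};

// Feedback of one evaluated individual's fitness to one of its instructions:
// fitness = (old_fitness * evaluation_number + new_fitness) / (evaluation_number + 1)
void feedback(Cell& c, double newFitness) {
    c.fitness = (c.fitness * c.evaluations + newFitness) / (c.evaluations + 1);
    ++c.evaluations;
    if (newFitness < c.bestFitness) {
        c.bestFitness = newFitness;
        // In IMGP the cell's sub-tree pointers are also updated here (item 2 above).
    }
}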
4. Condition matrix based GP for rule learning --- the extension of IMGP
IMGP solves only two-class problems when applied to classification. To relieve this restriction, we extend IMGP to condition matrix based Genetic Programming for rule learning. In this framework, a matrix is employed instead of a population, and rules are extracted from the matrix. It is an extension of IMGP for rule learning, and its components are described below.
We take 'Saturday morning's weather', first used by Quinlan [9], as a simple example. The learning task is to classify the weekend activity into class P or class N according to the given weather conditions, which are represented by four attributes as shown in Table 1.
4.1. Condition matrix
Similar to the instruction matrix, a new matrix for generating rules is constructed. All cells are filled with conditions, each of which has a chance to be selected to become an antecedent or the conclusion of a rule. Before establishing the condition matrix, we first give three assumptions as criteria for producing the cells of the condition matrix:
1. One rule is produced by each matrix search;
2. In each rule, every attribute appears at most once;
3. Every condition of every attribute, including the decision attribute, has a chance to be selected, in order to keep physical diversity.
Based on the above assumptions, a crisp environment is considered, so the size of each row equals the number of values of all attributes and classes, and the number of rows equals the number of attributes plus one; the width and height of the matrix are therefore fixed. We will introduce the algorithm using one example in the following subsections. Suppose the dataset has n attributes A(1), A(2), …, A(n) to be selected. For each k (1 ≤ k ≤ n), attribute A(k) takes mk linguistic values T1(k), T2(k), …, Tmk(k), and A(n+1) represents the decision attribute, which takes m values T1(n+1), T2(n+1), …, Tm(n+1). So the width of the matrix is the total number of attribute values and classes, and the height of the matrix is the number of attributes plus 1 according to assumption 3, i.e.

width of condition matrix = ∑_{k=1}^{n} m_k + m
height of condition matrix = n + 1
Table 1. Dataset about Saturday morning's weather

No.  Outlook   Temperature  Humidity  Windy  Class
1    Sunny     Hot          High      False  N
2    Sunny     Hot          High      True   N
3    Overcast  Hot          High      False  P
4    Rain      Mild         High      False  P
5    Rain      Cool         Normal    False  P
6    Rain      Cool         Normal    True   N
7    Overcast  Cool         Normal    True   P
8    Sunny     Mild         High      False  N
9    Sunny     Cool         Normal    False  P
10   Rain      Mild         Normal    False  P
11   Sunny     Mild         Normal    True   P
12   Overcast  Mild         High      True   P
13   Overcast  Hot          Normal    False  P
14   Rain      Mild         High      True   N
From Table 1 we can see that the attributes take the following values: Outlook = {Sunny, Overcast, Rain}; Temperature = {Hot, Mild, Cool}; Humidity = {High, Normal}; Windy = {True, False}; Class = {P, N}. According to the assumptions, a condition matrix can be constructed as in Table 2, in which 'O' = Outlook, 'T' = Temperature, 'H' = Humidity, 'W' = Windy and 'C' = Class. We can extract rules from Table 2 according to some criteria and transcribe all rules into one standard rule set, as shown in Figure 1.
Table 2. Condition matrix

O=Sunny  O=Overcast  O=Rain  T=Hot  T=Mild  T=Cool  H=High  H=Normal  W=True  W=False  C=P  C=N
O=Sunny  O=Overcast  O=Rain  T=Hot  T=Mild  T=Cool  H=High  H=Normal  W=True  W=False  C=P  C=N
O=Sunny  O=Overcast  O=Rain  T=Hot  T=Mild  T=Cool  H=High  H=Normal  W=True  W=False  C=P  C=N
O=Sunny  O=Overcast  O=Rain  T=Hot  T=Mild  T=Cool  H=High  H=Normal  W=True  W=False  C=P  C=N
O=Sunny  O=Overcast  O=Rain  T=Hot  T=Mild  T=Cool  H=High  H=Normal  W=True  W=False  C=P  C=N
(Extraction columns, one selected condition per matrix row, with -1 marking unused rows)
↓ Transcribe
Rule 1: If O=Sunny and T=Hot and H=High and W=True Then C=P;
Rule 2: If any time Then C=N;
……
Rule n: If W=False and T=Cool Then C=P.

Figure 1. Rules extracted from the condition matrix
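As a small illustration (a sketch under assumed data structures; the names are not the paper's), the condition matrix of Table 2 can be built directly from the attribute domains, which also shows how the width and height formulas above come out for the weather example:

#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct Condition { std::string attribute, value; double entropy; };

int main() {
    // Attribute domains; the last entry is the decision attribute (assumption 3).
    std::vector<std::pair<std::string, std::vector<std::string>>> domains = {
        {"Outlook",     {"Sunny", "Overcast", "Rain"}},
        {"Temperature", {"Hot", "Mild", "Cool"}},
        {"Humidity",    {"High", "Normal"}},
        {"Windy",       {"True", "False"}},
        {"Class",       {"P", "N"}}            // decision attribute
    };

    // One row holds every condition; width = sum of m_k over the attributes + m.
    std::vector<Condition> row;
    for (const auto& d : domains)
        for (const auto& v : d.second)
            row.push_back({d.first, v, 0.0});

    // height = n + 1 identical rows (the entropies are filled in later by f1).
    std::size_t height = domains.size();
    std::vector<std::vector<Condition>> conditionMatrix(height, row);

    std::cout << "height = " << height << ", width = " << row.size() << '\n';
    // Prints: height = 5, width = 12
}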
4.2. Fitness functions

4.2.1. Fitness of each condition. The fitness of each condition is used to determine whether it can be selected as an antecedent. We design a function using information entropy, borrowed from Information Theory [10], which is a measure of the "degree of doubt": the higher it is, the more doubt there is about the possible conclusion. Each attribute can classify the dataset with some information entropy. The entropy of the j-th value of attribute A is defined as:

entropy_{a_j} = - ∑_{i=1}^{m} p(c_i | a_j) log2 p(c_i | a_j)        (f1)
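A small helper (an illustrative sketch, not the paper's code) estimates this entropy for a condition a_j from the training objects that satisfy it:

#include <cmath>
#include <map>
#include <string>
#include <vector>

// One training object: attribute -> value, plus its class label.
struct Object {
    std::map<std::string, std::string> values;
    std::string cls;
};

// entropy_{a_j} = - sum_i p(c_i | a_j) * log2 p(c_i | a_j)
double conditionEntropy(const std::vector<Object>& data,
                        const std::string& attribute, const std::string& value) {
    std::map<std::string, int> classCount;
    int covered = 0;
    for (const auto& o : data) {
        auto it = o.values.find(attribute);
        if (it != o.values.end() && it->second == value) {
            ++classCount[o.cls];
            ++covered;
        }
    }
    if (covered == 0) return 0.0;   // condition never holds in the data
    double h = 0.0;
    for (const auto& kv : classCount) {
        double p = static_cast<double>(kv.second) / covered;
        h -= p * std::log2(p);
    }
    return h;
}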
4.2.2. Fitness of each rule. After the matrix is searched once, one rule is extracted; more rules can be generated in a generation by repeating this operation. Not all rules are good enough to be kept for the next generation, and each rule makes its own contribution to the classification. We prefer rules of short length, measured by the number of antecedents. The following formula is taken as the fitness to evaluate the quality of a rule:

Fitness_j = factor_j / l_j        (f2)

where l_j is the length of the j-th rule and factor_j is the certainty factor [11] of the j-th rule.
4.2.3. Fitness of each rule set. The aim of our algorithm is to find an optimal rule set with high performance using Genetic Programming. To know when the algorithm can be stopped, we need a criterion for evaluating the current rule set, so the classification accuracy is adopted as the fitness of a rule set.
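The two fitness measures can be sketched as follows (illustrative code; in particular, the certainty factor is approximated here by the rule's confidence on the training data, whereas the paper uses the MYCIN-style certainty factor of [11], and the first-match classification policy is an assumption):

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct SimpleRule {
    std::vector<std::pair<std::string, std::string>> antecedents;  // (attribute, value)
    std::string consequent;
};
struct Sample {
    std::vector<std::pair<std::string, std::string>> values;
    std::string cls;
};

bool ruleCovers(const SimpleRule& r, const Sample& s) {
    for (const auto& a : r.antecedents) {
        bool holds = false;
        for (const auto& v : s.values) if (v == a) { holds = true; break; }
        if (!holds) return false;
    }
    return true;
}

// f2: Fitness_j = factor_j / l_j (shorter, more certain rules score higher).
double ruleFitness(const SimpleRule& r, const std::vector<Sample>& train) {
    int covered = 0, correct = 0;
    for (const auto& s : train)
        if (ruleCovers(r, s)) { ++covered; if (s.cls == r.consequent) ++correct; }
    double factor = covered ? static_cast<double>(correct) / covered : 0.0;
    std::size_t length = r.antecedents.empty() ? 1 : r.antecedents.size();  // avoid /0 for empty antecedents
    return factor / length;
}

// Rule-set fitness: classification accuracy on a dataset.
double ruleSetAccuracy(const std::vector<SimpleRule>& rules, const std::vector<Sample>& data) {
    int correct = 0;
    for (const auto& s : data)
        for (const auto& r : rules)
            if (ruleCovers(r, s)) { if (s.cls == r.consequent) ++correct; break; }
    return data.empty() ? 0.0 : static_cast<double>(correct) / data.size();
}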
4.3. Operations

4.3.1. Extraction. Antecedents must be selected one by one from each row of the matrix until a consequence is met.
Here, two conditions are first chosen at random from the same row as candidates. The antecedent is then determined using binary tournament selection [12], which maintains strong growth under normal conditions. According to the meaning of entropy, the lower it is, the better the condition is, so we select the condition with the lower entropy. If the selected element is an attribute condition, the selection continues in the next row of the matrix. The whole process can be summarized in two steps.
• Step 1: Extract one condition from the first row using binary tournament selection;
• Step 2: Continue to select further antecedents of the rule from the successive rows until a consequence is selected.
The second step is a dynamic process, meaning that the training data changes as the rule is being produced: when an antecedent is determined, the objects that do not have the same value of that attribute are removed from the training data.

4.3.2. Crossover. Crossover in our system is executed between two rule sets. Two rule sets are extracted from the condition matrix and then crossed over according to a specified crossover probability: one rule is taken from each rule set and the two rules exchange positions. The accuracy of the rule set is used as the fitness to evaluate the performance of the current individual; if the accuracy is improved after crossover, the operation is accepted. The crossover process is shown in Figure 2.

R1: {r1, r2, ……, ri, ……, rm}        R2: {l1, l2, ……, lj, ……, ln}
                  ⇕ exchange ri and lj
R1': {r1, r2, ……, lj, ……, rm}        R2': {l1, l2, ……, ri, ……, ln}
Figure 2. The crossover process

4.3.3. Matrix shuffle. In GA, mutation produces new individuals. In IMGP, however, individuals are hidden in the instruction matrix, so mutation cannot be executed on them directly. A matrix shuffle, similar in spirit to mutation, is adopted instead; it spreads good conditions over the matrix so that they have more opportunities to be selected. The shuffle process is summarized as follows. First, we define two thresholds, Diversity and Convergence: Diversity guarantees that there are enough candidates for evolution, and Convergence prevents a better condition from appearing too many times, which would lead to premature convergence. Second, two conditions are randomly taken from the same row, and the condition with the lower entropy replaces the one with the higher entropy. The replacement is carried out only when two conditions are satisfied: i) the number of occurrences of the condition with higher entropy is larger than Diversity; ii) the number of occurrences of the condition with lower entropy is smaller than Convergence. Finally, these operations are repeated for all rows according to the mutation probability.

4.3.4. Elitist. A new rule is usually deleted after evaluation. However, in each generation the best rules are regarded as elitists, which are stronger than the other surviving rules. We keep them alive in a rule pool using the elitist algorithm [13]. The elitist algorithm promotes the convergence of the whole evolving process and helps to accelerate the discovery of optimal results.
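A compact sketch of the extraction step of Section 4.3.1 (illustrative names; the dynamic filtering of the training data and the no-repeated-attribute constraint are omitted here for brevity):

#include <cstddef>
#include <random>
#include <string>
#include <vector>

struct CMCell { std::string attribute, value; double entropy; };
using ConditionMatrix = std::vector<std::vector<CMCell>>;   // n + 1 rows

// Walk the matrix row by row; in each row pick two cells at random and keep the
// one with the lower entropy (binary tournament); stop once a condition of the
// decision attribute is selected as the consequence.
std::vector<CMCell> extractRule(const ConditionMatrix& cm,
                                const std::string& decisionAttribute,
                                std::mt19937& rng) {
    std::vector<CMCell> rule;
    for (const auto& row : cm) {
        std::uniform_int_distribution<std::size_t> pick(0, row.size() - 1);
        const CMCell& a = row[pick(rng)];
        const CMCell& b = row[pick(rng)];
        const CMCell& winner = (a.entropy <= b.entropy) ? a : b;
        rule.push_back(winner);
        if (winner.attribute == decisionAttribute)
            break;   // consequence selected: the rule is complete
    }
    return rule;
}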
4.4. Implementation

The overall algorithm of CMRL is illustrated in Figure 3. After constructing the condition matrix, which is filled with all possible conditions, we initialize each cell using the fitness function f1. Then, in the extraction phase, CMRL extracts rules repeatedly to generate one rule set; by repeating this operation, we obtain many rule sets ready for the successive operations in each generation. According to the predefined probabilities, crossover happens among rule sets to keep evolution going, and mutation (the matrix shuffle) happens in the CM to control its diversity and convergence. At the same time, all rules are evaluated with the fitness function f2, and elitists are put into a specific pool and kept alive until the end of evolution. The whole process is repeated in each generation until the output, evaluated by the accuracy of the rule set, is satisfactory. All evolution operations are carried out on the training data. Finally, the rule set generated from the CM is assessed on the testing data.

(Figure 3 is a flow chart with the steps Start, Initialization, Extraction, Evolution (Crossover, Shuffle, Elitist), a stopping condition (False loops back; True proceeds to Testing), and End.)

Figure 3. Flow chart of the implementation of CMRL
5. Optimizing the algorithm using rough sets

It is a well-known problem that Genetic Programming may suffer from the size problem during initialization [14]. A similar situation arises in CMRL because CMRL keeps extracting rules from the condition matrix randomly: the size of the rule set containing the elitists grows every time the rule extraction process is repeated, and after many generations it may become very large, which definitely increases the space complexity. To avoid this, we need to find a subset of the rule set that ensures the same or higher classification accuracy. Rough sets, similarly to fuzzy sets, were invented to cope with uncertainty. Optimization in a knowledge base consists in finding minimal sets of attributes that preserve the knowledge represented in the original information system [15]. We introduce a rule pruning process based on Rough Set theory to manage the rule set. Finding a minimal reduct is computationally an NP-hard problem, as shown in [16] and [17]. The related knowledge is described as follows.
5.1. Rough sets theory

In this section we introduce some basic formulations of rough sets theory and related concepts; more details can be found in [18]. An information system can be represented as S = (U, A), where U is the universe of discourse with a finite number of objects, and A is a set of attributes defined on U. A decision system is a special information system ∆ = (U, A ∪ {d}), in which a distinct attribute d is the decision attribute with corresponding decision value set Vd. For any subset of attributes B ⊆ A there exists an equivalence relation I_B = {(x, y) : a(x) = a(y), ∀a ∈ B, x, y ∈ U}. The equivalence class of an object x with respect to I_B is defined by [x]_B = {y : a(x) = a(y), ∀a ∈ B}. For any subsets X ⊆ U and B ⊆ A, X can be approximated by a rough set consisting of a lower and an upper approximation, defined respectively as:

B̲X := {x : [x]_B ⊆ X}
B̄X := {x : [x]_B ∩ X ≠ ∅}
These rough set approximations form a pair of tight bounds on X based on B, given by the inequalities B̲X ⊆ X ⊆ B̄X. The boundary set B̃X = B̄X \ B̲X (we use X \ Y to denote the set difference of X and Y) contains the objects that we cannot say for certain to be inside or outside of X given their values of the attributes in B. In a decision system, the set of equivalence classes generated by the decision attribute d is D = U / I_{d}. For any set X ⊆ U, each equivalence class in D can be approximated using a given attribute set B. The corresponding quality of approximation by the attribute set B can be measured by:

γ_B(X) = ( ∑_{Xi ∈ D} |B̲Xi| ) / |X|        (f3)
This measure constitutes the proportion of U that can be unambiguously classified using the attribute values in B. We can now define the reduct as follows.
Definition 1: For any X ⊆ U, R is called a reduct of A with respect to D if and only if γ_R(X) = γ_A(X) and B ⊂ R ⇒ γ_B(X) < γ_A(X).
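For illustration only (a sketch under assumed data structures, not part of CMRL itself), the equivalence classes, the B-lower approximation and the quality of approximation f3 could be computed as follows, normalising by the total number of objects:

#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

using TableRow = std::map<std::string, std::string>;   // attribute -> value for one object

// Partition object indices into equivalence classes of I_B
// (objects equal on every attribute in B fall into the same class).
std::vector<std::set<int>> equivalenceClasses(const std::vector<TableRow>& table,
                                              const std::vector<std::string>& B) {
    std::map<std::vector<std::string>, std::set<int>> groups;
    for (int i = 0; i < static_cast<int>(table.size()); ++i) {
        std::vector<std::string> key;
        for (const auto& a : B) key.push_back(table[i].at(a));
        groups[key].insert(i);
    }
    std::vector<std::set<int>> classes;
    for (const auto& g : groups) classes.push_back(g.second);
    return classes;
}

// B-lower approximation of X: union of the classes entirely contained in X.
std::set<int> lowerApproximation(const std::vector<std::set<int>>& classes,
                                 const std::set<int>& X) {
    std::set<int> lower;
    for (const auto& c : classes) {
        bool inside = true;
        for (int x : c) if (X.count(x) == 0) { inside = false; break; }
        if (inside) lower.insert(c.begin(), c.end());
    }
    return lower;
}

// f3: sum, over the decision classes, of the sizes of their B-lower
// approximations, divided by the number of objects.
double gammaB(const std::vector<TableRow>& table, const std::vector<std::string>& B,
              const std::vector<std::set<int>>& decisionClasses) {
    const auto classes = equivalenceClasses(table, B);
    std::size_t covered = 0;
    for (const auto& X : decisionClasses)
        covered += lowerApproximation(classes, X).size();
    return table.empty() ? 0.0 : static_cast<double>(covered) / table.size();
}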
r1: If Outlook=Overcast Then Class=P
r2: If Outlook=Sunny and Temperature=Mild and Humidity=High and Windy=False Then Class=N
r3: If Windy=False and Outlook=Rain Then Class=P
r4: If Windy=False and Outlook=Sunny and Humidity=Normal Then Class=P
r5: If Windy=False and Outlook=Sunny and Temperature=Hot and Humidity=High Then Class=N
r6: If Windy=True and Temperature=Mild and Outlook=Sunny and Humidity=Normal Then Class=P
r7: If Outlook=Sunny and Temperature=Cool Then Class=P
r8: If Outlook=Rain and Windy=True Then Class=N
r9: If Humidity=Normal and Temperature=Hot Then Class=P

Figure 4. An original rule set
5.2. Reduction for the rule set

In our system, we apply the method of finding a reduct of attributes to find a reduct of rules: all original rules are treated as "attributes". When the rule set R is determined, with its classification accuracy on the training data, we may find a minimal subset R' that covers the training data to the same degree as the original rule set. The reduct of a rule set is therefore defined as follows.
Definition 2: R' is called a rule-reduct of the rule set R if and only if the quality of classification of R' is equal to that of R, i.e., accuracy(R') = accuracy(R), and R'' ⊂ R' ⇒ accuracy(R'') < accuracy(R').
To find a rule-reduct, we introduce a new concept, the covering matrix. A rule set R = {r1, r2, …, rm} and a training dataset X = {x1, x2, …, xn} are given.
Definition 3: M_{m×n} is a covering matrix of the rule set R on the object set X if each element is
c_ij = 1 if r_i classifies x_j correctly (r_i ∈ R, x_j ∈ X), and c_ij = 0 otherwise.
According to this formulation, the quality of classification of R can be represented by:

γ_R(X) = ( ∑_{j=1}^{n} ( ∪_{i=1}^{m} c_ij ) ) / n        (f4)

where ∑ and ∪ are ordinary addition and logical addition respectively. The objective is to find a rule-reduct R' of R which makes γ_R'(X) = γ_R(X).
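A direct sketch of f4 (illustrative names): given the 0/1 covering matrix with rules as rows and objects as columns, the quality of any subset of rules is the number of columns covered by at least one selected rule, divided by the number of objects:

#include <cstddef>
#include <vector>

// c[i][j] = 1 if rule i classifies object j correctly, 0 otherwise.
double coveringQuality(const std::vector<std::vector<int>>& c,
                       const std::vector<int>& selectedRules,
                       std::size_t numObjects) {
    std::size_t covered = 0;
    for (std::size_t j = 0; j < numObjects; ++j) {
        int orOfColumn = 0;
        for (int i : selectedRules) orOfColumn |= c[i][j];   // logical addition over rules
        covered += orOfColumn;                               // ordinary addition over objects
    }
    return numObjects ? static_cast<double>(covered) / numObjects : 0.0;
}

Applied to the covering matrix of Table 3, selecting all nine rules or only {r1, …, r6, r8} both give 13/14 ≈ 0.928, as computed below.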
Using the dataset in Table 1 as our example, after evolution we obtain the original rule set shown in Figure 4 and the corresponding covering matrix shown in Table 3. From Table 3, we can get the quality of the rule set:

γ_R(X) = ( ∑_{j=1}^{14} ( ∪_{i=1}^{9} c_ij ) ) / 14 = 13/14 = 0.928

We can then search the rules for a subset R' of the rule set which has the same covering, and compute the quality of R' according to the same formulation:

γ_R'(X) = ( ∑_{j=1}^{14} ( ∪_{i=1,2,3,4,5,6,8} c_ij ) ) / 14 = 13/14 = 0.928

We can see that γ_R(X) = γ_R'(X) = 0.928, and R' is the only subset which satisfies this condition, so R' is the reduct of the original rule set. The corresponding covering matrix is shown in Table 4. Obviously, this is a special case: in general, a rule set may have several rule-reducts, and not all reducts classify the testing set well due to over-fitting, so we must select the best one among them to avoid over-fitting.

Table 3. Covering matrix of the original rule set

     x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11  x12  x13  x14
r1   0   0   1   0   0   0   1   0   0   0    0    1    1    0
r2   0   0   0   0   0   0   0   1   0   0    0    0    0    0
r3   0   0   0   1   1   0   0   0   0   1    0    0    0    0
r4   0   0   0   0   0   0   0   0   1   0    0    0    0    0
r5   1   0   0   0   0   0   0   0   0   0    0    0    0    0
r6   0   0   0   0   0   0   0   0   0   0    1    0    0    0
r7   0   0   0   0   0   0   0   0   1   0    0    0    0    0
r8   0   0   0   0   0   1   0   0   0   0    0    0    0    1
r9   0   0   0   0   0   0   0   0   0   0    0    0    1    0
Table 4. Covering matrix of the rule-reduct

     x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11  x12  x13  x14
r1   0   0   1   0   0   0   1   0   0   0    0    1    1    0
r2   0   0   0   0   0   0   0   1   0   0    0    0    0    0
r3   0   0   0   1   1   0   0   0   0   1    0    0    0    0
r4   0   0   0   0   0   0   0   0   1   0    0    0    0    0
r5   1   0   0   0   0   0   0   0   0   0    0    0    0    0
r6   0   0   0   0   0   0   0   0   0   0    1    0    0    0
r8   0   0   0   0   0   1   0   0   0   0    0    0    0    1
5.3. Further optimization

We propose a solution for selecting the optimal rule-reduct among the candidates. The main idea is to ensure that the rule-reduct keeps rules according to their respective contributions to the classification of the training set, where the contribution is evaluated by the fitness function f2: the higher the fitness of a candidate, the higher its priority to be selected. The algorithm is as follows.
Step 1: Sort all rules in the original rule set R according to their fitness, obtaining a permutation of R in descending order, i.e., R* = {r1*, r2*, ……, rm*};
Step 2: Construct a covering matrix M* for R*, and initialize the reduct R' = ∅ and the index i = 1;
Step 3: Put ri* into R';
Step 4: Evaluate the quality of R'. If γ_R'(X) = γ_R(X), R' is regarded as the optimal rule-reduct and the process stops; otherwise set i = i + 1 and go to Step 3.
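A sketch of Steps 1-4 (illustrative code; the inner quality computation repeats f4): rules are sorted by their f2 fitness and added one by one until the subset reaches the covering quality of the full rule set:

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// c[i][j] = 1 if rule i classifies object j correctly; fitness[i] is the f2 value of rule i.
std::vector<int> greedyRuleReduct(const std::vector<std::vector<int>>& c,
                                  const std::vector<double>& fitness,
                                  std::size_t numObjects) {
    auto quality = [&](const std::vector<int>& subset) {
        std::size_t covered = 0;
        for (std::size_t j = 0; j < numObjects; ++j) {
            int orOfColumn = 0;
            for (int i : subset) orOfColumn |= c[i][j];
            covered += orOfColumn;
        }
        return numObjects ? static_cast<double>(covered) / numObjects : 0.0;
    };

    // Step 1: permutation of R in descending order of fitness.
    std::vector<int> order(c.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return fitness[a] > fitness[b]; });

    // Steps 2-4: grow R' until it reaches the quality of the full rule set R.
    const double target = quality(order);
    std::vector<int> reduct;
    for (int r : order) {
        reduct.push_back(r);
        if (quality(reduct) >= target) break;   // same quality as R: optimal rule-reduct found
    }
    return reduct;
}

Since a subset can never cover more objects than the full rule set, the stopping test is equivalent to the equality check of Step 4.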
6. Experimental results and analysis

We have implemented our algorithm in C++ and tested it on several benchmark problems with the same parameters and conditions. The crossover rate is set to 0.9, and the thresholds of Diversity and Convergence are set to 1.0 and half the width of the CM respectively. Each data set is separated into a training set and a testing set. We adopt 5-fold cross-validation so that the testing data covers the whole dataset: after five iterations, all data have been used for testing, and the average and standard deviation can be computed to evaluate the performance of the classifier. All datasets used in the experiments are from the UCI repository [19].
Our first experiment compares the results of Condition Matrix based GP for rule learning with and without the enhancement proposed in Section 5. The related results are shown in Table 6; italics indicate testing accuracies that are not lower than the training accuracy, and bold highlights the optimal accuracy or rule-set size. Obviously, the size of the rule set is decreased significantly by incorporating the
reduct technique into the original CMRL. This partly explains why the majority of the datasets have higher testing accuracies than training accuracies (focusing mainly on the results without the reduct): our algorithm has effectively avoided over-fitting. Furthermore, the testing accuracy of the improved version generally stays at the same level as the original one, and we have also observed that the classification accuracy is occasionally higher than the original. This shows that the improvement not only reduces the size of the rule set but also ensures that performance does not drop, and sometimes even increases. Tables 7 and 8 show the comparison between our algorithm and other algorithms, including some traditional classifiers and classifiers using G3P [20]. Although straightforward comparisons are not always conclusive, these tables provide a useful way to extract valuable conclusions about the capabilities of the analyzed system. We observe that our algorithm has better performance on some specific datasets. It is clear that CMRL still has some distinct advantages to be exploited and great potential to be improved; among GP approaches to rule learning, CMRL with reducts presents a decisive advantage.

Table 6. Comparative results with and without the improvement
           CMRL
           Original version                   Improved version
Dataset    Training  Testing  SE     Rules    Testing  SE     Rules
Monk1      0.985     0.99     0.002  41.5     0.986    0.048  38.5
Monk2      0.6572    0.6573   0.005  85.8     0.6573   0.048  57
Monk3      0.986     0.958    0.002  20.8     0.964    0.034  20.8
Pima       0.754     0.754    0      32       0.754    0      6
Zoo        0.982     0.954    0      36.2     0.936    0      7.6
Horse      0.79      0.774    0.023  38.8     0.772    0.05   13.8
Flare      0.718     0.64     0.016  52.5     0.64     0.07   23
Cancer     0.862     0.863    0.04   26       0.863    0.071  18.8
Table 7. Comparative results for the Pima dataset

Algorithm   Train   Test      Algorithm   Train   Test
CMRL        0.754   0.754     Castle      0.74    0.742
G3P-DT      0.79    0.712     QuaDisc     0.763   0.748
G3P-FRBS    0.827   0.75      Bayes       0.761   0.738
G3P-ANN     0.767   0.754     C4.5        0.869   0.73
G3P-FPN     0.764   0.732     IndCart     0.921   0.729
BackProp    0.802   0.752     BayTree     0.992   0.729
Cal5        0.768   0.75      LVQ         0.899   0.728
Cart        0.773   0.745     Kohonen     0.866   0.727
Ac2         100     0.724     Alloc80     0.712   0.699
NewId       100     0.711     KNN         100     0.676
Cn2         0.99    0.711     Default     0.65    0.65
Table 8. Comparative results for the Horse dataset

Algorithm   Train   Test      Algorithm   Train   Test
CMRL        0.79    0.774     G3P-DT      0.787   0.685
G3P-FRB     0.831   0.652     ID3         N/A     0.783
G3P-ANN     0.776   0.67      R-C4.5      N/A     0.844
G3P-FPN     0.82    0.677     A-C4.5      N/A     0.818
7. Conclusions

Genetic programming can be used for classification. IMGP is a GP architecture that encloses the population in a matrix. In this paper, we have applied the instruction matrix idea to rule learning and proposed a new framework, CMRL. The condition matrix is used for extracting rules with operations similar to those in GA, and after evolution we obtain a better rule set as the output of GP. Compared with existing algorithms, CMRL performs better on some benchmark datasets; it inherits the characteristics of IMGP and extends them to multi-class problems. We can summarize our contributions as follows.
• A new framework for rule learning, an extension of IMGP, has been proposed;
• Rough Set theory has been used to find rule-reducts, which yield a smaller rule set than the original one while maintaining its classification accuracy;
• CMRL achieves better accuracy than traditional algorithms on some benchmark datasets.
In future work, we intend to improve this framework by adjusting the fitness functions and changing the parameters. We envisage that the size of the rule set can be reduced further and that the speed can also be improved.
8. Acknowledgment
The work described in this paper was partially supported by two grants from the Research Grants Council of the Hong Kong Special Administrative Region, (Project No. CUHK4192/03E and China CUHK4132/05E).
References

[1] J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, MA, USA, 1992.
[2] J.H. Holland, Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, Michigan, 1975.
[3] J.P. Rosca, "Analysis of complexity drift in genetic programming", Proceedings of the Second Annual Conference on Genetic Programming, Stanford University, CA, USA, 1997, pp. 286-294.
[4] W.B. Langdon and R. Poli, Foundations of Genetic Programming, Springer-Verlag, 2002.
[5] Gang Li, Kin Hong Lee, and Kwong Sak Leung, "Evolve Schema Directly Using Instruction Matrix Based Genetic Programming", in Proceedings of EuroGP 2005, Springer-Verlag, Berlin Heidelberg, 2005, pp. 271-280.
[6] A. Tsakonas, G. Dounias, H. Axer, and D.G. von Keyserlingk, "Data Classification using Fuzzy Rule-Based Systems represented as Genetic Programming Type-Constrained Trees", in Proceedings of the Workshop on Computational Intelligence, University of Edinburgh, 2001, pp. 162-168.
[7] S.M. Cheang, K.H. Lee, and K.S. Leung, "Data classification using genetic parallel programming", in Proceedings of GECCO-2003, Springer-Verlag, Chicago, 12-16 July 2003, pp. 1918-1919.
[8] R.E. Marmelstein and G.B. Lamont, "Pattern classification using a hybrid genetic program decision tree approach", in Proceedings of the Third Annual Conference on Genetic Programming, University of Wisconsin, Madison, USA, 22-25 July 1998, pp. 223-231.
[9] J.R. Quinlan, "Induction of decision trees", Machine Learning, vol. 1, 1986, pp. 81-106.
[10] R.M. Gray, Entropy and Information Theory, Springer-Verlag, New York, 1990.
[11] B.G. Buchanan and E.H. Shortliffe, eds., Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project, Addison-Wesley, 1984.
[12] A. Brindle, Genetic Algorithms for Function Optimization, University of Alberta, Department of Computer Science, 1981.
[13] D. Bhandari, C.A. Murthy, and S.K. Pal, "Genetic Algorithm with Elitist Model and Its Convergence", International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 1996.
[14] A. Ratle and M. Sebag, "Genetic programming and domain knowledge: beyond the limitations of grammar-guided machine discovery", Proceedings of Parallel Problem Solving from Nature (PPSN VI), Springer-Verlag, Paris, 2000, pp. 211-220.
[15] Lech Polkowski, Rough Sets: Mathematical Foundations, Physica-Verlag, Heidelberg, 2002.
[16] A. Skowron, The Implementation of Algorithms Based on the Discernibility Matrix, manuscript, 1989.
[17] A. Skowron and C. Rauszer, "The discernibility matrices and functions in information systems", Intelligent Decision Support, Kluwer, Dordrecht, 1992, pp. 311-362.
[18] J. Komorowski, L. Polkowski, and A. Skowron, "Rough sets: a tutorial", in Rough-Fuzzy Hybridization: A New Method for Decision Making, Springer-Verlag, 1998.
[19] C. Merz and P. Murphy, "UCI repository of machine learning databases", 1996. [Online]. Available: ftp://ftp.ics.uci.edu/pub/machine-learning-databases
[20] A. Tsakonas, "A comparison of classification accuracy of four genetic programming-evolved intelligent structures", Information Sciences, vol. 176, no. 6, 2006, pp. 691-724.