International Journal on Artificial Intelligence Tools, Vol. 26, No. 3 (2017) 1750006 (26 pages) © World Scientific Publishing Company, DOI: 10.1142/S0218213017500063
Rule Extraction from Training Data Using Neural Network

Saroj Kumar Biswas∗, Manomita Chakraborty†, Biswajit Purkayastha‡, Pinki Roy§ and Dalton Meitei Thounaojam¶
Computer Science and Engineering Department, National Institute of Technology, Silchar-788010, Assam, India
∗[email protected] †[email protected] ‡[email protected] §[email protected] ¶[email protected]

Received 21 August 2015; Accepted 17 November 2016; Published 23 June 2017

Data Mining is a powerful technology that helps organizations concentrate on their most important data by extracting useful information from large databases. One of the most commonly used techniques in data mining is the Artificial Neural Network, owing to its high performance in many application domains. Despite its many advantages, a main drawback of the Artificial Neural Network is its inherent black box nature, which is the chief obstacle to using it in data mining. Therefore, this paper proposes an algorithm that extracts rules from a neural network using classified and misclassified data, converting the black box nature of the Artificial Neural Network into a white box. The proposed algorithm is a modification of an existing algorithm, Rule Extraction by Reverse Engineering (RxREN). It extracts rules from a trained neural network for datasets with mixed-mode attributes using a pedagogical approach, and it uses both classified and misclassified data to find the data ranges of significant attributes in their respective classes, which is the innovation of the proposed algorithm. The experimental results clearly show that the performance of the proposed algorithm is superior to that of existing algorithms.

Keywords: Data mining; artificial neural networks; rule extraction; pedagogical; RxREN algorithm; classification.
1. Introduction

With the advent of technology, huge amounts of data are collected and stored in databases every day. These raw data are converted to useful information using data mining techniques. Data Mining is an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between data components. Research in this area is growing as the amount of data collected increases, and hence various issues related to handling large volumes of data are arising.1 Database researchers are more
concerned about efficient retrieval of hidden information (patterns) from large databases using available database technology. Data Mining is the most commonly used knowledge acquisition technique for knowledge discovery. There are many data mining tasks, and classification is one of them.2 Many algorithms for classification have been designed, enabling researchers to solve problems in various domains. Recently, there has been a trend of using artificial neural networks (ANNs) for data mining tasks.3 ANNs are a commonly used technique for the classification of data with mixed-mode attributes,4 as they achieve high classification accuracy on huge volumes of data with low computational cost.5 It is accepted that an Artificial Neural Network (ANN) provides an effective means of nonlinear data processing for real-world classification problems. Despite its many advantages, a main drawback of the ANN is its inherent black box nature; i.e., it is difficult to understand how an ANN generates its output. Many algorithms have been designed to resolve this problem by converting the black box into a white box through rule extraction from the ANN. Rule extraction represents the internal knowledge of a neural network in the form of symbolic rules. Rule extraction techniques can be categorized as decompositional, pedagogical and eclectic. Decompositional techniques analyze the weights between units and the activation functions ("looking inside" the network) to extract rules. Pedagogical techniques treat the network as a "black box" and extract rules by examining the relationship between the inputs and outputs. Eclectic approaches combine decompositional and pedagogical techniques.6 Many rule extraction algorithms using ANNs have been designed for classification based on these three techniques. KT7 and SUBSET8 are two well-known algorithms which fall under the decompositional rule extraction technique. Fu7 developed the KT algorithm, which can handle ANNs with a smooth activation function such as back propagation (BP) with a sigmoid, where the activation function is bounded in the range [0, 1]. The SUBSET algorithm, suggested by Towell and Shavlik in 1993, assumes an ANN where the output of each neuron in the network is either close to zero or close to one, and explicitly searches for subsets of incoming weights that exceed the bias on a unit. The MofN algorithm8 is an extension of SUBSET which clusters the weights of a trained network into equivalence classes and extracts m-of-n style rules. Setiono et al.9 have proposed a decompositional rule extraction technique called NeuroRule. A component of NeuroRule is an automatic rule generation method called rule generation (RG). Each rule is generated by RG such that it covers as many samples from the same class as possible with the minimum number of attributes in the rule condition. Rule extraction (RX) is another decompositional rule extraction algorithm, proposed by Setiono,10 that works on discrete data. RX recursively generates rules by analyzing the discretized hidden unit activations of a pruned network with one hidden layer. When the number of input connections to a hidden unit is larger than a certain threshold, a new artificial neural network is created and trained with the discretized activation values as the target output. Otherwise, the rule generation method X2R11 is applied to obtain rules that explain the
hidden unit activation values in terms of the inputs. Both NeuroRule and RX extract rules from ANNs that have been pruned, as removing irrelevant connections simplifies the rule extraction process; however, both require discretization of the continuous attributes of a dataset before applying the ANN. Setiono et al.12 have proposed another decompositional rule extraction technique, NeuroLinear, which is able to extract oblique classification rules from datasets with continuous attributes. The full-RE method13 by Taha and Ghosh extracts accurate rules without normalization or binarization of the continuous attributes of a dataset prior to network training. For each hidden node, it generates intermediate rules based on a linear combination of the input attributes, then discretizes the input attributes to generate the final rules, solving a linear programming problem to select the relevant discretization boundaries. Anbananthen et al.14 have proposed the Artificial Neural Network Tree (ANNT) method for rule extraction based on the decompositional approach, which generates fewer rules with higher accuracy. Odajima et al.15 have proposed a Greedy Rule Generation (GRG) method for generating classification rules from a dataset with discrete attributes. A feed-forward neural network with one hidden layer is trained, and the GRG algorithm is applied to its discretized hidden unit activation values. Setiono16 has proposed a Rule Extraction algorithm (Re-RX) that generates classification rules from datasets having both discrete and continuous attributes. The algorithm is recursive in nature and generates hierarchical rules in which the rule conditions for the discrete attributes are disjoint from those for the continuous attributes. Both GRG and Re-RX use the decompositional technique for rule extraction. Hara et al.17 have proposed an Ensemble-Recursive-Rule extraction (E-Re-RX) algorithm which is based on the Re-RX algorithm16 and uses two artificial neural networks to achieve high recognition rates. Hayashi et al.18 have proposed a three-ensemble neural network rule extraction method based on the recursive rule extraction algorithm, which uses three ANNs to generate symbolic rules. Sestito et al.19 have proposed a pedagogical rule extraction technique called BRAINNE, which extracts rules from an ANN trained with back propagation (BP) and does not require discretization of continuous data. The Trepan algorithm20 extracts a decision tree from a trained network, which is a pedagogical approach to rule extraction. The trained network is used as an "oracle" that is able to answer queries during the learning process and that determines the class of each instance presented in a query. Etchells et al.21 have proposed an Orthogonal Search-based Rule Extraction algorithm (OSRE) which is applied to both support vector machines (SVMs) and ANNs. This algorithm converts the given input into 1-from-N form and then performs rule extraction based on activation responses. Guo et al.22 have proposed Binarized Input-Output Rule Extraction (BIO-RE), a pedagogical rule extraction approach that extracts binary rules from any ANN. Augasta et al.5 have proposed a rule extraction algorithm called Rule extraction by Reverse Engineering the Neural Network (RxREN), which extracts classification rules from a trained ANN using a pedagogical approach. The algorithm relies on reverse engineering to prune the insignificant input neurons and to discover the technological principles of each significant input neuron of the
ANN. Awudu et al.23 have extended the Trepan algorithm20 and proposed the X-TREPAN algorithm for extracting decision trees from neural networks. Setiono et al.24 have proposed an eclectic rule extraction algorithm, Fast Extraction of Rules from Neural Networks (FERNN), which extracts rules without network retraining and thus makes the process faster. FERNN first identifies the relevant hidden units with the C4.5 algorithm based on their information gains and finds the sets of relevant network connections from the input units to these hidden units by checking the magnitudes of their weights. Finally, it generates rules that distinguish the two subintervals of the hidden activation values in terms of the network inputs. Garcez et al.25 have developed a method to extract rules from a neural network by defining a partial ordering on the set of input vectors. An eclectic technique is then applied to combine elements of the decompositional and pedagogical approaches: they analyze the ANN at the individual unit level and extract rules at the global level. Jivani et al.26 have compared the three rule extraction techniques based on network architecture, efficiency, extracted rules and accuracy, and shown that the pedagogical approach is faster than both the decompositional and the eclectic approaches. RxREN5 is a rule extraction algorithm that extracts classification rules from a trained ANN by reverse engineering the network: it prunes the insignificant input neurons and discovers the technological principles of each significant input neuron of the ANN. However, the algorithm uses only misclassified patterns to find the data ranges of each significant neuron in the respective classes, whereas both misclassified and properly classified patterns are important for finding the data range of an attribute that correctly classifies a pattern. By considering only the patterns misclassified in the absence of an attribute, only the unique patterns for that attribute can be determined. Unique patterns are those which can be properly classified only in the presence of that attribute. But there may be some common patterns in the dataset which can be classified by more than one attribute. If any one of these attributes is absent from the network, those patterns will still be properly classified because the other attribute(s) are present. So, if only the misclassified patterns for an attribute are considered, these common and important patterns will never be taken into account when computing the data range of the attribute. Therefore, a proper and generalized data range for an attribute cannot be determined from misclassified patterns alone; the problem can only be solved if the classified patterns for an attribute are considered along with the misclassified patterns. Moreover, for pruning insignificant attributes, RxREN5 tolerates a 1% decrease in accuracy for the pruned network, and therefore some significant attributes may be removed from the pruned network. Keeping all these points in consideration, Rule Extraction from Neural Network using Classified and Misclassified data (RxNCM) is proposed in this paper. The proposed algorithm is a modification of RxREN.5 RxNCM uses both classified and misclassified data to find the data ranges of significant attributes in the respective classes, which are further used to extract rules. Also, unlike
RxREN, the algorithm does not tolerate a 1% decrease in pruning accuracy; it only prunes the network if accuracy increases. The performance of RxNCM is validated on nine datasets, and it is observed from the experimental results given in Tables 8–10 that the accuracy of the rules extracted by the proposed RxNCM algorithm is higher than the accuracy of the rules extracted by the RxREN algorithm5 for classification tasks. The rules generated by the RxNCM algorithm are more comprehensible (local comprehensibility) than those of RxREN5 and X-TREPAN.23 RxNCM produces fewer rules than Garcez et al.25 for classification tasks.

2. Proposed Methodology

The data is initially represented in such a way that it is suitable for ANN training, and then a feed-forward neural network is trained with the training dataset. Thereafter the rule extraction process is executed, which uses the trained ANN and the examples properly classified by the ANN. The rule extraction task comprises pruning, data range computation, initial rule construction, rule pruning and rule update steps. All the steps involved in the proposed model are described below.

2.1. Data representation

Representation of the input and output attributes of a learning problem using an ANN is one of the key factors influencing the quality of the solution(s) that one can obtain. Datasets are generally combinations of numeric, symbolic, image and text values, possibly with missing values. Before training, the data is put into a form suitable for ANN training by removing missing values and converting all mixed attribute values to numeric values. The data are collected in the form of patterns that have input attributes as antecedents and output attributes as consequences.

2.2. Artificial neural network training

A BP neural network with one hidden layer is used. The number of nodes in the input layer is the same as the number of input attributes, and the number of nodes in the output layer is one. The number of hidden nodes, h, is selected based on the mean square error of the network: h is varied from (l + 1) to 2l, where l is the number of input attributes, and the architecture which gives the smallest mean square error is selected as the optimal architecture.27–29 The optimal trained ANN is taken for further experimentation; a sketch of this selection procedure is given below.

2.3. Rule extraction by RxNCM

The rule extraction step takes as input the trained ANN with l input neurons, h hidden neurons and n output neuron(s) for a given dataset and selects a set T of correctly classified examples from the training dataset. The rule extraction process consists of pruning, data range computation, initial rule construction, rule pruning and rule update steps.
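To make the architecture search of Sec. 2.2 concrete, the following is a minimal sketch, assuming a scikit-learn MLPClassifier as a stand-in for the paper's back-propagation network; the helper name select_architecture and the use of a held-out validation set for the MSE criterion are illustrative assumptions, not the authors' code.

```python
# Minimal sketch: pick the hidden-layer size h in (l+1)..2l that yields the
# smallest mean square error, as described in Sec. 2.2. Labels are assumed
# to be numeric (0/1) so that MSE on predictions is well defined.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import mean_squared_error

def select_architecture(X_train, y_train, X_val, y_val):
    l = X_train.shape[1]                      # number of input attributes
    best_net, best_mse = None, np.inf
    for h in range(l + 1, 2 * l + 1):         # h varies from l+1 to 2l
        net = MLPClassifier(hidden_layer_sizes=(h,), max_iter=1000,
                            learning_rate_init=0.01, random_state=0)
        net.fit(X_train, y_train)
        mse = mean_squared_error(y_val, net.predict(X_val))
        if mse < best_mse:                    # keep the smallest-MSE network
            best_net, best_mse = net, mse
    return best_net
```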
2.3.1. Pruning

For each input neuron li of the trained ANN, the algorithm RxNCM finds the examples Ei incorrectly classified by the ANN without li on T, and the number of examples erri in Ei, which is used to decide whether the ith input neuron is significant or not. To identify the insignificant input neurons, it computes the minimum of the erri values, which is taken as the threshold θ, and forms the set of neurons li with erri = θ. An input neuron li is called insignificant if erri = θ; that is, an attribute li with minimum misclassification rate is considered insignificant, since this attribute is not essential for the classification of any of the patterns and its removal will not affect the classification task. B is the set of insignificant input neurons:

B = {li, i = 1, …, m | erri = θ}
This forms the temporary pruned network by removing all insignificant neurons of B from the trained ANN, whose classification accuracy Pacc is then computed on the validation dataset. The algorithm accepts this temporary pruned network as the pruned network and repeats the pruning process while (Pacc ≥ Nacc), where Nacc is the accuracy of the trained ANN on the validation dataset.

2.3.2. Data range computation

RxNCM finds the data ranges of the significant input neurons. RxREN5 uses only misclassified data to find the data range, but both classified and misclassified data for an attribute are essential for proper classification of a pattern. RxNCM first finds the properly classified examples Pi from T for each significant input neuron li of the pruned network. Then, for each significant neuron li of the pruned network, it finds UCMi, the union of the misclassified data Ei and the properly classified data Pi for li on dataset T:
UCMi = Pi ∪ Ei
mpi holds the total number of examples in UCMi. A data length matrix, as shown in Fig. 1, is created for finding the data range of the ith attribute of the pruned network by placing the examples belonging to UCMi within the proper ranges with respect to their respective classes; this yields the number of examples mcik belonging to each range. Note that k ranges over 1, …, n.
Fig. 1. Data length matrix.
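As an illustration of this bookkeeping, the sketch below computes Ei, Pi, UCMi, mpi and the per-class counts mcik. Simulating the absence of an input neuron by zeroing its column is an assumption of the sketch; the paper removes the input neuron from the trained network itself.

```python
# Sketch of the UCM_i bookkeeping behind the data length matrix.
import numpy as np

def ucm_for_attribute(net, X, y, i, n_classes):
    X_no_i = X.copy()
    X_no_i[:, i] = 0.0                         # l_i "absent" (assumption)
    E_i = net.predict(X_no_i) != y             # misclassified without l_i
    P_i = net.predict(X) == y                  # properly classified for l_i
    ucm = E_i | P_i                            # UCM_i = P_i ∪ E_i
    mp_i = int(ucm.sum())                      # total examples in UCM_i
    # mc_ik: how many UCM_i examples fall in each class C_k
    mc_i = np.array([int(np.sum(ucm & (y == k))) for k in range(n_classes)])
    return ucm, mp_i, mc_i
```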
Fig. 2. Data range matrix.
Finally, the algorithm generates a data range matrix, as shown in Fig. 2, by finding the lower range Lik and upper range Uik of the properly classified and misclassified data (i.e., the data from UCMi) for each attribute li in each class Ck of the pruned network. An attribute may not be significant for classifying patterns in all classes; that is, a particular attribute may not be necessary for classifying all n classes of a dataset, but may be required to classify one target class or some k target classes, where k ≤ n. Therefore the algorithm selects data ranges of those attributes for each class which satisfy the following condition:
mcik ≥ α · mpi,  α ∈ [0.1, 0.5]

Let α be a fraction which specifies the minimum percentage of classified and misclassified data required for knowledge discovery. DMik represents the data range of the ith input in the kth class:

DMik = [Lik, Uik]  if (mcik ≥ α · mpi)
DMik = 0           otherwise
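A minimal sketch of this computation, reusing the masks and counts from the previous sketch; NaN marks an absent range where the paper writes 0, and the default alpha is an arbitrary value inside the stated interval.

```python
# Sketch: build the data range matrix DM. DM[i, k] = [L_ik, U_ik] when class
# C_k holds at least an alpha-fraction of UCM_i, and stays NaN otherwise.
import numpy as np

def data_range_matrix(X, y, ucm_masks, mp, mc, n_classes, alpha=0.3):
    m = len(ucm_masks)                         # significant attributes
    DM = np.full((m, n_classes, 2), np.nan)
    for i in range(m):
        for k in range(n_classes):
            if mc[i][k] > 0 and mc[i][k] >= alpha * mp[i]:
                vals = X[ucm_masks[i] & (y == k), i]
                DM[i, k] = [vals.min(), vals.max()]   # [L_ik, U_ik]
    return DM
```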
2.3.3. Rule construction

The RxNCM algorithm uses the derived mandatory data range of each significant input attribute li to construct rules for classifying data. The algorithm constructs a rule for each target class Ck by using the nonzero data ranges available in the corresponding column k of the data range matrix. The rules are written in descending order of the number of attributes required by each rule; i.e., the first rule is written for the class which requires the largest number of attributes. In general, the rules can be written as:

if ((data(l1) ≥ L11 ∧ data(l1) ≤ U11) ∧ (data(l2) ≥ L21 ∧ data(l2) ≤ U21) ∧ ⋯ ∧ (data(lm) ≥ Lm1 ∧ data(lm) ≤ Um1)) then class = C1
else if ((data(l1) ≥ L12 ∧ data(l1) ≤ U12) ∧ (data(l2) ≥ L22 ∧ data(l2) ≤ U22) ∧ ⋯ ∧ (data(lm) ≥ Lm2 ∧ data(lm) ≤ Um2)) then class = C2
else ...
if ((data(l1) ≥ L1,n−1 ∧ data(l1) ≤ U1,n−1) ∧ (data(l2) ≥ L2,n−1 ∧ data(l2) ≤ U2,n−1) ∧ ⋯ ∧ (data(lm) ≥ Lm,n−1 ∧ data(lm) ≤ Um,n−1)) then class = Cn−1
else class = Cn.
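A sketch of how such rules can be built and applied follows; the tuple-based rule representation is an assumption of the sketch, not the authors' data structure.

```python
# Sketch: one conjunctive rule per class from the nonzero ranges in column k
# of the data range matrix; classes are ordered by descending number of
# conditions, and the last class becomes the default (the final "else").
import numpy as np

def construct_rules(DM, class_labels):
    n_classes = DM.shape[1]
    n_conds = [int(np.sum(~np.isnan(DM[:, k, 0]))) for k in range(n_classes)]
    order = np.argsort(n_conds)[::-1]          # most conditions first
    rules = []
    for k in order[:-1]:                       # the last class is the default
        conds = [(i, DM[i, k, 0], DM[i, k, 1])
                 for i in range(DM.shape[0]) if not np.isnan(DM[i, k, 0])]
        rules.append((class_labels[k], conds))
    return rules, class_labels[order[-1]]

def apply_rules(x, rules, default):
    for label, conds in rules:                 # first matching rule wins
        if all(L <= x[i] <= U for i, L, U in conds):
            return label
    return default
```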
2.3.4. Rule pruning

The rule pruning step removes irrelevant conditions from the initial rules if the accuracy increases after pruning. The algorithm first measures the accuracy Racc of the initial rule Rk on the validation dataset. It then computes the accuracy Rnewacc on the validation dataset after removing each condition from Rk, where cnj denotes the jth condition of a rule. The algorithm removes the condition cnj from Rk if Rnewacc ≥ Racc. The pruned rule Rk is as follows:

Rk = (Rk − cnj)  if (Rnewacc ≥ Racc)
Rk = Rk          otherwise
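A sketch of this greedy condition removal, reusing the rule representation and apply_rules helper from the rule construction sketch; rules_accuracy is an assumed scoring helper.

```python
# Sketch of rule pruning: drop each condition cn_j of R_k and keep the
# deletion whenever validation accuracy does not fall (Rnewacc >= Racc).
import numpy as np

def rules_accuracy(rules, default, X, y):
    preds = [apply_rules(x, rules, default) for x in X]
    return float(np.mean([p == t for p, t in zip(preds, y)]))

def prune_rule(r, rules, default, X_val, y_val):
    label, conds = rules[r]
    R_acc = rules_accuracy(rules, default, X_val, y_val)
    j = 0
    while j < len(conds):
        trial = conds[:j] + conds[j + 1:]      # R_k without cn_j
        rules[r] = (label, trial)
        new_acc = rules_accuracy(rules, default, X_val, y_val)
        if new_acc >= R_acc:                   # keep the deletion
            conds, R_acc = trial, new_acc
        else:                                  # restore cn_j and move on
            rules[r] = (label, conds)
            j += 1
    return rules
```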
2.3.5. Rule update

Unlike RxREN,5 the proposed RxNCM algorithm considers both the classified and the misclassified data when finding the data range matrix, and thus the rules obtained from the data range matrix after pruning classify the maximum number of patterns correctly. However, there may still be some misclassifications due to overlapping of the data ranges of an attribute in different classes, as shown in Fig. 3. The area between the solid line and the dotted line shows the overlapping data ranges between the two classes.
Fig. 3. Class 1 data overlapped with Class 2 data.
Some values may be very frequent for one class and less frequent for others within a range of data. Therefore, the rule update step improves the accuracy by shifting the upper or lower range of the data, or both. Consequently, it updates the data ranges involved in the rules based on the range of classified and misclassified data. Each condition cnj in a rule consists of one lower limit value (L), one upper limit value (U), or both. Let minik and maxik be the minimum and maximum values of the attribute li for class Ck on the data newly classified and misclassified by the rule set R. The algorithm modifies the condition cnj if Rnewacc ≥ Racc, where Racc is the classification accuracy of the rule set R on the validation dataset and Rnewacc is the accuracy of the newly modified rule set, i.e., the accuracy of R after modifying the condition cnj with the new limits minik and maxik, on the validation dataset. The rule update continues as long as the accuracy increases.
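A sketch of this update step, again reusing the helpers from the previous sketches; the strict-improvement test mirrors the paper's rule that updating continues only while accuracy increases.

```python
# Sketch of rule update: refit each condition's limits to the min/max of the
# examples the current rule set assigns to that class (min_ik, max_ik), and
# keep a change only when validation accuracy improves.
import numpy as np

def update_rules(rules, default, X_val, y_val):
    R_acc = rules_accuracy(rules, default, X_val, y_val)
    improved = True
    while improved:
        improved = False
        preds = np.array([apply_rules(x, rules, default) for x in X_val])
        for r, (label, conds) in enumerate(rules):
            covered = preds == label           # classified and misclassified
            if not covered.any():
                continue
            for j, (i, L, U) in enumerate(list(conds)):
                new_L = X_val[covered, i].min()      # min_ik
                new_U = X_val[covered, i].max()      # max_ik
                trial = conds[:j] + [(i, new_L, new_U)] + conds[j + 1:]
                rules[r] = (label, trial)
                new_acc = rules_accuracy(rules, default, X_val, y_val)
                if new_acc > R_acc:                  # keep the new limits
                    conds, R_acc, improved = trial, new_acc, True
                else:
                    rules[r] = (label, conds)        # revert cn_j
    return rules
```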
This step generalizes the rules by eliminating less frequent values and selects the optimal range of an attribute for a class, i.e., the range which covers more patterns of that class.

3. Rule Extraction from Neural Network Using Classified and Misclassified Data (RxNCM) Algorithm

The outline of the algorithm is given as follows:

Input: A trained feed-forward artificial neural network (ANN) with l input neurons, h hidden neurons and one output neuron, and a dataset with np examples.

Output: Symbolic classification rules.

Notations:
T — a set of examples correctly classified by the ANN on the training dataset.
Nacc — accuracy of the trained ANN on the validation dataset.
B — a set of insignificant input neurons.
Pacc — accuracy of the pruned ANN on the validation dataset.
m — the number of input neurons in the pruned ANN.
li — ith neuron in the input layer.
Ck — kth target class of a dataset.
erri — the number of examples incorrectly classified by the trained ANN without li.
Ei — examples incorrectly classified by the trained ANN without li.
Pi — examples properly classified for input neuron li.
UCMi — properly classified and misclassified examples for significant input li of the pruned network.
mpi — the total number of examples in UCMi for significant input li of the pruned network.
mcik — the number of examples in UCMi for significant attribute li in class Ck.
Pruning:
Step 1. For each input neuron li of the trained ANN, find the examples Ei incorrectly classified by the ANN in the absence of li on T, and let erri be the number of examples in Ei.
Step 2. Compute the threshold θ = min(erri), i = 1, …, m.
Step 3. Form the set B = {li | erri = θ}, the set of insignificant input neurons.
Step 4. Form the temporary pruned network by removing all the insignificant input neurons of B from the trained ANN.
Step 5. Compute the accuracy Pacc of the temporary pruned network on the validation dataset.
Step 6. If (Pacc ≥ Nacc), then accept this temporary pruned network as the pruned network and go to Step 1; else stop the process.
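A sketch of this pruning loop follows; masking pruned inputs to zero stands in for removing neurons from the network (an assumption), and a scikit-learn-style net.predict is assumed.

```python
# Sketch of Steps 1-6: repeatedly mask the least-useful inputs while the
# pruned network's validation accuracy stays at or above the original N_acc.
import numpy as np

def mask(X, keep):
    X_m = np.zeros_like(X)                     # absent inputs read as zero
    X_m[:, keep] = X[:, keep]
    return X_m

def prune_inputs(net, X_T, y_T, X_val, y_val):
    active = list(range(X_T.shape[1]))
    N_acc = (net.predict(X_val) == y_val).mean()      # trained-ANN accuracy
    while True:
        # Steps 1-2: err_i without each l_i; threshold theta = min(err_i)
        errs = {i: int((net.predict(mask(X_T, [a for a in active if a != i]))
                        != y_T).sum())
                for i in active}
        theta = min(errs.values())
        B = [i for i in active if errs[i] == theta]   # Step 3: insignificant
        trial = [i for i in active if i not in B]     # Step 4: temp pruning
        if not trial:
            return active
        P_acc = (net.predict(mask(X_val, trial)) == y_val).mean()  # Step 5
        if P_acc >= N_acc:                            # Step 6: accept, repeat
            active = trial
        else:
            return active                             # keep previous network
```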
Data range computation:
Step 7. For each significant input neuron li in the pruned network, find the examples properly classified by the pruned network, namely Pi, on T.
Step 8. Group the examples belonging to UCMi = Ei ∪ Pi with respect to each target class Ck and find the number of examples mcik in each group, where 1 ≤ k ≤ n.
Step 9. Select only the classes of UCMi which satisfy the condition mcik ≥ α · mpi, where α ∈ [0.1, 0.5], and find the minimum value Lik and maximum value Uik to construct the rules.

Rule construction:
Step 10. Arrange the classes Ck in descending order of the number of attributes covered by each class according to Step 9.
Step 11. For k = 1 to n, do Steps 12 to 15.
Step 12. j = 1.
Step 13. For i = 1 to m:
Step 14. For the class k, if the ith input neuron is selected as per Step 9, then cnj = (data(li) ≥ Lik ∧ data(li) ≤ Uik).
Step 15. If (j = 1) then cn = cnj; else cn = cn ∧ cnj; increment j by 1.
Step 16. Write the rule for class k in if-then rule format, i.e., Rk = (if cn then Class = Ck).

Rule pruning:
Step 17. For each constructed rule Rk: compute the classification accuracy after removing each cnj; if this accuracy is ≥ the accuracy already obtained, then remove cnj from the rule.

Rule update:
Step 18. Classify the validation examples using the pruned rules.
Step 19. Find the minimum and maximum values of the classified and misclassified examples of each class for each attribute of the pruned network.
Step 20. If the accuracy increases after adopting the newly selected minimum and maximum values for the classified and misclassified examples of the classes, then update the existing min (L) and max (U) values of the corresponding cnj of the rules.
Step 21. Modify the lower and upper limits of cnj with the respective min and max values of the classified and misclassified examples while there is some improvement in accuracy.

4. An Illustrative Example

The Australian Credit Approval dataset taken from the UCI repository is used as an example to explain how classification rules are generated using the proposed RxNCM algorithm. It is a
mixed dataset of 690 patterns consisting of 6 numerical attributes, 8 categorical attributes and one class attribute. The class value is either positive (+) or negative (–), and the dataset consists of 307 positive and 383 negative patterns. 70% of the data (483 patterns) are used as the training set, 20% (138 patterns) as the validation set and 10% (69 patterns) as the test set. A single-hidden-layer ANN is trained on the training cases using the back propagation algorithm with a learning rate of 0.01. The optimal architecture consists of 14 input nodes, 25 hidden nodes and one output node; as before, the number of hidden nodes is selected by mean square error, and the architecture having the least mean square error is taken as the optimal one. 425 patterns of the training set are properly classified during training. The pruning of irrelevant attributes is done using these properly classified 425 patterns, based on the error obtained by removing each attribute from the network separately. If the error for an attribute satisfies the condition stated in the algorithm, the attribute is selected as insignificant. Two of the 14 attributes are found to be insignificant in the Australian Credit Approval dataset, so 12 attributes remain in the pruned network after removing the insignificant attributes from the trained network. Table 1 shows the accuracy of the ANN and the pruned ANN for the dataset.
Table 1. Normal and pruning accuracy of the Australian credit approval dataset on the test set.

Optimal Architecture: 14-25-1
Learning Iterations: 1000
Accuracy of the Trained ANN on Testing Patterns: 86.9565%
Accuracy of the Pruned ANN on Testing Patterns: 89.8551%
Table 2. Data ranges of the significant attributes using classified and misclassified data for the Australian credit approval dataset.

Significant Input Neuron (Attribute) | Number of Patterns (Class = P) | Range (Class = P) | Number of Patterns (Class = N) | Range (Class = N)
2 | 0 | — | 198 | [15.75, 49.58]
3 | 0 | — | 188 | [0, 19]
4 | 166 | [2, 2] | 0 | —
6 | 0 | — | 202 | [1, 7]
7 | 0 | — | 182 | [0, 13.5]
8 | 0 | — | 215 | [0, 1]
9 | 0 | — | 187 | [0, 1]
10 | 0 | — | 232 | [0, 12]
11 | 0 | — | 232 | [0, 1]
12 | 0 | — | 232 | [1, 3]
13 | 0 | — | 170 | [0, 2000]
14 | 0 | — | 204 | [1, 395]
Properly classified and misclassified data for each significant attribute in the respective classes of the pruned network are now selected to compute the data ranges. An attribute may or may not be necessary for classifying patterns in all the classes; i.e., not all the data ranges of an attribute in the different classes may be important. Therefore, the importance of each attribute in classifying a pattern in the respective class is determined according to the condition in Step 9 of the algorithm, and the data ranges of those attributes that are important in classifying patterns in a particular class are retained. Table 2 shows the data ranges of the significant attributes using classified and misclassified data. The initial rule is now constructed from the class N column of the data range matrix as follows:

if ((atr2 >= 15.75 && atr2 <= 49.58) && (atr3 >= 0 && atr3 <= 19) && (atr6 >= 1 && atr6 <= 7) && (atr7 >= 0 && atr7 <= 13.5) && (atr8 >= 0 && atr8 <= 1) && (atr9 >= 0 && atr9 <= 1) && (atr10 >= 0 && atr10 <= 12) && (atr11 >= 0 && atr11 <= 1) && (atr12 >= 1 && atr12 <= 3) && (atr13 >= 0 && atr13 <= 2000) && (atr14 >= 1 && atr14 <= 395)) then class = N
else class = P