Neural Networks 21 (2008) 1020–1028

Contents lists available at ScienceDirect

Neural Networks journal homepage: www.elsevier.com/locate/neunet

Greedy rule generation from discrete data and its use in neural network rule extraction

Koichi Odajima a, Yoichi Hayashi a,∗, Gong Tianxia b, Rudy Setiono c

a Department of Computer Science, Meiji University, Tama-ku, Kawasaki 214-8571, Japan
b Department of Computer Science, National University of Singapore, 3 Science Drive 2, Singapore 117543, Singapore
c Department of Information Systems, National University of Singapore, 3 Science Drive 2, Singapore 117543, Singapore

∗ Corresponding author. E-mail address: [email protected] (Y. Hayashi).

Article info

Article history: Received 8 March 2007; received in revised form 25 January 2008; accepted 25 January 2008.

Keywords: Neural networks; Rule generation; Discretization; Clustering; Classification

Abstract

This paper proposes a GRG (Greedy Rule Generation) algorithm, a new method for generating classification rules from a data set with discrete attributes. The algorithm is "greedy" in the sense that at every iteration, it searches for the best rule to generate. The criteria for the best rule include the number of samples and the size of subspaces that it covers, as well as the number of attributes in the rule. This method is employed for extracting rules from neural networks that have been trained and pruned for solving classification problems. The classification rules are extracted from the neural networks using the standard decompositional approach. Neural networks with one hidden layer are trained and the proposed GRG algorithm is applied to their discretized hidden unit activation values. Our experimental results show that neural network rule extraction with the GRG method produces rule sets that are accurate and concise. Application of GRG directly on three medical data sets with discrete attributes also demonstrates its effectiveness for rule generation.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

As more algorithms for extracting classification rules from trained artificial neural networks are developed, neural networks are becoming an attractive alternative to other machine learning methods when one is faced with a decision-making problem and an explanation of how each decision is made must be given. An advantage of neural networks is usually their higher predictive accuracy, and with a rule extraction algorithm, the drawback of a "black box" neural network prediction can be overcome. It is not surprising, therefore, that a large amount of effort has been devoted to the development of algorithms that extract logical rules from neural networks (Castro, Mantas, & Benitez, 2002; Duch, Adamczak, & Grabczekski, 2001; Fu, 1994; Huang & Xing, 2002; Krishnan, Sivakumar, & Bhattacharya, 1999; Setiono, Leow, & Zurada, 2002; Tsukimoto, 2000; Zhou, Jiang, & Chen, 2003). These algorithms extract crisp and fuzzy logical rules for classification, as well as rules for predicting continuous target values. Applications of these algorithms can be found in a variety of diverse areas such as breast cancer diagnosis (Setiono, 2000), lymphoma diagnosis (Bologna, 2003), bioinformatics (Browne, Hudson, Whitely, Ford, & Picton, 2004),


loan application (Baesens, Setiono, Mues, & Vanthienen, 2003), e-business management (Elalfi, Haque, & Elalami, 2004), radar target identification (Remm & Alexandre, 2002), and modelling coastline suitability for breeding of fur seals (Bradshaw, Davis, Purvis, Zhou, & Benwell, 2002).

This paper presents a rule generation algorithm for discrete data that can be incorporated into a neural network rule extraction method. The rule extraction method is similar to our previous methods NeuroRule and NeuroLinear (Setiono & Liu, 1996, 1997a), which fall under the category of decompositional techniques (Tickle, Andrews, Golea, & Diederich, 1998). A decompositional technique extracts the rules by analyzing hidden unit activation values of the neural network. This is in contrast to pedagogical methods, which make use only of the input–output mapping of the network, without explicitly analyzing the hidden unit activations. NeuroRule and NeuroLinear first generate a set of classification rules in terms of the discretized activation values. The resulting rule conditions, which involve the discretized activation values, are subsequently substituted by the original data attributes, and classification rules that explain the network outputs in terms of the input data are thus obtained. In order to obtain a concise set of rules, it is therefore very important to have an effective algorithm that generates rules from the discretized hidden unit activation values. In NeuroLinear, the algorithm X2R (Liu & Tan, 1995) was applied to generate such rules.

The main contribution of this paper is the presentation of a greedy rule generating (GRG) algorithm that works on data sets


with discrete attributes. The algorithm is "greedy" because at each iteration, it generates the best classification rule in terms of the number of training samples classified, the number of attributes involved in the rule conditions, and the size of the input subspace covered. As samples that are covered by existing rules are removed from the data set when the next best rule is being generated, the resulting rules are ordered. The algorithm starts with one rule for every input subspace defined by a combination of the input attribute values. Rules that correspond to all possible combinations of attribute values form the initial set of rules. Some of the rules in this set can be merged, either because they classify data samples into the same category or because no training data samples satisfy their rule conditions. The merged rules are added to the initial rule set, and the best rule is selected from within this enlarged set. When GRG is applied to the discretized hidden unit activation values of pruned neural networks, the number of rules generated is much smaller than the number of rules generated by NeuroLinear and by the decision tree method C4.5 (Quinlan, 1993). With similar accuracy rates, one would usually prefer the most concise set of rules, as it is easier to comprehend and explain.

The outline of this paper is as follows. In the next section we briefly describe the neural network rule extraction method. We then give the details of GRG for rule generation from data sets with discrete attributes. We also give a detailed example of how this algorithm works using the well-known Iris data set. In Section 3, we present and compare the results of our neural network rule extraction algorithm with C4.5 and NeuroLinear. Four publicly available data sets have been used in the experiments. An artificially generated problem with training data samples having two variables is also presented in this section to illustrate how classification rules are extracted using neural networks. The results from applying GRG directly on three medical data sets are also presented and compared with the results from existing rule generating methods in this section. Section 4 discusses the issue of scalability of the GRG algorithm. Finally, in Section 5 we conclude the paper.

2. Neural network rule extraction

In this section, we describe how classification rules can be extracted from a pruned network. The steps are the same as those of NeuroLinear (Setiono & Liu, 1997a):
1. Cluster the hidden unit activation values.
2. Generate rules to explain the network's outputs using the clustered activation values.
3. Replace the conditions of the rules generated in Step 2 by the equivalent conditions involving the original input data.

In Step 1, the clustering of the hidden unit activation values is achieved using the Chi2 algorithm (Liu & Setiono, 1995). Once the values have been clustered, classification rules that explain the network outputs are generated by X2R (Liu & Tan, 1995) in Step 2 of the method. When Step 3 is completed, a set of classification rules where each rule condition corresponds to a hyperplane dividing the input space is obtained. NeuroLinear and the current algorithm differ in the way the classification rules involving the clustered activation values are generated. A new rule generating algorithm that takes as input discrete-valued attributes has been developed to replace X2R.
This new rule generating algorithm produces more concise sets of rules for the data sets that we have used in our experiments. Since only the hidden unit activation values of training data samples that have been correctly classified by the neural network are clustered in Step 1, and X2R is then used to generate rules with perfect accuracy using the clustered values in Step 2, all training data samples that are correctly classified by the network will also be correctly


predicted by the rules. Hence, in general the extracted rules can be expected to have high fidelity, that is, they mimic the behaviour of the network very closely even on new, unseen data.

Our proposed GRG (Greedy Rule Generation) algorithm is basically a sequential covering algorithm (Mitchell, 1997). A sequential covering algorithm starts with an empty rule set; it then goes through the data set to generate one "best" rule to cover a subset of the data, removes the covered subset, and iterates this process until no rule that meets a performance threshold can be generated. Different criteria for identifying the "best" rule during the rule generation process give rise to a variety of approaches that can be found in the literature. In his seminal paper, Rivest (1987) proposes the algorithm k-DL, which searches for conjunctive rules of length at most k that are consistent with the data. In the AQ15 algorithm developed by Michalski, Mozetic, Hong, and Lavrac (1986), rules that assign data samples to a class are built starting from a "seed" sample selected for each class. A set of most general rules called a "star", which covers this sample and no other samples that belong to a different class, is generated. Several criteria based on the precision of the rules are applied to evaluate these rules, and the best rule is selected accordingly. CN2 (Clark & Niblett, 1989) is a sequential covering algorithm which combines features of the AQ family of algorithms with decision tree pruning. The main idea is to remove the dependence of the rule generation process on specific data samples, so that statistical measures such as the information-theoretic entropy measure and the likelihood ratio statistic can be applied to determine the quality of the rules. The GRG algorithm we are proposing here is closely related to these existing sequential covering algorithms. The difference lies in the initial rule selection and the generation of the covering rules. We present the GRG algorithm in the following subsection.

2.1. Greedy rule generation (GRG) algorithm

Before we outline this rule generating algorithm, we need to define the following notation.

1. A discrete attribute Aj can have Nj possible values: Aj(1), Aj(2), ..., Aj(Nj).
2. A rule R with J conditions has the following form: R = (RC1, RC2, ..., RCJ ⇒ yi), where

• RCj is the rule condition involving one or more values of attribute Aj. It has the following disjunctive form: if Aj is equal to Aj(j1), or Aj is equal to Aj(j2), ..., or Aj is equal to Aj(jm), where jℓ ∈ {1, 2, ..., Nj}, ℓ = 1, 2, ..., m.
• yi is the rule conclusion for the samples that satisfy the condition of rule R, i.e. the samples are classified as members of class i.
3. nj is the number of disjunctions in rule condition RCj, 0 ≤ nj ≤ Nj. An attribute Aj is considered irrelevant for this rule if nj = Nj.
4. Mergeable rules. Consider two rules (R1C1, R1C2, ..., R1CJ) ⇒ y1 and (R2C1, R2C2, ..., R2CJ) ⇒ y2. If the following two conditions:
• Condition 1. y1 = y2, or y1 = "unlabeled", or y2 = "unlabeled", and
• Condition 2. there exists exactly one index î such that R1Cî ≠ R2Cî
are satisfied, then the two rules are mergeable. The new rule is R′ = (R′C1, R′C2, ..., R′Cî, ..., R′CJ) ⇒ ŷ, where
• R′Cî = R1Cî ∪ R2Cî, and
• the rule conclusion ŷ is

\[
\hat{y} =
\begin{cases}
y_1 & \text{if } y_1 \neq \text{``unlabeled''},\\
y_2 & \text{if } y_2 \neq \text{``unlabeled''},\\
\text{``unlabeled''} & \text{otherwise.}
\end{cases}
\]

If the attribute Aî is an ordinal discrete variable, where Aî(1) < Aî(2) < ··· < Aî(Nî), then in addition to the above two conditions we also impose:

• Condition 3. The values of this attribute in the two rule conditions R1Cî and R2Cî must be consecutive, that is, R1Cî involves the value Aî(k) and R2Cî involves the value Aî(k+1), for some value of k. For the condition of the merged rule R′, these two attribute values are merged into a single value as well, namely Aî(k) ∪ Aî(k+1).
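To make the merge test concrete, the following is a minimal Python sketch of Conditions 1–3. It is not the authors' implementation: the rule representation, the `ordinal` flag and the function names are our own, and Condition 3 is read here as requiring the union of the two value sets to remain a consecutive run of ordinal values.

```python
# A rule is represented here as (conditions, label): conditions[j] is the set of
# values of attribute A_j allowed by rule condition RC_j, and label is a class
# index or None (the "unlabeled" conclusion).  ordinal[j] is True when A_j is
# ordinal and its values are encoded as consecutive integers.

def mergeable(rule1, rule2, ordinal):
    """Return the index of the single differing condition if the two rules
    satisfy Conditions 1-3, otherwise None."""
    (cond1, y1), (cond2, y2) = rule1, rule2
    # Condition 1: same conclusion, or at least one rule is "unlabeled".
    if not (y1 == y2 or y1 is None or y2 is None):
        return None
    # Condition 2: the rule conditions differ in exactly one attribute.
    diff = [j for j, (c1, c2) in enumerate(zip(cond1, cond2)) if c1 != c2]
    if len(diff) != 1:
        return None
    j = diff[0]
    # Condition 3 (ordinal attributes): the merged values must form a
    # consecutive run, e.g. {small, medium} but never {small, large}.
    if ordinal[j]:
        merged = sorted(cond1[j] | cond2[j])
        if merged != list(range(merged[0], merged[-1] + 1)):
            return None
    return j

def merge(rule1, rule2, j):
    """Build the merged rule R' from two mergeable rules."""
    (cond1, y1), (cond2, y2) = rule1, rule2
    new_cond = list(cond1)
    new_cond[j] = cond1[j] | cond2[j]        # R'C_j = R1C_j ∪ R2C_j
    label = y1 if y1 is not None else y2     # keep the non-"unlabeled" class
    return tuple(new_cond), label
```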

We are now ready to state our rule generating algorithm.

Algorithm GRG (Greedy Rule Generation)

Input: Labeled data with discrete-valued attributes. Let J be the number of attributes, and let Nj be the number of discrete values of attribute Aj, j = 1, 2, ..., J.
Output: Ordered classification rules.

1. Initialization.
(a) Decompose the input space into
\[
S = \prod_{j=1}^{J} N_j
\]
subspaces.
(b) For each of these subspaces Sp, p = 1, 2, ..., S:
   i. Generate the rule Rp with J conditions (RpC1, RpC2, ..., RpCJ), where RpCj involves exactly one value of attribute Aj.
   ii. Count the frequency of samples of class i in subspace Sp. Denote this frequency as Fp,i and let Fp,î = max_i Fp,i.
   iii. If Fp,î > 0, then let the conclusion of rule Rp be yî. Otherwise, denote the class label for this subspace as "unlabeled".
(c) Set the optimized rule set R = ∅.
2. Rule generation.
(a) Let p̂ and î be such that Fp̂,î = max_{p,i} Fp,i, and let R be the rule for the samples in subspace Sp̂. If Fp̂,î = 0, stop.
(b) Generate a list of rules consisting of
   i. R, and
   ii. all other rules R′ such that y′ = yî or y′ = "unlabeled".
(c) For all mergeable pairs of rules in the list generated in (b), add the merged rule to the list and compute the class frequency of the samples covered by the new rule accordingly. Repeat this step until no new rule can be added to the list by merging.¹
(d) Among all the rules in the list, select the best rule R∗ using the following conditions (in decreasing order of importance):
   i. it must include R,
   ii. it covers the maximum number of samples with the correct label,
   iii. it has the highest number of irrelevant attributes,
   iv. it covers the largest subspace of the input.
(e) Let R := R ∪ R∗.
(f) Set the class label of all samples in the subspaces covered by rule R∗ to "unlabeled" and their corresponding frequencies to 0.
(g) Repeat from Step 2(a).

¹ This occurs when the rule resulting from any merging is already on the list.
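As an illustration of Step 2(d), here is a minimal Python sketch of the rule-selection criteria. It is not the authors' implementation: the rule representation follows the merge sketch above, the `counts` structure, `n_values` and `target` names are assumptions, and the surrounding loop of Steps 2(a)–(g), which builds the candidate list by repeated merging and relabels covered samples, is omitted.

```python
from math import prod

# counts maps each elementary subspace (a tuple of attribute values) to a
# dictionary of class frequencies; target is the class y chosen in Step 2(a).

def includes(rule, seed):
    """The candidate rule must cover at least the seed subspace from Step 2(a)."""
    return all(s <= c for c, s in zip(rule[0], seed[0]))

def coverage(rule, counts, target):
    """Number of training samples of the target class inside the rule's subspace."""
    return sum(freq.get(target, 0) for cell, freq in counts.items()
               if all(cell[j] in vals for j, vals in enumerate(rule[0])))

def select_best(candidates, seed, counts, n_values, target):
    """Step 2(d): keep only rules that include the seed, then rank them by
    (i) samples covered, (ii) number of irrelevant attributes, (iii) subspace size."""
    admissible = [r for r in candidates if includes(r, seed)]
    def key(rule):
        irrelevant = sum(len(vals) == n_values[j] for j, vals in enumerate(rule[0]))
        size = prod(len(vals) for vals in rule[0])
        return (coverage(rule, counts, target), irrelevant, size)
    return max(admissible, key=key)
```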

In Step 2(d) above, the best rule is required to include R because the rule being generated must classify the samples in the subspace with the maximum frequency as determined in Step 2(a). Since the initial list contains all the rules that cover samples of the same class or "unlabeled" samples, it is possible that some of the merged rules generated in Step 2(c) no longer cover the samples classified by R. Such rules are excluded by the first condition, namely that the selected rule must include R. The second condition is that the rule must cover the highest number of training data samples; since we are generating ordered rules, rules that cover more samples are given higher importance. The third condition is to select a rule with the highest number of irrelevant attributes, so that simpler rules involving fewer attributes are generated. Finally, if there is a tie among the rules that satisfy all three conditions, the rule that covers the largest input subspace is added to the optimized rule set. The following example illustrates the details of the algorithm.

2.2. Illustrative example

The Iris data set (Fisher, 1936) is one of the most frequently used data sets to illustrate the effectiveness of a classification algorithm. It consists of 150 samples, each described by four attributes: sepal length, sepal width, petal length, and petal width. A sample belongs to one of the three varieties of iris flowers: setosa, versicolor, or virginica. For this illustration, suppose we have identified the attributes sepal width and sepal length to be irrelevant, and that the attributes petal length and petal width have been discretized into three subintervals as follows:

• Petal length:
  – small: if petal length ∈ [0.00, 2.00)
  – medium: if petal length ∈ [2.00, 4.93)
  – large: if petal length ∈ [4.93, 6.90]
• Petal width:
  – small: if petal width ∈ [0.00, 0.60)
  – medium: if petal width ∈ [0.60, 1.70)
  – large: if petal width ∈ [1.70, 2.50].
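For concreteness, the interval boundaries above translate directly into a small lookup function; this is an illustrative sketch only, and the function name and example values are ours.

```python
def discretize_petal(petal_length, petal_width):
    """Map the two continuous petal measurements to the discrete labels above."""
    if petal_length < 2.00:
        length = "small"
    elif petal_length < 4.93:
        length = "medium"
    else:                         # 4.93 <= petal length <= 6.90
        length = "large"
    if petal_width < 0.60:
        width = "small"
    elif petal_width < 1.70:
        width = "medium"
    else:                         # 1.70 <= petal width <= 2.50
        width = "large"
    return length, width

print(discretize_petal(1.4, 0.2))   # ('small', 'small'), a typical setosa sample
```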

The 150 samples in the data set can be divided into nine groups according to their petal length and petal width (see Table 1).

Table 1
The distribution of Iris samples according to their petal length and petal width

Petal length   Petal width
               Small                   Medium                  Large
Small          (49, 0, 0) Setosa       (1, 0, 0) Setosa        (0, 0, 0) "Unlabeled"
Medium         (0, 0, 0) "Unlabeled"   (0, 47, 0) Versicolor   (0, 1, 6) Virginica
Large          (0, 0, 0) "Unlabeled"   (0, 1, 4) Virginica     (0, 1, 40) Virginica

The numbers correspond to the frequencies of setosa, versicolor, and virginica, respectively. Step 1(b) of the GRG algorithm classifies the samples in each group according to the highest count.

After Step 1 of the algorithm, we have six of the nine subspaces of the input attributes containing labeled samples. The remaining three subspaces remain unlabeled, as there is no training data sample in these subspaces. Step 2(a) of the algorithm will generate a rule that identifies the samples in the subspace with the maximum number of samples that belong to a single class. In this case, the rule R = (petal length = small, petal width = small ⇒ setosa) is generated. After Step 2(b), the following set of rules is generated (we re-number the rules as R0, R1, R2, ... for simplicity of notation):

• R0 = (petal length = small, petal width = small ⇒ setosa)
• R1 = (petal length = small, petal width = medium ⇒ setosa)
• R2 = (petal length = small, petal width = large ⇒ "unlabeled")
• R3 = (petal length = medium, petal width = small ⇒ "unlabeled")
• R4 = (petal length = large, petal width = small ⇒ "unlabeled").

The following pairs of rules can be merged: (R0, R1), (R1, R2), (R0, R3), and (R3, R4).²

² Since the discrete attributes have ordinal values, we merge two rules only if they differ in exactly one condition and the two discrete values in that condition are consecutive, for example, "small" and "medium" or "medium" and "large", but not "small" and "large".

We now have the following additional rules:

• R5 := R0 ∪ R1 = (petal length = small, petal width = small or medium ⇒ setosa)
• R6 := R1 ∪ R2 = (petal length = small, petal width = medium or large ⇒ setosa)
• R7 := R0 ∪ R3 = (petal length = small or medium, petal width = small ⇒ setosa)
• R8 := R3 ∪ R4 = (petal length = medium or large, petal width = small ⇒ "unlabeled").

Finally, we can merge rules R0 with R6 and R0 with R8:

• R9 := R0 ∪ R6 = (petal length = small, petal width = small or medium or large ⇒ setosa)
• R10 := R0 ∪ R8 = (petal length = small or medium or large, petal width = small ⇒ setosa).

Among the above 11 rules, one is selected as the best rule R∗. The first condition is that R∗ must include R; this condition excludes R1, R2, R3, R4, R6 and R8. With the second condition, that R∗ must have the maximum frequency count, we select R9, as a total of 50 samples are covered by this rule compared to the 49 samples covered by rule R10. At this point the rule set R = R9. Step 2(f) of the algorithm sets the labels of all the samples covered by rule R9 to "unlabeled", resetting the counts of the number of samples in the corresponding subspaces to 0. This facilitates the rule generation for the remaining samples. After one iteration of the algorithm, we now have one rule, and the sample data set is as summarized in Table 2.

Table 2
After one iteration of the algorithm, the first generated rule classifies all samples with petal length = small as setosa

Petal length   Petal width
               Small                   Medium                  Large
Small          (0, 0, 0) "Unlabeled"   (0, 0, 0) "Unlabeled"   (0, 0, 0) "Unlabeled"
Medium         (0, 0, 0) "Unlabeled"   (0, 47, 0) Versicolor   (0, 1, 6) Virginica
Large          (0, 0, 0) "Unlabeled"   (0, 1, 4) Virginica     (0, 1, 40) Virginica

The frequency counts in the covered subspaces are set to 0 and the labels are set to "unlabeled".

We can see from Table 2 that the highest frequency count is now 47, for the versicolor samples with medium petal length and medium petal width. Hence, for the new iteration, R = (petal length = medium, petal width = medium ⇒ versicolor). The following rules are generated:

• R0 = (petal length = medium, petal width = medium ⇒ versicolor)

• R1 = (petal length = small, petal width = small ⇒ "unlabeled")
• R2 = (petal length = small, petal width = medium ⇒ "unlabeled")
• R3 = (petal length = small, petal width = large ⇒ "unlabeled")
• R4 = (petal length = medium, petal width = small ⇒ "unlabeled")

• R5 = (petal length = large, petal width = small ⇒ "unlabeled").

The results of merging the above rules are the following:

• R6 := R0 ∪ R2 = (petal length = small or medium, petal width = medium ⇒ versicolor)
• R7 := R0 ∪ R4 = (petal length = medium, petal width = small or medium ⇒ versicolor)
• R8 := R1 ∪ R2 = (petal length = small, petal width = small or medium ⇒ "unlabeled")
• R9 := R2 ∪ R3 = (petal length = small, petal width = medium or large ⇒ "unlabeled")
• R10 := R1 ∪ R4 = (petal length = small or medium, petal width = small ⇒ "unlabeled")
• R11 := R4 ∪ R5 = (petal length = medium or large, petal width = small ⇒ "unlabeled")
• R12 := R6 ∪ R10 = (petal length = small or medium, petal width = small or medium ⇒ versicolor)
• R13 := R3 ∪ R8 = (petal length = small, petal width = small or medium or large ⇒ "unlabeled")
• R14 := R5 ∪ R10 = (petal length = small or medium or large, petal width = small ⇒ "unlabeled").

The best rule is R12, and it is included in the optimized rule set R.

After resetting the relevant class labels and frequency counts, the data samples that still need to be considered are as depicted in Table 3.

Table 3
After two iterations, all but three samples that remain are from class virginica

Petal length   Petal width
               Small                   Medium                  Large
Small          (0, 0, 0) "Unlabeled"   (0, 0, 0) "Unlabeled"   (0, 0, 0) "Unlabeled"
Medium         (0, 0, 0) "Unlabeled"   (0, 0, 0) "Unlabeled"   (0, 1, 6) Virginica
Large          (0, 0, 0) "Unlabeled"   (0, 1, 4) Virginica     (0, 1, 40) Virginica

In the last iteration of the algorithm, samples with large petal length and large petal width will be classified as virginica. Rule generation and merging will result in all nine groups being merged into one. This can easily be concluded from Table 3, where all groups are labeled either as virginica or as "unlabeled". Hence, a default rule is obtained; it classifies all samples that do not satisfy the conditions of the two previously generated rules as virginica. When the algorithm GRG terminates, the following rule set that classifies the various iris flowers is obtained:

IF petal length = small, then setosa
ELSE IF petal length = small or medium, petal width = small or medium, then versicolor
ELSE virginica

or equivalently,

IF petal length < 2.0, then setosa
ELSE IF petal length < 4.93, petal width < 1.70, then versicolor
ELSE virginica.

This set of rules correctly classifies all but 3 of the 150 samples in the data set. Many other researchers have also used the Iris data set to demonstrate the effectiveness of their rule extraction algorithms. In addition to Ishikawa (2000), results from experiments with this data set can also be found in Fukumi and Akamatsu (1999). Both Ishikawa (2000) and Fukumi and Akamatsu (1999) were able to generate rules with fewer errors (one and zero, respectively). However, we would like to emphasize here that the purpose of our illustration is only to show in detail how GRG generates rules for the Iris data set. The three misclassification errors were already present in the data as summarized in Table 1, due to the Chi2 discretization applied to the original continuous attributes of the data. The rules generated by GRG also misclassify these same three samples and correctly classify all other samples.
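Because the extracted rules are ordered, they can be applied exactly as a decision list; the sketch below is a direct transcription of the second (equivalent) rule set above, with each ELSE branch reached only when the earlier conditions fail.

```python
def classify_iris(petal_length, petal_width):
    if petal_length < 2.00:
        return "setosa"
    if petal_length < 4.93 and petal_width < 1.70:
        return "versicolor"
    return "virginica"           # default rule
```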


3. Experimental results

We present in this section the results of our experiments from incorporating the algorithm GRG into NeuroLinear to extract classification rules from neural networks. We begin by presenting an experiment using artificially generated data samples in two-dimensional space. We then present the results from the experiments on real-world data sets using NeuroLinear with GRG. Finally, we present the results from the application of GRG on three medical data sets that only have discrete attributes.

3.1. Illustrative example

We show how the rule extraction algorithm from neural networks with GRG works using an artificially generated data set. The samples in this set have two attribute values x1 and x2, each of which is randomly generated in the interval [0, 1] with uniform distribution. The class label for a sample is yi ∈ {0, 1, 2} and it is determined according to the following set of rules:

IF x1 ≤ 0.25, then y = 0,
ELSE IF x1 ≤ 0.75 and x2 > 0.75, then y = 0,
ELSE IF x1 − x2 > 0.25, then y = 1,
ELSE y = 2.

The training and test data sets each consist of 1000 samples. Five percent of the training data samples are perturbed randomly to introduce noise in the class labels. No noise is introduced in the test data set. The training samples are depicted in Fig. 1.

Fig. 1. Randomly generated two-dimensional data samples.
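For reproducibility of the setup (though not of the exact samples), a data set of this kind can be generated along the following lines. The exact noise mechanism is not specified in the text, so flipping the label of a randomly chosen 5% of the training samples to a random class is an assumption, as are the seed and the function names.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed for this sketch

def true_label(x1, x2):
    """The labelling rules quoted above."""
    if x1 <= 0.25:
        return 0
    if x1 <= 0.75 and x2 > 0.75:
        return 0
    if x1 - x2 > 0.25:
        return 1
    return 2

def make_set(n, noise=0.0):
    x = rng.uniform(0.0, 1.0, size=(n, 2))
    y = np.array([true_label(a, b) for a, b in x])
    flip = rng.random(n) < noise          # perturb a fraction of the labels
    y[flip] = rng.integers(0, 3, size=int(flip.sum()))
    return x, y

x_train, y_train = make_set(1000, noise=0.05)
x_test, y_test = make_set(1000)           # the test set is noise-free
```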

The target outputs for the three classes are coded as (1, 0, 0), (0, 1, 0), and (0, 0, 1) for yi = 0, 1, 2, respectively. A neural network with 3 input units, 8 hidden units and 3 output units is trained. The third input unit has a fixed input value of +1 to model the bias in the hidden units. The error function that is minimized during training and pruning is the standard sum of squared errors, augmented with a penalty term:

\[
\frac{1}{2} \sum_{i} \lVert y_i - \tilde{y}_i \rVert^2
+ \epsilon_1 \left( \sum_{m=1}^{h} \sum_{\ell=1}^{n} \frac{\beta (w^m_\ell)^2}{1 + \beta (w^m_\ell)^2}
+ \sum_{m=1}^{h} \sum_{p=1}^{o} \frac{\beta (v^m_p)^2}{1 + \beta (v^m_p)^2} \right)
+ \epsilon_2 \left( \sum_{m=1}^{h} \sum_{\ell=1}^{n} (w^m_\ell)^2
+ \sum_{m=1}^{h} \sum_{p=1}^{o} (v^m_p)^2 \right). \tag{1}
\]
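For illustration, Eq. (1) can be evaluated along the following lines; the array names, shapes and the function itself are ours, not the authors' training code, and the parameter defaults follow the values given in the next paragraph.

```python
import numpy as np

def augmented_error(Y, Y_pred, W, V, beta=10.0, eps1=0.1, eps2=1e-5):
    """Sum-of-squares error plus the penalty term of Eq. (1).
    W holds the input-to-hidden weights w^m_l and V the hidden-to-output
    weights v^m_p; the penalty is applied elementwise, so the exact shapes
    do not matter."""
    sse = 0.5 * np.sum((Y - Y_pred) ** 2)
    shrink = np.sum(beta * W**2 / (1.0 + beta * W**2)) \
           + np.sum(beta * V**2 / (1.0 + beta * V**2))
    decay = np.sum(W**2) + np.sum(V**2)
    return sse + eps1 * shrink + eps2 * decay
```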

The error in prediction is computed as the difference between the actual target output yi for the ith input sample and its predicted value ỹi given by the neural network. The activation values at the hidden units are computed using the hyperbolic tangent function tanh(x). The variables in the above penalty term are as follows: w^m_ℓ is the weight of the connection from input unit ℓ to hidden unit m, v^m_p is the weight of the connection from hidden unit m to output unit p, h is the number of hidden units, n is the number of input units, and o is the number of output units. The parameters β, ε1 and ε2 are positive penalty parameters; their values are set equal to 10, 0.1, and 10⁻⁵, respectively. The hyperbolic tangent function is applied as the activation function at the hidden units, while the sigmoid function is used as the activation function at the output units. Connections in the network are removed as long as the network's accuracy rate on the training set remains at least 94.0%. The justification for the use of the penalty function (1) and the details of the network pruning algorithm can be found in our earlier paper (Setiono, 1995).

The accuracy rates of the pruned network on the training and test data sets are 94.7% and 98.0%, respectively. It has 4 hidden units and 13 connections left (Fig. 2). The third output unit is not connected to any hidden unit; hence, its activation value is always 0.5 for all input samples. The activation values of the correctly predicted training samples at the four hidden units are discretized using the Chi2 algorithm (Liu & Setiono, 1995). The discretization results in the activation values being divided into subintervals, as shown in Table 4.
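Chi2 itself automatically adjusts the significance level and monitors the inconsistency rate of the data; the simplified, ChiMerge-style pass below only shows the core idea of merging adjacent intervals whose class distributions are not significantly different (all names and the fixed threshold are assumptions, not the Chi2 implementation).

```python
import numpy as np

def chi2_stat(a, b):
    """Chi-square statistic for two adjacent intervals, given per-class counts."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    col = a + b
    total = col.sum()
    if total == 0:
        return 0.0
    exp_a = a.sum() * col / total
    exp_b = b.sum() * col / total
    exp_a = np.where(exp_a == 0, 1e-9, exp_a)     # guard against empty classes
    exp_b = np.where(exp_b == 0, 1e-9, exp_b)
    return float(np.sum((a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b))

def merge_intervals(boundaries, counts, threshold):
    """Repeatedly merge the adjacent pair of intervals with the smallest
    chi-square value until every remaining pair exceeds the threshold.
    boundaries has one more entry than counts; counts[i] is the per-class
    count vector of interval [boundaries[i], boundaries[i+1])."""
    boundaries = list(boundaries)
    counts = [np.asarray(c, float) for c in counts]
    while len(counts) > 1:
        stats = [chi2_stat(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]
        i = int(np.argmin(stats))
        if stats[i] > threshold:
            break
        counts[i] = counts[i] + counts.pop(i + 1)  # merge the two intervals
        boundaries.pop(i + 1)                      # drop the shared boundary
    return boundaries, counts
```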


Notice that the activation values at hidden unit H3 are merged into a single interval; this implies that these values are not used during rule generation. The following classification rules, with hidden unit activation values in the rule conditions, are generated by our rule generation algorithm GRG:

IF H4 < −0.0097, then y = 1,
ELSE IF H1 ≥ 0.0132 and H2 ≥ −0.9980, then y = 2,
ELSE IF H2 < −0.4013, then y = 0,
ELSE y = 2.

Fig. 2. A pruned neural network for the illustrative example.

Table 4
Subintervals found by Chi2 for the pruned neural network of Fig. 2

      Subinterval            Number of samples
                             y = 0    y = 1    y = 2
H1:   [−1.0, 0.0132)         187      0        60
      [0.0132, 1.00]         200      300      200
H2:   [−1.0, −0.9980)        264      0        0
      [−0.9980, −0.4013)     123      129      176
      [−0.4013, 1.00]        0        171      84
H3:   [−1.0, 1.00]           387      300      260
H4:   [−1.00, −0.0097)       0        300      0
      [−0.0097, 1.0]         387      0        260

The activation values have been computed as the hyperbolic tangent of the weighted inputs. From the connection weights shown in Fig. 2, we conclude that the activation values are computed as follows:

H1 = tanh(−7.00 x2 + 5.29)
H2 = tanh(6.14 x1 − 4.98)
H4 = tanh(−43.85 x1 + 44.15 x2 + 10.73).

Hence, H4 < −0.0097 if and only if tanh(−43.85 x1 + 44.15 x2 + 10.73) < −0.0097, or equivalently x1 − x2 > 0.24. After analyzing the rule conditions involving the other hidden unit activation values in a similar fashion, we obtain the following set of rules in terms of the original input attributes of the data:

IF x1 − x2 > 0.24, then y = 1,
ELSE IF x2 ≤ 0.75 and x1 ≥ 0.25, then y = 2,
ELSE IF x1 < 0.74, then y = 0,
ELSE y = 2.
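The equivalence claimed here can be checked numerically: atanh is the inverse of the monotonically increasing hyperbolic tangent, so the threshold on H4 maps directly to a linear inequality on the inputs. The short verification sketch below is ours, not part of the original method.

```python
from math import atanh

t = atanh(-0.0097)              # tanh(z) < -0.0097  <=>  z < t, and t is about -0.0097
# -43.85*x1 + 44.15*x2 + 10.73 < t  <=>  43.85*x1 - 44.15*x2 > 10.73 - t
print((10.73 - t) / 43.85)      # about 0.2449
print(44.15 / 43.85)            # about 1.0068
# i.e. x1 - 1.007*x2 > 0.245, which the text rounds to x1 - x2 > 0.24
```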

The accuracy rates of the above rules are 95.9% and 99.5% on the training and test samples, respectively. These rules recover the decision boundaries of the problem accurately, but note how rules that cover larger areas of the input space appear before rules that cover smaller areas. Each of the first two rules covers approximately 28% of the input space. While samples with y = 0 cover approximately 37% of the input space, it is not possible to cover this area with one rule without first generating the two rules that eliminate the samples with y = 1 or y = 2 in the square [0.25, 1]².

3.2. Experiments using real world data sets

The neural network rule extraction with GRG is also tested on publicly available data sets in order to check its effectiveness. These are the Australian credit approval, Boston housing, Cleveland heart disease and Wisconsin breast cancer data sets, which can be obtained from the UCI website (Blake & Merz, 1998). The number of samples in each data set and the characteristics of their input attributes are shown in Table 5.

Table 5
The data sets used to test the performance of NeuroLinear with GRG

Data set                     # samples   # attributes
Australian credit approval   690         8 discrete, 6 continuous
Boston housing data          506         1 discrete, 12 continuous
Cleveland heart disease      297         5 discrete, 8 continuous
Wisconsin breast cancer      699         9 continuous

The following penalty parameters are used during network training to obtain our results: β = 10, ε1 = 0.1, ε2 = 10⁻⁵. The starting number of hidden units in the networks is 6, and the number of output units is 2. The number of input units is equal to the number of input attributes shown in Table 5. The results from our method (NeuroLinear + GRG), as well as those from NeuroLinear and C4.5, are summarized in Table 6. The numbers shown in this table are the averages obtained from ten-fold cross-validation runs.

From the figures in the table, we can confidently conclude that, while the accuracy rate of our method is similar to or better than those of C4.5 and NeuroLinear, the rule extraction method which employs GRG generates much smaller rule sets for the four problems tested. For the Wisconsin breast cancer data set, the average number of rules is 2.0. This indicates that for each of the ten fold-runs, one rule that classifies samples as belonging to one class (either benign or malignant) is generated; the other rule is the default rule for all samples that do not satisfy the conditions of the first rule. For the Cleveland heart disease data set, in 9 out of the 10 runs two rules, including a default rule, are generated. Finally, note that while the GRG rules are ordered rules, unlike the rules generated by C4.5, C4.5 generates more than three times and up to five times as many rules. Such a large number of rules is likely to decrease their overall comprehensibility.

3.3. Applying GRG to discrete data

For data sets with discrete attributes only, it is possible to apply GRG directly to generate classification rules without employing NeuroLinear. In order to investigate the performance of GRG on this type of data and to compare it against similar rule generating methods, we tested GRG on the same three medical data sets used by Clark and Niblett (1989). These are the lymphography, breast cancer and primary tumor data sets. The characteristics of these data sets are given in Table 7. For all three data sets, 70% of the available data samples were randomly selected for training and the remaining 30% for testing.


Table 6
Accuracy and number of rules generated by C4.5, NeuroLinear and NeuroLinear with GRG

Data set                     C4.5               NeuroLinear        NeuroLinear + GRG
                             Acc. (%)  # rules  Acc. (%)  # rules  Acc. (%)  # rules
Australian credit approval   84.36     9.30     83.64     6.60     86.40     2.80
Boston housing data          86.13     11.30    80.60     3.05     85.71     2.90
Cleveland heart disease      77.26     10.20    78.15     5.69     81.72     2.20
Wisconsin breast cancer      96.10     8.80     95.73     2.89     95.96     2.00

Table 7
Three medical data sets used to test GRG

Data set        # of samples   # of attributes   # classes
Lymphography    148            19                4
Breast cancer   286            9                 2
Primary tumor   339            17                21

In Table 8 we show the accuracy and the number of rules generated by GRG. The figures are the averages obtained from running the algorithm 10 times using different random splits of the available data into training and test sets. Given the relatively large number of attributes and their values, the number of initial subspaces generated in Step 1(a) of the algorithm would be overwhelming. In order to reduce this number, we modify the algorithm slightly by introducing attribute selection. First, the attributes are ranked according to their information gain (Quinlan, 1993). The top-ranked K attributes that do not generate more than the maximum allowable number of subspaces are then selected for rule generation. The two GRG results reported in the table for each data set are the results from imposing a maximum of 100 and 300 initial subspaces, respectively.
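A minimal sketch of this attribute-selection step is given below. The precise way ties and the budget are handled is not spelled out in the text, so stopping at the first attribute that would exceed the allowed number of subspaces is one plausible reading; the function names and data layout are ours.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """Gain from splitting the class labels on one discrete attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(column, labels):
        groups.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def select_attributes(rows, labels, max_subspaces):
    """Rank attributes by information gain and keep the top-ranked ones while
    the product of their numbers of distinct values stays within the budget."""
    J = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(J)]
    ranked = sorted(range(J), reverse=True,
                    key=lambda j: information_gain(columns[j], labels))
    chosen, size = [], 1
    for j in ranked:
        n_j = len(set(columns[j]))
        if size * n_j > max_subspaces:
            break
        chosen.append(j)
        size *= n_j
    return chosen
```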

More rules are generated when 300 initial subspaces are allowed compared to when just 100 subspaces are allowed, as expected. The accuracy rates of the former on both the training and test data samples are also higher. For comparison purposes, we show the results from Clark and Niblett's CN2 method and from AQR, their implementation of the AQ rule generating method. From this comparison, we can conclude that GRG generates rules that have higher predictive accuracy than CN2 and AQR. The accuracy of GRG on the training data is also higher than that of CN2. Of the three methods, AQR generates the most rules. The large difference between its training and test set accuracy is an indication that the AQR rules overfit the training samples. When the number of initial subspaces is limited to a maximum of 100, GRG generates similar numbers of rules to CN2. In terms of test set accuracy, however, GRG outperforms CN2 on all three data sets.

Table 8
Accuracy and complexity of three rule generating algorithms

Data set        CN2                        AQR                        GRG
                Acc. (%)         # rules   Acc. (%)         # rules   Acc. (%)          # rules
                Train    Test              Train    Test              Train    Test
Lymphography    91       82       12       100      76       76       87.06    83.78    11.7
                                                                      93.05    86.51    17.2
Breast cancer   72       71       4        100      72       208      76.62    75.53    5.9
                                                                      80.67    77.39    20.0
Primary tumor   37       36       19       75       35       562      43.92    41.20    20.0
                                                                      49.92    47.30    24.2

For each data set, the first GRG row corresponds to a maximum of 100 initial subspaces and the second to a maximum of 300.

4. Scalability of the GRG algorithm

While it is possible to apply the GRG algorithm directly to generate classification rules from data sets with discrete attribute values only, the potential application of the algorithm is severely restricted by the large number of search subspaces it generates. For data sets with a large number of attributes, it may be possible to limit the number of subspaces to, say, 100 or 300 by selecting only the useful input attributes, as we did in the experiments presented in the previous section. Relevant attributes were selected using the information gain measure; however, any feature selection algorithm could have been utilized. Among them, neural networks have been shown to be very effective in selecting relevant features for a wide variety of classification problems (Setiono & Liu, 1997b). It would be even better if relevant individual attribute values could be identified and selected. In summary, the following are steps that could be taken before applying GRG to data sets with many attributes and/or attribute values:

• use a feature selection algorithm to select the most relevant attributes;
• use a feature selection algorithm to select the most relevant attribute values;
• use a neural network for classification and apply GRG to the discretized hidden unit activations.

We note that GRG was designed originally as a component of decompositional neural network rule extraction algorithms such as NeuroLinear. Specifically, it is applied to generate rules that explain the network's outputs in terms of the discretized hidden unit activation values. The combination of neural network learning and the GRG rule generating capability can provide the user with an alternative and powerful tool for data mining. We illustrate this point using the widely used mushroom data set that is available from the UCI repository (Merz & Murphy, 1996). The data set consists of 8124 samples with 22 attributes. In order to select individual attribute values that are relevant, we binary-code the data such that there are 126 binary-valued attributes in total; each original attribute with N values is converted into N binary attributes.
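The binary coding described here is plain one-of-N coding; the short sketch below illustrates it (the attribute names and the toy example are ours).

```python
def binary_code(rows):
    """Convert each discrete attribute with N values into N binary attributes."""
    J = len(rows[0])
    values = [sorted({row[j] for row in rows}) for j in range(J)]
    names = [f"A{j}={v}" for j, vals in enumerate(values) for v in vals]
    coded = [[1 if row[j] == v else 0
              for j, vals in enumerate(values) for v in vals]
             for row in rows]
    return names, coded

names, coded = binary_code([("none", "narrow"), ("anise", "broad")])
print(names)   # ['A0=anise', 'A0=none', 'A1=broad', 'A1=narrow']
print(coded)   # [[0, 1, 0, 1], [1, 0, 1, 0]]
```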

Of the 8124 available samples, 1000 are randomly selected for neural network training. A neural network with one hidden unit correctly classifies all the training samples but misclassifies eight samples not in the training set. In order to generate rules that correctly classify all 8124 samples, we include these eight samples in the training set and train neural networks with 1008 training samples. Many neural networks with one hidden unit are found to correctly classify not only the 1008 training samples, but also all 8124 samples in the data set. As there is only one hidden unit in each network and the classification problem is binary (distinguishing edible from poisonous mushrooms), a simple rule of the form "if the activation value is less than a threshold, then edible; otherwise poisonous" is generated when GRG is applied to the discretized activation values.

Selection of the relevant attribute values is achieved by pruning the connections from the input units to the sole hidden unit in the network. This reduces the number of input units from 126 to as few as seven. With such a small number of relevant binary inputs, GRG can be applied again to generate rules that explain what kind of input combination results in a hidden unit activation that is less than the threshold. With different initial random weights for the neural networks, different pruned networks are obtained. From these pruned networks, some of the rules generated are as follows:

Rule Set 1. Edible mushroom:
R1: if (Odour = none) and (Gill-size ≠ narrow) and (Spore-print-colour ≠ green) OR
R2: if (Odour = anise) OR
R3: if (Odour = almond) OR
R4: if (Odour = none) and (Stalk-surface-above-ring ≠ silky) and (Spore-print-colour ≠ green) and (Population ≠ clustered).

Rule Set 2. Edible mushroom:
R1: if (Odour = none) and (Gill-size ≠ narrow) and (Spore-print-colour ≠ green) OR
R2: if (Odour = anise) OR
R3: if (Odour = almond) OR
R4: if (Bruises = no) and (Odour = none) and (Stalk-surface-below-ring ≠ scaly).

And below are two sets of rules that describe poisonous mushrooms:

Rule Set 3. Poisonous mushroom:
R1: if (Odour ≠ none) and (Odour ≠ anise) and (Odour ≠ almond) OR
R2: if (Spore-print-colour = green) OR
R3: if (Gill-size = narrow) and (Population = clustered) OR
R4: if (Gill-size = narrow) and (Stalk-surface-below-ring = scaly).

Rule Set 4. Poisonous mushroom:
R1: if (Odour ≠ none) and (Odour ≠ anise) and (Odour ≠ almond) OR
R2: if (Gill-size = narrow) and (Population = clustered) OR
R3: if (Gill-spacing = close) and (Gill-size = narrow) and (Spore-print-colour = white) OR
R4: if (Spore-print-colour = green).

Each of the above rule sets describes either an edible or a poisonous mushroom. Samples that satisfy none of the rule conditions are classified as belonging to the other class by default. We have generated such rules by slightly modifying the algorithm so that we can compare our results with those reported by Ishikawa (2000). Rule Set 3 is actually the negation of the rules obtained by Ishikawa for edible mushrooms. However, we may conclude that, in addition to this rule set, there are other sets consisting of four rules and involving only five of the 22 original data attributes which also achieve perfect classification of all 8124 samples in the mushroom data set. These rules can be discovered by our GRG algorithm as part of a neural network rule extraction algorithm.

5. Discussion and conclusion

We have proposed GRG, an algorithm for generating classification rules from data sets with discrete attributes. The effectiveness of this method was tested by employing it as a tool to extract logical rules from neural networks, as well as by testing it directly on data sets having only discrete attributes. Our experimental results on publicly available test data sets indicate that the algorithm can generate concise and accurate rule sets.

NeuroLinear extracts rules from neural networks by analyzing the activation values at the hidden units of a pruned network. These activation values, which range in the interval [−1, +1], are first clustered by the discretization method Chi2. Once they have been clustered, they can be used as input to our rule generating method GRG. From our experimental results, we discover that the number of rules generated by GRG for the four real-world data sets is much smaller than the number of rules generated by the widely used decision tree method C4.5 or by NeuroLinear with X2R.

The performance of GRG when applied directly to discrete data is compared with CN2 and AQR, two algorithms which belong to the family of sequential covering algorithms. For the three data sets from the medical domain that we tested, GRG generates rules that are more accurate in predicting the samples in the test set than CN2 and AQR. The number of rules generated by GRG is similar to that of CN2, but much smaller than that of AQR. Finally, for the mushroom data set, we discover several rule sets consisting of four disjunctive rules and involving just five attributes that are capable of distinguishing edible mushrooms from poisonous ones with perfect accuracy.

References

Baesens, B., Setiono, R., Mues, C., & Vanthienen, J. (2003). Using neural network rule extraction and decision tables for credit-risk evaluation. Management Science, 49(3), 312–329.
Blake, C. L., & Merz, C. J. (1998). UCI Repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. http://www.ics.uci.edu/~mlearn/MLRepository.html.
Bologna, G. (2003). A model for single and multiple knowledge based neural networks. Artificial Intelligence in Medicine, 28(2), 141–163.
Bradshaw, C. J. A., Davis, L. S., Purvis, M., Zhou, Q., & Benwell, G. L. (2002). Using artificial neural networks to model the suitability of coastline for breeding by New Zealand fur seals (Arctocephalus forsteri). Ecological Modeling, 148(2), 111–131.
Browne, A., Hudson, B. D., Whitely, D. C., Ford, M. G., & Picton, P. (2004). Biological data mining with neural networks: Implementation and application of a flexible decision tree extraction algorithm to genomic problem domains. Neurocomputing, 57, 275–293.
Castro, J. L., Mantas, C. J., & Benitez, J. (2002). Interpretation of artificial neural networks by means of fuzzy rules. IEEE Transactions on Neural Networks, 13(1), 101–116.
Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3(4), 261–283.
Duch, W., Adamczak, R., & Grabczekski, K. (2001). A new methodology of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks, 12(2), 277–306.
Elalfi, A. E., Haque, R., & Elalami, M. E. (2004). Extracting rules from trained neural network using GA for managing E-business. Applied Soft Computing, 4(1), 65–77.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Fu, L. M. (1994). Rule generation from neural networks. IEEE Transactions on Systems, Man and Cybernetics, 243(8), 173–182.
Fukumi, M., & Akamatsu, N. (1999). A new rule extraction method from neural networks. In Proceedings of IJCNN99 (pp. 1–5).
Huang, S. H., & Xing, H. (2002). Extract intelligible and fuzzy rules from neural networks. Fuzzy Sets and Systems, 132(2), 233–243.
Ishikawa, M. (2000). Rule extraction by successive regularization. Neural Networks, 13(10), 1171–1183.
Krishnan, R., Sivakumar, G., & Bhattacharya, P. (1999). A search technique for rule extraction from trained neural networks. Pattern Recognition Letters, 20(3), 273–280.
Liu, H., & Setiono, R. (1995). Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the 7th international conference on tools for artificial intelligence (pp. 388–391).
Liu, H., & Tan, S. T. (1995). X2R: A fast rule generator. In Proceedings of the 7th IEEE international conference on systems, man and cybernetics (pp. 631–635).
Merz, C. J., & Murphy, P. M. (1996). UCI Repository of machine learning databases. Irvine, CA: Department of Information and Computer Science, University of California. Available from: http://www.ics.uci.edu/mlearn/mlrepository.html.
Michalski, R. S., Mozetic, I., Hong, J., & Lavrac, N. (1986). The multi-purpose incremental learning system AQ15 and its testing application to three medical domains. In Proceedings of the 5th national conference on AI (pp. 1041–1045).
Mitchell, T. (1997). Machine learning. New York: The McGraw-Hill Companies Inc.
Quinlan, R. (1993). C4.5: Programs for machine learning. California: Morgan Kaufmann Publishers Inc.
Remm, J.-F., & Alexandre, F. (2002). Knowledge extraction using artificial neural networks: Application to radar identification. Signal Processing, 82(1), 117–120.


Rivest, R. L. (1987). Learning decision lists. Machine Learning, 2(3), 229–246.
Setiono, R. (1995). A penalty function approach for pruning feedforward neural networks. Neural Computation, 9(1), 185–204.
Setiono, R., & Liu, H. (1996). Symbolic representation of neural networks. IEEE Computer, 29(3), 71–77.
Setiono, R., & Liu, H. (1997a). NeuroLinear: From neural networks to oblique decision rules. Neurocomputing, 17(1), 1–24.
Setiono, R., & Liu, H. (1997b). Neural-network feature selector. IEEE Transactions on Neural Networks, 8(3), 654–662.
Setiono, R. (2000). Generating concise and accurate classification rules for breast cancer diagnosis. Artificial Intelligence in Medicine, 18(3), 205–219.

Setiono, R., Leow, W. K., & Zurada, J. (2002). Extraction of rules from artificial neural networks for nonlinear regression. IEEE Transactions on Neural Networks, 13(3), 564–577.
Tickle, A. B., Andrews, R., Golea, M., & Diederich, J. (1998). The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks, 9(6), 1057–1068.
Tsukimoto, H. (2000). Extracting rules from trained neural networks. IEEE Transactions on Neural Networks, 11(2), 377–389.
Zhou, Z.-H., Jiang, Y., & Chen, S.-F. (2003). Extracting symbolic rules from trained neural network ensembles. AI Communications, 16(1), 3–15.
