Refining Expert Knowledge with an Artificial Neural Network

Robert Andrews and Shlomo Geva
Neurocomputing Research Centre, Queensland University of Technology, GPO Box 2434, Brisbane, Australia 4001.

[email protected] [email protected]

In Progress in Connectionist-Based Information Systems: Proceedings of the International Conference on Neural Information Processing (ICONIP'97), Kasabov et al. (Eds.), Springer, 1997, pp. 847-850.

Abstract
This paper describes RULEIN/RULEX, an automated technique for the refinement of a knowledge base. RULEIN constructs a Rapid Backprop (RBP) network from an initial, partially complete/accurate knowledge base which is formulated as a set of propositional rules. The RBP network is then trained on a set of examples drawn from the problem domain. RULEX is then applied to the weights of the trained network to extract a set of 'refined' propositional rules. The refined rule set represents the original knowledge base modified in the light of network training. Network training has the potential to remove inaccuracies in the original rule base, supplement partially correct initial rules, and add new rules. Rule initialisation can also speed up network learning by obviating the necessity of starting training from a tabula rasa configuration. RULEIN/RULEX is evaluated using rule 'quality' criteria and results are presented for some benchmark problems. The method has application in many areas but is particularly suited to overcoming the so-called 'knowledge acquisition bottleneck' in the knowledge engineering phase of rule-based expert system construction.

1 Introduction
Rule refinement as described by Towell & Shavlik [1993] is a three-step process involving (i) initialisation of a Knowledge Based Artificial Neural Network (KBANN) with a nearly correct domain theory (knowledge base); (ii) network training with examples drawn from the problem domain; and (iii) extraction of a refined set of rules from the trained network. Whereas KBANN requires initialisation with a nearly correct domain theory, the RULEIN/RULEX algorithm described in this paper is capable of refining a weak domain theory that includes partially correct, partially complete and even inaccurate rules. The ability to refine a domain theory in this manner has the potential to overcome the 'knowledge acquisition bottleneck' in the knowledge engineering phase of rule-based expert system construction. Rule initialisation can also speed up network training by obviating the necessity of starting training from a tabula rasa configuration. The remainder of the paper is organised as follows: section 2 describes the Rapid Backprop (RBP) network; section 3 shows how RULEX decompiles the hidden units of the RBP network into propositional rules; section 4 describes the RULEIN algorithm for inserting rules into the RBP network; and sections 5 & 6 present results of applying RULEIN/RULEX to some benchmark problems.

2 The Rapid Backprop Network
Lapedes & Farber [1987] give a method for constructing locally responsive units using pairs of axis-parallel sigmoids, one pair per input dimension, such that the sum of the activations of each pair defines a response region in each input dimension. The RBP network was developed by Geva & Sitte [1992, 1994], who extended Lapedes & Farber's work by describing a parameterisation and training scheme for networks composed of such sigmoid-based hidden units; Andrews and Geva [1995] showed how these networks can be structured to facilitate rule extraction. The RBP network consists of an input layer, a hidden layer of local basis function units, and an output layer. In the ith dimension, the sigmoids of a local basis function unit are parameterised according to centre, c_i, breadth, b_i, and edge steepness, k_i. The pairs of sigmoids are given by the equations:

U_i^+ = \frac{1}{1 + e^{-(x_i - c_i + b_i)k_i}}   …(1)

U_i^- = \frac{1}{1 + e^{-(x_i - c_i - b_i)k_i}}   …(2)

where x is the input vector, c is the reference vector, b represents the effective widths of each ridge, and k represents the edge steepness of each ridge. To maintain the 'shape' of the ridges during training the edge steepness is related to the ridge width as:

k_i = \frac{K_0}{b_i}   …(3)

where K_0 is the initial ridge steepness value, set in the range [4..8].

[Fig. 1: a single axis-parallel ridge]

[Fig. 2: the intersection of two ridges, forming a local peak with secondary ridges]

The combination U_i^+ - U_i^- forms a ridge parallel to the axis in the ith dimension (see Figure 1 above). The intersection of N such ridges forms a local peak at the point of intersection, but with secondary ridges extending away to infinity on each side of the peak (see Figure 2 above). These ridges can be 'cut off' by the application of a suitable sigmoid to leave a locally responsive region (see Figure 3 below). The activation for this sigmoid is given by:

V = \frac{1}{1 + e^{-\left(\sum_i (U_i^+ - U_i^-) - B\right)K}}   …(4)

where B is set to the dimensionality of the input domain and K is set in the range [4..8].

[Fig. 3: the locally responsive region remaining after the secondary ridges are cut off]

The network output is given by:

O = \sum_{\mu=1}^{N} V_\mu w_\mu   …(5)

An incremental, constructive training algorithm is used, with training (for rule extraction) involving adjusting, by gradient descent, the centre, c_i, breadth, b_i, and edge steepness, k_i, parameters of the sigmoids that define the local response units. During training for classification problems the output weight, w, is held constant at a value such that the hidden units are prevented from 'overlapping', i.e., no more than one unit contributes appreciably to the network output. This measure facilitates rule extraction by allowing individual hidden units to be interpreted as rules in isolation from all other units.
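To make the unit equations concrete, the following is a minimal Python sketch of equations (1)-(5). It is our illustration, not the authors' implementation; the parameter values, and the choice K = K_0 = 6, are assumptions for the example.

```python
import math

def ridge(x_i, c_i, b_i, k_i):
    """Equations (1) and (2): axis-parallel sigmoid pair U+ - U-."""
    u_plus = 1.0 / (1.0 + math.exp(-(x_i - c_i + b_i) * k_i))
    u_minus = 1.0 / (1.0 + math.exp(-(x_i - c_i - b_i) * k_i))
    return u_plus - u_minus

def local_unit(x, centres, breadths, K0=6.0, K=6.0):
    """Equation (4): a sigmoid cuts off the summed ridges, leaving a
    locally responsive region. Steepness follows equation (3),
    k_i = K0 / b_i, and the offset B is the input dimensionality."""
    B = len(x)
    s = sum(ridge(x[i], centres[i], breadths[i], K0 / breadths[i])
            for i in range(B))
    return 1.0 / (1.0 + math.exp(-(s - B) * K))

def network_output(x, units, weights):
    """Equation (5): weighted sum of the local unit activations."""
    return sum(w * local_unit(x, c, b) for (c, b), w in zip(units, weights))

# One 2-D unit centred at (0.5, 0.5) with half-widths 0.2 in each dimension.
unit = ([0.5, 0.5], [0.2, 0.2])
print(network_output([0.5, 0.5], [unit], [1.0]))  # ~0.49 at the unit centre
print(network_output([0.0, 0.9], [unit], [1.0]))  # ~0.0 away from the centre
```

Note that because of the offset B inside the sigmoid of equation (4), the maximum activation of a unit sits just below 0.5 with these settings; the contrast between inside and outside the responsive region is what matters for classification.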

3 The RULEX Rule Extraction Algorithm
A rule set can be formed from the network solution by interpreting each hidden unit as a single IF..THEN rule. The antecedents for the rule associated with a hidden unit are formed from the conditions which cause the hidden unit to produce appreciable output, i.e., an input pattern lies wholly within the hypercube represented by the responsive area of the hidden unit (each component x_i of the input vector x lies within the active range of its corresponding ridge). The rule describing the behaviour of the hidden unit will then be of the form:

IF \forall i, 1 \le i \le n : x_i \in [x_i^{lower}, x_i^{upper}]
THEN the pattern belongs to the target class

where x_i^{lower} represents the lower limit of activation of the ith ridge, x_i^{upper} represents the upper limit of activation of the ith ridge, and n is the dimensionality of the input. The range of input values, [x_i^{lower}, x_i^{upper}], in the ith input dimension which will produce appreciable output from the hidden unit's corresponding ith ridge can be calculated directly from the network equations without the need for employing a 'search and test' strategy. From equation (4) we see that the output of a hidden unit is determined by the sum of the activations of all its component ridges. Therefore, the minimum possible activation of an individual ridge, the ith ridge say, in a hidden unit that has activation barely greater than its threshold will occur when all ridges other than the ith ridge have maximum activation:

minact(U_i^+ - U_i^-) = B - \frac{\ln\left(\frac{w_q}{OT} - 1\right)}{K} - (B - 1)\,maxact(U_b^+ - U_b^-)   …(6)

where B is the dimensionality of the input patterns, w_q is the output weight of the hidden unit, OT is the activation threshold, and maxact(U_b^+ - U_b^-) is the maximum possible activation for any ridge in the hidden unit. As B, K, w_q, and OT are all constants, and maxact(U_b^+ - U_b^-) can be calculated, the value of the minimum activation of the ith ridge, minact(U_i^+ - U_i^-), can be calculated in a straightforward manner.

Let \alpha = minact(U_i^+ - U_i^-), m = e^{-(x_i - c_i)k_i}, and n = e^{b_i k_i}. From equations (6), (1) and (2) we have:

\alpha = \frac{1}{1 + m/n} - \frac{1}{1 + mn}   …(7)

Solving equation (7) for m and then back-substituting for m and n we have:

x_i = c_i - \frac{1}{k_i}\ln\left(\frac{(1-\alpha)e^{2b_i k_i} - (1+\alpha) \pm \sqrt{(\alpha-1)^2 e^{4b_i k_i} - 2(\alpha^2+1)e^{2b_i k_i} + (\alpha+1)^2}}{2\alpha\, e^{b_i k_i}}\right)   …(8)

Thus for the ith ridge the values of the extremities of the active range, x_i^{lower} and x_i^{upper}, are given by the expressions:

x_i^{lower} = c_i - \frac{\beta_{lower}}{k_i}   …(9)

x_i^{upper} = c_i + \frac{\beta_{upper}}{k_i}   …(10)

where \beta_{lower} and \beta_{upper} are the magnitudes of the ln expression in equation (8) evaluated with the positive and negative roots respectively (by the symmetry of the ridge about its centre the two magnitudes are equal).
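The active-range computation can be traced end-to-end. The sketch below follows equations (6)-(10); the output weight w_q and activation threshold OT are given assumed illustrative values (the paper treats them as known constants).

```python
import math

def ridge_active_range(c_i, b_i, k_i, B, K=6.0, w_q=1.0, OT=0.3):
    """Equations (6)-(10): the interval of x_i over which the i-th ridge
    keeps a threshold-level unit active, assuming all other ridges are
    maximally active. w_q and OT are illustrative values, not from the paper."""
    n = math.exp(b_i * k_i)
    maxact = (n - 1.0) / (n + 1.0)        # ridge activation at x_i = c_i
    # Equation (6): minimum required activation alpha of this ridge.
    alpha = B - math.log(w_q / OT - 1.0) / K - (B - 1) * maxact
    # Equation (8): larger root of the quadratic in m = e^(-(x_i - c_i) k_i).
    disc = (1 - alpha)**2 * n**4 - 2 * (1 + alpha**2) * n**2 + (1 + alpha)**2
    m = ((1 - alpha) * n**2 - (1 + alpha) + math.sqrt(disc)) / (2 * alpha * n)
    beta = math.log(m)                    # by symmetry, beta_lower == beta_upper
    # Equations (9) and (10).
    return c_i - beta / k_i, c_i + beta / k_i

# First ridge of a 2-D unit: centre 0.5, breadth 0.2, steepness k = 6/0.2 = 30.
print(ridge_active_range(0.5, 0.2, 30.0, B=2))   # approx (0.36, 0.64)
```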

4 The RULEIN Rule Initialisation Algorithm
In turning a propositional IF..THEN rule into the parameters that define a local response unit it is necessary to determine from the rule the active range of each ridge in the unit to be configured. This means setting appropriately the upper and lower bounds of the active range of each ridge, x_i^{lower} and x_i^{upper}, and then calculating the centre, breadth, and steepness parameters (c_i, b_i, k_i) according to the equations given below. Setting x_i^{lower} and x_i^{upper} appropriately involves choosing values such that they 'cut off' the range of antecedent clause values. For discriminating ridges, i.e., those ridges that represent input pattern attributes that are used by the unit in classifying input patterns, these required values will be those that are mentioned in the antecedent of the rule to be encoded. For non-discriminating ridges the active range can be set to include all possible input values in the corresponding input dimension. (Non-discriminating ridges are those that correspond to input pattern attributes that do not appear as antecedent clauses of the rule to be encoded.)

The ridge centre c_i can be calculated as:

c_i = \frac{x_i^{upper} + x_i^{lower}}{2}   …(11)

The edge steepness k_i can be calculated as:

k_i = \frac{\beta_{lower}}{c_i - x_i^{lower}}   …(12)

From equation (3) the breadth b_i can be calculated as:

b_i = \frac{K_0}{k_i}   …(13)
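A minimal sketch of the conversion follows (our reading of equations (11)-(13); the activation level alpha required at the interval endpoints is treated here as an assumed free parameter, and since b_i k_i = K_0 by equation (3), the beta of equation (8) depends only on alpha and K_0):

```python
import math

def rulein_ridge(x_lower, x_upper, K0=6.0, alpha=0.8):
    """Turn one antecedent clause x_i in [x_lower, x_upper] into ridge
    parameters (c_i, b_i, k_i) via equations (11)-(13). alpha is the
    assumed ridge activation at the interval endpoints."""
    c = (x_upper + x_lower) / 2.0                       # equation (11)
    n = math.exp(K0)                                    # e^(b_i k_i) = e^K0
    disc = (1 - alpha)**2 * n**4 - 2 * (1 + alpha**2) * n**2 + (1 + alpha)**2
    beta = math.log(((1 - alpha) * n**2 - (1 + alpha) + math.sqrt(disc))
                    / (2 * alpha * n))                  # ln term of equation (8)
    k = beta / (c - x_lower)                            # equation (12)
    b = K0 / k                                          # equation (13)
    return c, b, k

# Encode the clause x_i in [0.36, 0.64]:
print(rulein_ridge(0.36, 0.64))   # approx (0.5, 0.18, 33.0)
```

Non-discriminating ridges are handled the same way, with [x_lower, x_upper] set to span the whole input range of the corresponding dimension.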

After the network has been initialised, hidden units that represent rules that are known to be accurate or deemed to be important can be 'frozen' so that they will not be altered during network training.

5 Evaluating RULEX Against Rule Quality Criteria
In this section RULEX is evaluated according to the criteria described by Towell and Shavlik [1993] and revised by Andrews, Diederich and Tickle [1995]. These criteria are (a) accuracy; (b) fidelity; (c) consistency; and (d) comprehensibility of the extracted rule sets. In this context a rule set is considered to be accurate if it can correctly classify previously unseen examples. (Accuracy is calculated as the percentage of correctly classified patterns in the test set data.) Similarly, a rule set is considered to display a high level of fidelity if it can mimic the behaviour of the Artificial Neural Network from which it was extracted by capturing all of the information embodied in the ANN. (Fidelity here is calculated as the percentage agreement between the generalisation performance of the ANN on the test set and the generalisation performance of the extracted rule set on the test set data.) A rule extraction algorithm is deemed to be consistent if, under different runs of the algorithm on the same trained ANN and test set, the algorithm generates the same rule set. Finally, the comprehensibility of a rule set is determined by measuring the size of the rule set (in terms of the number of rules) and the number of antecedents per rule.

Experiments were conducted on a number of datasets from the University of California, Irvine machine learning repository. The datasets included purely discrete valued inputs (MONK problems, Mushroom), purely continuous valued inputs (Iris), as well as a dataset that contained a mixture of discrete and continuous valued inputs (Wisconsin Heart Disease). In the case of the MONK datasets results are averages over 100 trials. 10-fold cross-validation was used for the remaining datasets, with the results presented in Table 1 being averages over the 10 folds.

Table 1.

| Problem Domain | RBP Test Set Accuracy | Extracted Rules Test Set Accuracy | Fidelity | Number of Extracted Rules | Number of Antecedents per Rule |
|---|---|---|---|---|---|
| MONK1 | 100% | 100% | 100% | 4 | 1.75 |
| MONK2 | 84.77% | 82.9% | 97.8% | 15 | 4.5 |
| MONK3 | 97.2% | 97.2% | 100% | 1 | 2 |
| Iris | 91% | 89% | 97.8% | 3 | 7 |
| Wisconsin Heart Disease | 85.5% | 83.5% | 97.67% | 4 | 7 |
| Mushroom | 99.8% | 99.8% | 100% | 3 | 16 |

These results show that RULEX is capable of extracting accurate, comprehensible rules that display high fidelity with the network from which they were extracted. RULEX has a significant advantage over other decompositional rule extraction techniques in that the rules can be extracted directly from the network weights without the necessity of employing an exhaustive, or heuristically limited, 'search and test' strategy [Fu, 1991, 1994; Saito and Nakano, 1988] to determine the conditions under which a hidden/output unit will be active.
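As a concrete reading of the measures in Table 1, a small sketch follows (our formulation, not code from the paper; fidelity is computed here as the ratio of the rule set's test-set accuracy to the network's, which reproduces the Table 1 fidelity figures):

```python
def accuracy(preds, labels):
    """Accuracy: percentage of test patterns classified correctly."""
    return 100.0 * sum(p == y for p, y in zip(preds, labels)) / len(labels)

def fidelity(net_accuracy, rule_accuracy):
    """Fidelity: agreement between the generalisation performance of the
    extracted rules and that of the network they were extracted from."""
    return 100.0 * rule_accuracy / net_accuracy

# MONK2 row of Table 1: network 84.77%, extracted rules 82.9%.
print(round(fidelity(84.77, 82.9), 1))   # 97.8
```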

6 Evaluating RULEIN
To evaluate the efficacy of the RULEIN procedure, experiments were conducted using the MONK1 dataset [Thrun, 1991]. Here the target concept is given by the rule: IF Jacket Colour = Red OR Head Shape = Body Shape THEN the pattern is in the target class. The objective of the first experiment was to determine the effect of initialising the RBP network with a partially correct theory on the training time and accuracy of the trained model, as well as on the quality of the extracted rule set. Results presented are averages. Table 2 below shows results where the inserted rules were fixed. Table 3 gives results for the same experiment where the inserted rules were allowed to be modified during training.

Table 2. Inserted Rules Fixed During Training

| Domain Theory Used for Network Initialisation | RBP Training Epochs | RBP Test Set Accuracy | Number of Extracted Rules | Extracted Rule Set Accuracy |
|---|---|---|---|---|
| No Domain Theory Inserted | 260 | 100% | 4 | 100% |
| Correct Domain Theory Inserted (4 Rules) | 0 | 100% | 4 | 100% |
| 3 Correct Rules | 20 | 100% | 4 | 100% |
| 2 Correct Rules | 100 | 100% | 4 | 100% |
| 1 Correct Rule | 150 | 100% | 4 | 100% |
| 1 Incorrect Rule | 290 | 100% | 5 | 100% |

Table 3. Inserted Rules Allowed To Be Modified During Training

| Domain Theory Used for Network Initialisation | RBP Training Epochs | RBP Test Set Accuracy | Number of Extracted Rules | Extracted Rule Set Accuracy |
|---|---|---|---|---|
| No Domain Theory Inserted | 260 | 100% | 4 | 100% |
| Correct Domain Theory Inserted (4 Rules) | 0 | 100% | 4 | 100% |
| 3 Correct Rules (Including Jacket=Red) | 54 | 100% | 4 | 100% |
| 3 Correct Rules (Excluding Jacket=Red) | 75 | 95.6% | 6 | 95.8% |
| 2 Correct Rules (Including Jacket=Red) | 170 | 99.76% | 5 | 100% |
| 2 Correct Rules (Excluding Jacket=Red) | 535 | 93.75% | 5 | 93.75% |
| 1 Correct Rule (Including Jacket=Red) | 190 | 100% | 4 | 100% |
| 1 Correct Rule (Excluding Jacket=Red) | 600 | 90.2% | 8 | 88% |
| 1 Partially Correct Rule | 250 | 100% | 4 | 100% |
| 1 Incorrect Rule | 680 | 98% | 5 | 98% |

Conclusion
RULEIN/RULEX have been demonstrated to be capable of performing rule refinement on domain theories that are weak. RULEIN is able to convert a set of propositional rules into the parameters that define the hidden unit basis functions of an RBP network. Initialisation and subsequent training has been shown to be faster than beginning training from a tabula rasa configuration, with the number of training epochs required inversely related to the strength of the domain theory used to initialise the network. The RBP network has been shown to be able to remove inaccuracies from inserted rules (in the case where the rules are not fixed during training), and to add extra rules to supplement the initial domain theory in the case where the initial domain theory was weak. 'Freezing' inserted rules that are known to be accurate has been shown to be beneficial in biasing the network towards a 'correct' solution in minimal time. In cases where an incorrect rule was frozen, the network was able to learn a correct solution by adding an extra rule that effectively cancelled out the incorrect rule. In cases where the network was initialised with a weak domain theory and was free to modify this theory during training, it did not always converge to an optimal solution. This may have been because there was sufficient support in the training set for the modified representation developed during training and not enough counter-examples to force the network to develop the correct solution. This phenomenon deserves further study. RULEX can extract comprehensible, accurate rules from the RBP network in a computationally efficient manner, and in some cases the extracted rule set outperforms the network from which it was extracted. The RULEIN/RULEX method has application in many areas but is particularly suited to overcoming the so-called 'knowledge acquisition bottleneck' in the knowledge engineering phase of rule-based expert system construction.

References
[Andrews, Diederich and Tickle, 1995] R. Andrews, J. Diederich and A. Tickle, Survey and Critique of Techniques for Extracting Rules From Trained Artificial Neural Networks, Knowledge Based Systems 8(6), 373-389, 1995.
[Andrews and Geva, 1995] R. Andrews and S. Geva, RULEX & CEBP Networks as the Basis For a Rule Refinement System, in Hybrid Problems, Hybrid Solutions, J. Hallam (Ed.), IOS Press, 112, 1995.
[Fu, 1991] L.M. Fu, Rule Learning By Searching on Adapted Nets, Proceedings of the Ninth National Conference on Artificial Intelligence, AAAI/MIT Press, Anaheim CA, 590-595, 1991.
[Fu, 1994] L.M. Fu, Rule Generation From Neural Networks, IEEE Transactions on Systems, Man, and Cybernetics, 24(8), 1114-1124, 1994.
[Geva and Sitte, 1992] S. Geva and J. Sitte, A Constructive Method of Multivariate Function Approximation by MultiLayer Perceptrons, IEEE Transactions on Neural Networks, 1992.
[Geva and Sitte, 1994] S. Geva and J. Sitte, Constrained Gradient Descent, Proceedings of the 5th Australian Conference on Neural Computing, Brisbane, Australia, 1994.
[Lapedes and Farber, 1987] A. Lapedes and R. Farber, How Neural Nets Work, Advances in Neural Information Processing Systems, D.Z. Anderson (Ed.), American Institute of Physics, New York, 442-456, 1987.
[Saito and Nakano, 1988] K. Saito and R. Nakano, Medical Diagnostic Expert System Based on PDP Model, Proceedings of the IEEE Conference on Neural Networks, IEEE Press, San Diego CA, 255-262, 1988.
[Thrun, 1991] S. Thrun et al., The MONK's Problems: A Performance Comparison of Different Learning Algorithms, Technical Report CMU-CS-91-197, Carnegie Mellon University, December 1991.
[Towell and Shavlik, 1993] G. Towell and J. Shavlik, Extracting Refined Rules From Knowledge Based Neural Networks, Machine Learning, 13(1), 71-101, 1993.
