An algorithm of knowledge extraction from trained neural networks

Tadeusz Wieczorek and Slawomir Golak

The Silesian University of Technology, Department of Electrotechnology, Division of Informatics and Modeling of Technological Processes
Abstract. This paper describes a method of knowledge extraction based on analysis of the weights of a trained ANN. The method makes it possible to determine the significance of particular inputs, to reveal their synergy, and to find symbolic rules that determine the direction of influence of particular inputs.
1 Introduction
There are many papers that analyse and try to interpret a trained neural network. Tickle and Andrews [1] present a review of ANN rule extraction techniques and introduce a taxonomy for categorizing the wide variety of techniques that have been developed for extracting rules from trained networks. These methods focus on analysis of the parameters (weights) of trained networks. Other approaches, developed by Towell, Shavlik and Craven, are based on non-standard forms of rules, such as M-of-N (M out of N antecedents should be true) or decision trees [2]. ANNs may also be considered as neurofuzzy systems and their outputs interpreted in a fuzzy-logic sense using membership functions [3,4]. An overview of neural rule extraction methods [3] and a new methodology for the extraction of crisp and fuzzy logical rules [2] have been given by Duch et al. Most of these papers refer to neural network classifiers with binary outputs (for classification or pattern recognition problems). For ANNs providing a continuous mapping (regression problems), which are widely used for modelling technological and economic processes, there are no universal methods for extracting the knowledge embedded within them [5].
2 Analysis of significance
We claim that if the derivative of the network's underlying function with respect to a chosen component of the input vector differs from zero on some area of the input space, then that component has a significant influence on the network output [6]. A positive value of the derivative with respect to an input means that increasing the input increases the output. The output value of the entire network is the result of multiple compositions of the activation functions of particular neurons. If the activation functions are continuous and differentiable, it is possible to calculate the derivatives of the network's underlying function:
\[
\frac{\partial y_j^{(i)}}{\partial x_n^{(1)}} = f'\!\left(\sum_{k=0}^{N_{i-1}} x_k^{(i)} w_{jk}^{(i)}\right) \sum_{k=1}^{N_{i-1}} \frac{\partial y_k^{(i-1)}}{\partial x_n^{(1)}}\, w_{jk}^{(i)} \tag{1}
\]
where $x_i^{(k)}$ is the $i$-th input of the $k$-th layer ($i = 0, \ldots, N_{k-1}$; $k = 1, \ldots, L$), $w_{ij}^{(k)}$ is the weight of the $j$-th input of the $i$-th neuron in the $k$-th layer ($i = 1, \ldots, N_k$; $k = 1, \ldots, L$; $j = 1, \ldots, N_{k-1}$), and $y$ is the output. Formula (1) should be interpreted as a recurrent one: starting from layer one we calculate the output derivatives, and then proceed through the following layers according to (1). We can define the influence measure $q_{jk}$ (the influence of the $j$-th network input on the $k$-th network output) as follows:

\[
q_{jk} = \frac{1}{M} \sum_{i=1}^{M} \frac{\partial y_k^{(L)}(x_i^{(1)})}{\partial x_j^{(1)}} \tag{2}
\]
where $x_i$ are the generated input vectors ($M$ vectors). The measure (2) can be used only under the condition that the influence of the examined input ($x_j^{(1)}$) on the network output ($y_k^{(L)}$) is one-directional (only increasing or only decreasing). Any point of the input space is represented by an input vector ($x_i^{(1)}$). The set of $M$ randomly generated vectors should cover the whole input data space as exactly as possible; this can be done, for example, by uniform scanning of all feature ranges [7]. The final measure, called the significance measure ($I_{ik}$), has been determined taking into consideration not only the influence of the examined feature, but also the influence of other features that are correlated with the analyzed feature:

\[
I_{ik} = \frac{\sum_{j=1}^{N_0} a_{ij} q_{jk}}{\sum_{j=1}^{N_0} |q_{jk}|}, \qquad i = 1, \ldots, N_0; \; k = 1, \ldots, N_L \tag{3}
\]
where $a_{ij}$ is the linear regression coefficient between the $i$-th and $j$-th features.
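To make the computation concrete, a minimal Python sketch of the significance analysis follows (our illustration, not the authors' code): derivatives are propagated with the recurrence (1) through a tanh MLP, $q_{jk}$ is estimated as in (2) over $M$ random input vectors, and $I_{ik}$ follows (3). The weight layout (bias in column 0), the use of correlation coefficients for $a_{ij}$, and all names are our assumptions:

import numpy as np

def forward_with_grads(weights, x):
    # Forward pass of a tanh MLP, propagating dy/dx with recurrence (1).
    # weights: list of arrays W of shape (N_i, N_{i-1} + 1), bias in column 0
    # (an assumed layout). Returns output y and the Jacobian dy/dx.
    y, dy = x, np.eye(len(x))          # dy starts as the identity: dx_n/dx_n = 1
    for W in weights:
        net = W[:, 0] + W[:, 1:] @ y   # weighted net input of each neuron
        y = np.tanh(net)
        dy = (1.0 - y**2)[:, None] * (W[:, 1:] @ dy)   # recurrence (1), f' = 1 - tanh^2
    return y, dy

def significance(weights, M=10000, rng=None):
    # q_jk as in (2): derivative averaged over M random points of [0,1]^N_0,
    # then I_ik as in (3); a_ij is taken here as the feature correlation matrix
    # (one plausible reading of "linear regression coefficient").
    rng = rng or np.random.default_rng(0)
    n_in = weights[0].shape[1] - 1
    X = rng.uniform(0.0, 1.0, size=(M, n_in))   # uniform scan of the feature ranges
    q = np.mean([forward_with_grads(weights, x)[1] for x in X], axis=0).T   # (N_0, N_L)
    a = np.corrcoef(X, rowvar=False)            # a_ij, shape (N_0, N_0)
    return (a @ q) / np.abs(q).sum(axis=0, keepdims=True)                  # formula (3)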
3 Synergy analysis of features
The presented synergy analysis is based on the following assumption: if the second mixed derivative of the function with respect to two different inputs (features) differs from zero in some area of the input space, then in this area there is synergy between the features. For the synergy analysis, as well as for the significance analysis, the choice of activation functions for all neurons is important.
Therefore we recommend the hyperbolic tangent activation function, at least for the output neurons. A measure of the interaction of two features ($P_{ijn}$) is given by equation (4), but it can be used only under the condition that their interaction has the same direction in the entire feature space:

\[
P_{ijn} = \frac{1}{M} \sum_{k=1}^{M} \frac{\partial^2 y_n^{(L)}(x_k^{(1)})}{\partial x_i^{(1)} \partial x_j^{(1)}}; \qquad n = 1, \ldots, N_L; \; i, j = 1, \ldots, N_0 \tag{4}
\]
The final measure of synergy $S_{ijn}$ is determined by (5), which takes correlated features into consideration:

\[
S_{ijn} = \frac{\sum_{k=1}^{N_0} a_{ik} P_{kjn}}{\left|\max_{(i,j,n)} S_{ijn}\right|}; \qquad i, j = 1, \ldots, N_0; \; n = 1, \ldots, N_L \tag{5}
\]

where $a_{ik}$ is the coefficient of linear regression between features $i$ and $k$.
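A sketch of the synergy computation, under our own simplifying assumption: instead of propagating second derivatives analytically through the network, the mixed partial in (4) is approximated by central finite differences on the trained model f (any callable mapping an input vector to an output vector). The names, the step size h, and the use of correlation coefficients for $a_{ik}$ are illustrative; the code favours clarity over speed:

import numpy as np

def mixed_partial(f, x, i, j, h=1e-3):
    # Central-difference estimate of d2 f / (dx_i dx_j) at point x.
    def shift(di, dj):
        z = x.copy(); z[i] += di; z[j] += dj
        return f(z)
    return (shift(h, h) - shift(h, -h) - shift(-h, h) + shift(-h, -h)) / (4 * h * h)

def synergy(f, n_in, M=2000, rng=None):
    # P_ijn as in (4), averaged over M random points, then S_ijn as in (5).
    rng = rng or np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(M, n_in))
    n_out = np.atleast_1d(f(X[0])).size
    P = np.zeros((n_in, n_in, n_out))
    for i in range(n_in):
        for j in range(n_in):
            P[i, j] = np.mean([mixed_partial(f, x, i, j) for x in X], axis=0)
    a = np.corrcoef(X, rowvar=False)       # a_ik as a correlation matrix (assumption)
    S = np.einsum('ik,kjn->ijn', a, P)     # numerator of (5)
    return S / np.abs(S).max()             # normalise by the largest magnitude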
4 Algorithm of rules generation
Since a feature does not necessarily act in only one direction, we have to generate separate rules for each interval of monotonicity. We have used a genetic algorithm for the generation of rules (a similar method for generating fuzzy rules has been applied by Herrera et al. [8]). The chromosomes of individuals in a population contain an encoded linguistic rule of the form:
IF $\{x_1^{(1)} \in [a_1, b_1] \wedge x_2^{(1)} \in [a_2, b_2] \wedge \ldots \wedge x_{N_0}^{(1)} \in [a_{N_0}, b_{N_0}]\}$
THEN $x_n^{(1)}$ ($n = 1, \ldots, N_0$) {is significant and increases the output / is insignificant / is significant and decreases the output}

where $[a_i, b_i]$ are the intervals of variability of the particular features. The ranges of the particular arguments (features) are coded inside a chromosome as real numbers, and the classes of reaction of a feature $x_i^{(1)}$ (decrease, increase, neutral) are coded discretely. A fitness function that expresses the precision of the rule represented by an individual chromosome has been determined as:

\[
p = n_r - n_f \tag{6}
\]
where $n_r$ is the number of cases correctly described by the rule, and $n_f$ the number of cases classified wrongly. Evolution tends towards generalization of the rules, as long as precision does not decrease. After a rule has been determined, all the cases it classifies correctly are eliminated from the set of cases, and the evolution process is started again. The algorithm ends either when the cases are exhausted or when the generated rules become too detailed. The described algorithm is carried out separately for each feature, as sketched below.
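The rule-generation loop can be sketched as follows (a minimal illustration under our own assumptions about the representation and operators, not the authors' exact settings). A chromosome holds one interval $[a_i, b_i]$ per feature plus a reaction class; the array `reactions` is assumed to hold the observed class of each case (increase / neutral / decrease, e.g. from the sign of the derivative); fitness is $p = n_r - n_f$ as in (6):

import numpy as np

rng = np.random.default_rng(0)

def fitness(rule, X, reactions):
    # p = n_r - n_f from (6): cases inside the rule's box whose observed
    # reaction class matches (n_r) or contradicts (n_f) the rule's class.
    lo, hi, cls = rule
    inside = np.all((X >= lo) & (X <= hi), axis=1)
    return np.sum(inside & (reactions == cls)) - np.sum(inside & (reactions != cls))

def evolve_rule(X, reactions, pop=50, gens=200):
    # One GA run: chromosome = (lower bounds, upper bounds, reaction class),
    # real-coded intervals with Gaussian mutation, truncation selection.
    n = X.shape[1]
    popn = [(rng.uniform(0, 0.5, n), rng.uniform(0.5, 1, n), int(rng.integers(3)))
            for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=lambda r: -fitness(r, X, reactions))
        popn = popn[:pop // 2]                  # keep the better half
        children = []
        for lo, hi, cls in popn:
            lo2 = np.clip(lo + rng.normal(0, 0.05, n), 0, 1)
            hi2 = np.clip(hi + rng.normal(0, 0.05, n), 0, 1)
            cls2 = cls if rng.random() > 0.1 else int(rng.integers(3))
            children.append((np.minimum(lo2, hi2), np.maximum(lo2, hi2), cls2))
        popn += children
    return max(popn, key=lambda r: fitness(r, X, reactions))

An outer loop would call evolve_rule, remove the correctly covered cases, and restart until the cases are exhausted or the rules become too detailed, exactly as described above.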
If the interaction between two features varies in different areas of the input space then, as in the significance analysis, the described genetic algorithm is used to generate linguistic rules of the following form:
IF $\{x_1^{(1)} \in [a_1, b_1] \wedge \ldots \wedge x_{N_0}^{(1)} \in [a_{N_0}, b_{N_0}]\}$
THEN {between $x_n^{(1)}$ and $x_m^{(1)}$ there is synergy and they weaken one another / are neutral to each other / intensify one another} (7)
5 Computer experiment
To verify the presented method, an experiment has been carried out. The testing network, a multilayer perceptron with 71 inputs and one output, has an input layer and three processing layers. All the neurons have hyperbolic tangent activation functions. The network has been trained with data calculated according to (8), so it can be considered a neural model of the process described by (8):
\[
y = 2x_1 + \sin(\pi x_2) + \cos(\pi x_3) + 3 x_4 x_5 + x_6 \cos(x_7) \tag{8}
\]

where $x_i \in [0, 1]$, $i = 1, \ldots, 7$.
It is obvious that only 7 features ($x_1$ to $x_7$) are important; the remaining 64 input parameters are irrelevant and should not have any influence on the output $y$. In formula (8) there is synergy both between $x_4$ and $x_5$ and between $x_6$ and $x_7$, as well as a bi-directional influence of $x_2$.
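For concreteness, the test mapping (8) and the training data could be generated as follows (a sketch; the sample size and filling the 64 irrelevant inputs with uniform noise are our assumptions):

import numpy as np

def target(x):
    # Test function (8); only x[0]..x[6] of the 71 inputs matter.
    return (2 * x[..., 0] + np.sin(np.pi * x[..., 1]) + np.cos(np.pi * x[..., 2])
            + 3 * x[..., 3] * x[..., 4] + x[..., 5] * np.cos(x[..., 6]))

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(5000, 71))   # 7 relevant + 64 irrelevant inputs
y = target(X)                                # training targets for the MLP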
Table 1. Significance analysis of the testing model

No   Parameter   Direction of influence   Significance I_ik
1    x3          -                        15.82
2    x1          +                        14.04
3    x4          +                        13.30
4    x5          +                        12.33
5    x6          +                         5.31
6    x7          -                         4.00
7    x2          +/-                       1.98
8    x34         +                         1.79
9    x10         -                         1.31
10   x33         +                         1.20
The significance analysis proved that the inputs $x_1, x_2, x_3, x_4, x_5, x_6, x_7$ are important, since they are placed at the top of the significance table. Using the genetic algorithm we generated, among others, the following rule: IF $x_2 > 0.5$ THEN $x_2$ causes a decrease of the output, and for $x_2 < 0.5$ an increase.
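This rule agrees with (8): $\partial y / \partial x_2 = \pi \cos(\pi x_2)$ is positive for $x_2 < 0.5$ and negative for $x_2 > 0.5$. A one-line numerical check (illustrative):

import numpy as np
x2 = np.array([0.25, 0.75])
print(np.pi * np.cos(np.pi * x2))   # [ 2.2214, -2.2214]: increase below 0.5, decrease above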