Rule Extraction From Local Cluster Neural Nets

Robert Andrews & Shlomo Geva
Machine Learning Research Centre
Faculty of Information Technology
Queensland University of Technology
GPO Box 2434, Brisbane, Q 4001, Australia.
[email protected]  [email protected]

Submitted to Neurocomputing, February 2000

Abstract

This paper describes RULEX, a technique for providing an explanation component for Local Cluster (LC) neural networks. RULEX extracts symbolic rules from the weights of a trained LC net. LC nets are a special class of multilayer perceptrons that use sigmoid functions to generate localised functions. LC nets are well suited to both function approximation and discrete classification tasks. The restricted LC net is constrained in such a way that the local functions are 'axis parallel', thus facilitating rule extraction. This paper presents results for the LC net on a wide variety of benchmark problems and shows that RULEX produces comprehensible, accurate rules that exhibit a high degree of fidelity with the LC network from which they were extracted.

Introduction

In [1] Geva et al. describe the Local Cluster (LC) network, a sigmoidal perceptron with 2 hidden layers in which the connections are restricted in such a way that clusters of sigmoids form local response functions similar to Radial Basis Functions (RBFs). They give a construction and training method for LC networks and show that these networks (i) exceed the function representation capability of generalised Gaussian networks, and (ii) are suitable for discrete classification. They also describe a restricted version of the LC network and state that this version of the network is suitable for rule extraction, without however describing how this is possible.

Local function networks are attractive for rule extraction for two reasons. Firstly, it is conceptually easy to see how the weights of a local response unit can be converted to a symbolic rule. Local function units are hyper-ellipsoids in input space and can be described in terms of a reference vector that represents the centre of the hyper-ellipsoid and a set of radii that determine the effective range of the hyper-ellipsoid in each input dimension. The rule derived from the local function unit is formed by the conjunct of these effective ranges in each dimension. Rules extracted from each local function unit are thus propositional and of the form:

IF $\forall\, 1 \le i \le n : x_i \in [x_i^{lower}, x_i^{upper}]$ THEN pattern belongs to the target class    ...(1)

where $[x_i^{lower}, x_i^{upper}]$ represents the effective range in the ith input dimension. Secondly, because each local function unit can be described by the conjunct of ranges of values in each input dimension, it is easy to add units to the network during training such that the added unit has a meaning that is directly related to the problem domain. In networks that employ incremental learning schemes a new unit is added when there is no significant improvement in the global error. The unit is chosen such that its reference


vector, i.e., the centre of the unit, is one of the as yet unclassified points in the training set. Thus the premise of the rule that describes the new unit is the conjunction of the attribute values of the data point, with the rule consequent being the class to which the point belongs.

In recent years there has been a proliferation of methods for extracting rules from trained artificial neural networks (see [2][3] for surveys of the field). While there are many methods for extracting rules from specialised networks, the majority of techniques focus on extracting rules from MLPs. There are a small number of published techniques for extracting rules from local basis function networks. Tresp, Hollatz & Ahmad [4] describe a method for extracting rules from Gaussian RBF units. Berthold & Huber [5][6] describe a method for extracting rules from a specialised local function network, the RecBF network. Abe & Lan [7] describe a recursive method for constructing hyper-boxes and extracting fuzzy rules from them. Duch et al. [8] describe a method for extraction, optimisation and application of sets of fuzzy rules from 'soft trapezoidal' membership functions which are formed using a method similar to that described in this paper.

In this paper we briefly describe the restricted LC network and introduce the RULEX algorithm for extracting symbolic rules from the weights of a trained, restricted LC neural net. The remainder of this paper is organised as follows. Section 2 describes the restricted LC net. Section 3 describes the ADT taxonomy [2][3] for classifying rule extraction techniques and introduces the RULEX algorithm. Section 4 presents comparative results for the LC, Nearest Neighbour and C4.5 techniques on some benchmark problems. Section 5 presents an assessment of RULEX in terms of the rule quality criteria of the ADT taxonomy. The paper closes with section 6 where we put our main findings into perspective.

2. The Restricted Local Cluster Network

Geva et al. [1] show that a region of local response is formed by the difference of two appropriately parameterised, parallel, displaced sigmoids:

$$l(\mathbf{w}, \mathbf{r}, \mathbf{x}) = l^{+}(\mathbf{w}, \mathbf{r}, \mathbf{x}) - l^{-}(\mathbf{w}, \mathbf{r}, \mathbf{x}) = \sigma(k_1, \mathbf{w}^{T}(\mathbf{x} - \mathbf{r}) + 1) - \sigma(k_1, \mathbf{w}^{T}(\mathbf{x} - \mathbf{r}) - 1) \quad \ldots(2)$$

where $\sigma(k, h) = \dfrac{1}{1 + e^{-kh}}$.

The 'ridge' function $l$ in equation (2) above is almost zero everywhere except in the region between the steepest parts of the two logistic sigmoid functions. The parameter $\mathbf{r}$ is a reference vector; the width of the ridge is given by the reciprocal of $|\mathbf{w}|$; and the value of $k_1$ determines the shape of the ridge (which can vary from a rectangular impulse for large values of $k_1$ to a broad bell shape for small values of $k_1$).
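For illustration, the following short Python sketch (ours, not from [1]; the parameter names are assumptions) evaluates the one-dimensional ridge of equation (2) for an axis-parallel weight:

```python
import numpy as np

def sigma(k, h):
    """Logistic sigmoid sigma(k, h) = 1 / (1 + exp(-k*h)), as in the footnote to equation (2)."""
    return 1.0 / (1.0 + np.exp(-k * h))

def ridge(x, w, r, k1):
    """Difference of two displaced sigmoids (equation (2)) for a single input dimension."""
    h = w * (x - r)
    return sigma(k1, h + 1.0) - sigma(k1, h - 1.0)

# The ridge peaks at x = r and its half-width is roughly 1/|w|;
# larger k1 gives a more rectangular profile, smaller k1 a broader bell.
xs = np.linspace(-3, 3, 7)
print(ridge(xs, w=2.0, r=0.0, k1=10.0))
```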


Adding n ridge functions $l$ with different orientations but a common centre produces a function $f$ that peaks at the centre where the ridges intersect, but with the component ridges radiating on both sides of the centre. To make the function local these component ridges must be 'cut off' without introducing discontinuities into the derivatives of the local function. The function

$$f(\mathbf{w}, \mathbf{r}, \mathbf{x}) = \sum_{i=1}^{n} l(\mathbf{w}_i, \mathbf{r}, \mathbf{x}) \quad \ldots(3)$$

is the sum of the n ridge functions, and the function

$$L(\mathbf{w}, \mathbf{r}, \mathbf{x}) = \sigma_0(k_2, f(\mathbf{w}, \mathbf{r}, \mathbf{x}) - d) \quad \ldots(4)$$

eliminates the unwanted regions of the radiating ridge functions when $d$ is selected to ensure that the maximum value of the function $f$, located at $\mathbf{x} = \mathbf{r}$, coincides with the centre of the linear region of the output sigmoid $\sigma_0$. The value of $d$ is given as

$$d = n\left(\frac{1}{1+e^{-k_1}} - \frac{1}{1+e^{k_1}}\right)$$

where n is the input dimensionality. The parameter $k_2$ determines the steepness of the output sigmoid $\sigma_0$.

Geva et al. [1] show that a target function $y^{*}(\mathbf{x})$ can be approximated by a function $y(\mathbf{x})$ which is a linear combination of m local cluster functions with centres $\mathbf{r}_\mu$ distributed over the domain of the function. The expression

$$y(\mathbf{x}) = \sum_{\mu=1}^{m} v_\mu L(\mathbf{w}_\mu, \mathbf{r}_\mu, \mathbf{x}) \quad \ldots(5)$$

then describes the generalised LC network, where $v_\mu$ is the output weight associated with each of the individual local cluster functions $L$. (Network output is simply the weighted sum of the outputs of the local clusters.) In the restricted version of the network the weight matrix $\mathbf{w}$ is diagonal:

$$\mathbf{w}_i = (0, \ldots, w_i, \ldots, 0), \quad i = 1 \ldots n \quad \ldots(6)$$

which simplifies the functions $l$ and $f$ as follows:

$$l(\mathbf{w}_i, \mathbf{r}, \mathbf{x}) = \sigma(k_1, w_i(x_i - r_i) + 1) - \sigma(k_1, w_i(x_i - r_i) - 1) \quad \ldots(7)$$

$$f(\mathbf{w}, \mathbf{r}, \mathbf{x}) = \sum_{i=1}^{n} l(w_i, r_i, x_i) \quad \ldots(8)$$

One further restriction is applied to the LC network in order to facilitate rule extraction, viz., the output weight $v_\mu$ of each local cluster is held constant. (The maximum value of $L$ is 0.5. Hence, for classification problems where the target values are {0, 1}, it is appropriate to set $v_\mu = 2$.) This measure prevents local clusters 'overlapping' in input space, thus allowing each local cluster to be individually decompiled into a rule. The final form of the restricted LC network for binary classification tasks is given by

$$y(\mathbf{x}) = \sum_{\mu=1}^{m} 2\, L(\mathbf{w}_\mu, \mathbf{r}_\mu, \mathbf{x}) \quad \ldots(9)$$

For multiclass problems several such networks can be combined, one network per class, with the output class being the maximum of the activations of the individual networks. The LC network is trained using gradient descent on an error surface. The training equations are given in Geva et al. [1] and need not be reproduced here.
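To make the construction concrete, the following Python sketch (our own illustration; function and parameter names are ours) evaluates the restricted LC output of equations (4) and (7)-(9) for a single input vector:

```python
import numpy as np

def sigma(k, h):
    # Logistic sigmoid from the footnote to equation (2).
    return 1.0 / (1.0 + np.exp(-k * h))

def local_cluster(x, w, r, k1, k2):
    """Restricted local cluster L of equations (7), (8) and (4).
    x, w, r are length-n vectors; w holds the diagonal weights."""
    # Axis-parallel ridges, one per input dimension (equation (7)).
    ridges = sigma(k1, w * (x - r) + 1.0) - sigma(k1, w * (x - r) - 1.0)
    f = ridges.sum()                                   # equation (8)
    n = len(x)
    d = n * (sigma(k1, 1.0) - sigma(k1, -1.0))         # offset d from the footnote to equation (4)
    return sigma(k2, f - d)                            # equation (4)

def restricted_lc_output(x, centres, widths, k1, k2):
    """Binary-classification output of equation (9): a sum of 2*L over all clusters."""
    return sum(2.0 * local_cluster(x, w, r, k1, k2) for w, r in zip(widths, centres))

# Toy example: one cluster centred at the origin in 2-D.
x = np.array([0.1, -0.2])
print(restricted_lc_output(x, centres=[np.zeros(2)], widths=[np.ones(2) * 3.0], k1=8.0, k2=4.0))
```

Since $L \le 0.5$ and $v_\mu = 2$, a single cluster contributes at most 1 to the output, which is what makes the {0, 1} classification targets attainable.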

3. A Taxonomy for Classifying Rule Extraction Techniques

Andrews et al. [2] describe the ADT taxonomy for describing rule extraction techniques. This taxonomy was refined in Tickle et al. [3] to better cater for the profusion of published techniques for eliciting knowledge from trained neural networks. The taxonomy consists of five primary classification criteria, viz.

a) the expressive power (or, alternately, the rule format) of the extracted rules;
b) the quality of the extracted rules;
c) the translucency of the view taken within the rule extraction technique of the underlying neural network;
d) the complexity of the rule extraction algorithm;
e) the portability of the rule extraction technique across various neural network architectures (i.e. the extent to which the underlying neural network incorporates specialised training regimes).

The expressive power of the rules describes the format of the extracted rules. Currently there exist rule/knowledge extraction techniques that extract rules in various formats including propositional rules [9][10][11], fuzzy rules [12][13], scientific laws [14], finite state automata [15], decision trees [16], and m-of-n rules [17].

The rule quality criterion is assessed via four characteristics, viz.

a) rule accuracy, the extent to which the rule set is able to classify a set of previously unseen examples from the problem domain;
b) rule fidelity, the extent to which the extracted rules mimic the behaviour of the network from which they were extracted;


c) rule consistency, the extent to which, under differing runs of the rule extraction algorithm, rule sets are generated which produce the same classifications of unseen examples;
d) rule comprehensibility, the size of the extracted rule set in terms of the number of rules and number of antecedents per rule.

The translucency criterion categorises a rule extraction technique according to the granularity of the neural network assumed by the rule extraction technique. Andrews et al. [2] use three key identifiers to mark reference points along a continuum of granularity from decompositional (rules are extracted at the level of individual hidden and output layer units) to pedagogical (the network is treated as a 'black box'; extracted rules describe global relationships between inputs and outputs; no analysis of the detailed characteristics of the neural network itself is undertaken).

The algorithmic complexity of the rule extraction technique provides a useful measure of the efficiency of the process. It should be noted, however, that few authors in the surveys [2][3] reported or commented on this issue.

The portability criterion assesses ANN rule-extraction techniques in terms of the extent to which a given technique can be applied across a range of ANN architectures and training regimes. Currently there is a preponderance of techniques that might be termed specific purpose techniques, i.e. those where the rule extraction technique has been designed specifically to work with a particular ANN architecture. A rule extraction algorithm that is tightly coupled to a specific neural network architecture has limited use unless the architecture can be shown to be applicable to a broad cross section of problem domains.

3.1 The RULEX Technique

RULEX is a decompositional technique that extracts propositional rules of the form given in (1) above. As such the imperative is to be able to determine $[x_i^{lower}, x_i^{upper}]$ for each input dimension i of each local cluster. This section describes how these values can be determined. Equation (7) can be rewritten as

$$l(\mathbf{w}_i, \mathbf{r}, \mathbf{x}) = \sigma(k_i, (x_i - r_i + b_i)) - \sigma(k_i, (x_i - r_i - b_i)) \quad \ldots(10)$$

where $k_i = k_1 w_i$ and $b_i = 1/w_i$. Here $k_i$ represents the shape of an individual ridge and $b_i$ represents the width of the individual ridge. From equation (4) we see that the output of a local cluster unit is determined by the sum of the activations of all its component ridges. Therefore, the minimum possible activation of an individual ridge (the ith ridge, say) in a local cluster unit whose activation is barely greater than its classification threshold will occur when all ridges other than the ith ridge have maximum activation.


We define the functions min(•) and max(•) as the minimum and maximum values respectively of their function arguments.

$$\min(l(w_i, r_i, x_i)) = \max(l(w_b, r_b, x_b)) - \ln\!\left(\frac{1}{O_T} - 1\right)\Big/ k_2 \quad \ldots(11)$$

where $O_T$ is the activation threshold of the local cluster and $\max(l(w_b, r_b, x_b))$ is the maximum possible activation for any ridge function in the local cluster. As $k_2$ and $O_T$ are constants, and $\max(l(w_b, r_b, x_b))$ can be calculated, the value of the minimum activation of the ith ridge, $\min(l(w_i, r_i, x_i))$, can be calculated in a straightforward manner. See Appendix A for the derivation of equation (11).

Let $\alpha = \min(l(w_i, r_i, x_i))$, $m = e^{-(x_i - r_i)k_i}$, and $n = e^{-b_i k_i}$. From equations (10) and (11) we have

$$\alpha = \frac{1}{1 + mn} - \frac{1}{1 + m/n} \quad \ldots(12)$$

Let $p = (1 - \alpha)e^{b_i k_i}$ and $q = (\alpha + 1)e^{-b_i k_i}$. Solving equation (12) for m and back-substituting for m and n gives

$$x_i = r_i - \ln\!\left(\frac{p - q \pm \sqrt{p^2 + q^2 - 2(\alpha^2 + 1)}}{2\alpha}\right) k_i^{-1} \quad \ldots(13)$$

See Appendix B for the derivation of equation (13). Thus for the ith ridge function the extremities of the active range, $[x_i^{lower}, x_i^{upper}]$, are given by the expressions

$$x_i^{lower} = r_i - \frac{\beta_{lower}}{k_i} \quad \ldots(14)$$

$$x_i^{upper} = r_i + \frac{\beta_{upper}}{k_i} \quad \ldots(15)$$

where $\beta_{lower}$ is the negative root of the $\ln(\cdot)$ expression in equation (13) above and $\beta_{upper}$ is the positive root.
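The computation described by equations (10)-(15) can be sketched in a few lines of Python (our own illustration, not the authors' RULEX code; the names and example constants are assumptions):

```python
import numpy as np

def ridge_limits(w_i, r_i, k1, k2, o_t):
    """Active range [x_lower, x_upper] of one axis-parallel ridge, via equations (10)-(15).
    w_i, r_i: diagonal weight and centre for this dimension; o_t: cluster activation threshold."""
    k_i, b_i = k1 * w_i, 1.0 / w_i
    # Maximum possible activation of any ridge, attained at x_i = r_i.
    max_act = 1.0 / (1.0 + np.exp(-k1)) - 1.0 / (1.0 + np.exp(k1))
    # Minimum activation this ridge must contribute when the cluster sits at threshold (equation (11)).
    alpha = max_act - np.log(1.0 / o_t - 1.0) / k2
    p = (1.0 - alpha) * np.exp(b_i * k_i)
    q = (alpha + 1.0) * np.exp(-b_i * k_i)
    disc = np.sqrt(p ** 2 + q ** 2 - 2.0 * (alpha ** 2 + 1.0))
    # The two roots of the ln(.) argument in equation (13) give the two extremities of the range.
    roots = np.log(np.array([p - q + disc, p - q - disc]) / (2.0 * alpha))
    return r_i - roots.max() / k_i, r_i - roots.min() / k_i

# Example: a ridge centred at 0 with weight w_i = 3 and cluster threshold 0.4.
print(ridge_limits(w_i=3.0, r_i=0.0, k1=8.0, k2=4.0, o_t=0.4))
```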

3.2 Simplification of Extracted Rules

One of the main purposes of rule extraction from neural networks is to provide an explanation facility for the decisions made by the network. As such it is clearly important that the extracted rules be as comprehensible as possible. The directly extracted rule set may contain:

a) redundant rules;
b) individual rules with redundant antecedent condition(s); and
c) pairs of rules whose antecedent conditions can be combined.


Rule b is redundant and may be removed from the rule set if there exists a more general rule a such that

$$\forall\, 1 \le i \le n : [x_{bi}^{lower}, x_{bi}^{upper}] \subseteq [x_{ai}^{lower}, x_{ai}^{upper}]$$

A rule is also redundant and may be removed from the rule set if

$$\exists\, 1 \le i \le n : [X_i^{lower}, X_i^{upper}] \cap [x_i^{lower}, x_i^{upper}] = \emptyset$$

where $[X_i^{lower}, X_i^{upper}]$ represents the entire range of values in the ith input dimension. An antecedent condition is redundant and may be removed from a rule if

$$\exists\, 1 \le i \le n : [X_i^{lower}, X_i^{upper}] \subseteq [x_i^{lower}, x_i^{upper}]$$

Rules a and b may be merged on the antecedent for input dimension j if

$$\forall\, 1 \le i \le n : (i \ne j) \rightarrow ([x_{ai}^{lower}, x_{ai}^{upper}] = [x_{bi}^{lower}, x_{bi}^{upper}])$$

RULEX implements facilities for simplifying the directly extracted rule set in order to improve the comprehensibility of the rule set. The simplification is achieved without compromising the accuracy of the rule set.
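A minimal Python sketch of these three simplification steps, assuming rules are stored as lists of per-dimension intervals (the data structures and helper names are ours, not RULEX's):

```python
def is_subsumed(rule_b, rule_a):
    # Rule b is redundant if every antecedent interval of b lies inside the corresponding interval of a.
    return all(a_lo <= b_lo and b_hi <= a_hi
               for (b_lo, b_hi), (a_lo, a_hi) in zip(rule_b, rule_a))

def prune_antecedents(rule, domain):
    # Keep only antecedents that actually constrain their dimension (drop whole-range intervals).
    return {i: (lo, hi) for i, ((lo, hi), (d_lo, d_hi)) in enumerate(zip(rule, domain))
            if not (lo <= d_lo and d_hi <= hi)}

def try_merge(rule_a, rule_b):
    # Merge two rules of the same class that differ in exactly one dimension j;
    # the merged antecedent for j is the union of the two intervals.
    diffs = [i for i, (ia, ib) in enumerate(zip(rule_a, rule_b)) if ia != ib]
    if len(diffs) != 1:
        return None
    j = diffs[0]
    merged = list(rule_a)
    merged[j] = (min(rule_a[j][0], rule_b[j][0]), max(rule_a[j][1], rule_b[j][1]))
    return merged

# Two rules that agree everywhere except dimension 0 are merged into one.
r1 = [(0.0, 0.5), (1.0, 2.0)]
r2 = [(0.4, 0.9), (1.0, 2.0)]
print(is_subsumed([(0.1, 0.4), (1.2, 1.8)], r1), try_merge(r1, r2))
```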

4. Comparative Results

The restricted LC network has been applied to a variety of datasets available from the machine learning repository at Carnegie Mellon University. These datasets were selected to show the general applicability of the network. The datasets contain missing values, noisy data, continuous and discrete valued attributes, a mixture of high and low dimensionality, and a variety of binary and multi-class classification tasks. Table 1 below summarises the problem domains used in this study.

Domain                      Cases  Classes  Attributes  Continuous  Discrete  Missing values
annealing processes           898     6        38          yes        yes        yes
auto insurance                205     6        25          yes        yes        yes
breast cancer (Wisconsin)     699     2         9          no         yes        yes
horse colic                   368     2        22          yes        yes        yes
credit screening (Aus)        690     2        15          yes        yes        yes
Pima diabetes                 768     2         8          yes        no         no
glass identification          214     6         9          yes        no         no
heart disease (Cleveland)     303     2        13          yes        yes        yes
heart disease (Hungarian)     294     2        13          yes        yes        yes
hepatitis prognosis           155     2        19          yes        yes        yes
iris classification           150     3         4          yes        no         no
labor negotiations             57     2        16          yes        yes        yes
sick euthyroid               3772     2        29          yes        yes        yes
sonar classification          208     2        60          yes        no         no

Table 1 – Summary of problem domains used in the study


A variety of methods, including Linear Regression, Leave-One-Out Nearest Neighbour (LOONN), Cross Validation Nearest Neighbour (XVNN), and C4.5, were chosen to provide comparative results for the restricted LC network. Linear regression was used to obtain a baseline for comparison. The nearest neighbour methods were chosen because they are simple forms of local function classifiers. C4.5 was chosen because it is widely used in machine learning as a benchmarking tool. Further, C4.5 is an example of an 'axis parallel' classifier and as such it provides an ideal comparison for the LC network. Ten-fold cross validation results are shown in Table 2 below. Figures quoted are average percentage error rates.

Domain                      LOONN   XVNN   C4.5     LC    RULEX
annealing processes          11.9   10.9   7.67   2.51   16.23
auto insurance               17.6   16.5   17.7   15.6   27
breast cancer (Wisconsin)     4.4    4.7   5.26   3.15    5.72
horse colic                  19.3   18.9   15     13.5   14.1
credit screening (Aus)       19.5   19     14.7   12.9   15.65
Pima diabetes                29.6   29.3   25.4   22.64  27.35
glass identification         30.4   30.8   32.5   39.05    –
heart disease (Cleveland)    23.8   23.8   23     15.8   19.8
heart disease (Hungarian)    23.6   22.8   21.5   14.6   18.7
hepatitis prognosis          23.8   24.7   20.4   16.2   21.3
iris classification           4.7    4     4.8    4.67    6
labor negotiations           19.1   19.3   19.1   8.3    12.3
sick euthyroid                3.9    3.7   1.34   7.8     7.58
sonar classification         13.2   13.9   25.6   15.38  21.52

Linear Regression (baseline) error rates: 77.5  4.2  53.1  50  22.4  37.6  52.7  40.3  44  16.7  32  22.5

Table 2 – Summary of results for selected problem domains and methods

These results show that in the majority of cases the restricted LC network produces results that are at least comparable to those obtained by C4.5; indeed, on 12 of the 14 datasets studied the LC network produces better results than C4.5. Further, the results also show that RULEX, even though its primary purpose is explanation rather than classification, is able to extract accurate rules from the trained network, i.e. rules that provide a high degree of accuracy when used to classify previously unseen examples.

5. RULEX and the ADT Taxonomy

This section places the RULEX algorithm into the classification framework of the ADT taxonomy presented in section 3.

a) Rule Format

From (1) it can be seen that RULEX extracts propositional rules. In the directly extracted rule set each rule contains an antecedent condition for each input dimension as well as a rule consequent which describes the output class covered by the rule. As mentioned in section 3.2, RULEX provides a rule simplification process which removes redundant rules and antecedent conditions from the directly extracted rules. The reduced rule set contains


rules that consist of only those antecedents that are actually used by the trained LC network in discriminating between input patterns.

b) Rule Quality

As stated previously, the prime function of rule extraction algorithms such as RULEX is to provide an explanation facility for the trained network. The rule quality criteria provide insight into the degree of trust that can be placed in the explanation. Rule quality is assessed according to the accuracy, fidelity, consistency and comprehensibility of the extracted rules. Table 3 below presents data that allows a quantitative measure to be applied to each of these criteria.

Domain                      LC Error  RULEX Error  Local Clusters  Rules  Antecedents per Rule  Fidelity
annealing processes            2.5%      18.2%           16          16            20            83.9%
auto insurance                15.6%      27.0%           60          57            13            86.5%
breast cancer (Wisconsin)      3.2%       5.6%            5           5            24            97.5%
horse colic                   13.5%      14.1%            5          2.5            8            99.3%
credit screening (Aus)        12.9%      15.7%            2           2             5            96.8%
Pima diabetes                 22.6%      27.4%            5           5             5            93.9%
glass identification          39.1%      42.5%           22          19             6            94.3%
heart disease (Cleveland)     15.8%      19.8%            4           3             5            95.3%
heart disease (Hungarian)     14.6%      18.7%            3           2             5            95.2%
hepatitis prognosis           16.2%      21.3%            6           4             8            93.9%
iris classification            4.7%       6.0%            3           3             3            98.6%
labor negotiations             8.3%      12.3%            2           2             7            95.6%
sick euthyroid                 7.8%       7.6%            4           4             5            99.7%
sonar classification          15.4%      21.5%            4           3             8            92.7%

Table 3 – Rule Quality Assessment

i) Accuracy

Despite the mechanism employed to avoid local cluster units 'overlapping' during network training (see equation (9)), it is clear that there is some degree of interaction between local cluster units. (The larger the values of the parameters k1 and k2, the less the interaction between units but the slower the network training.) This effect becomes more apparent in problem domains with high dimension input space and in network solutions involving large numbers of local cluster units. Further, RULEX approximates the hyper-ellipsoidal local cluster functions of the LC network with hyper-rectangles. It is therefore not surprising that the classification accuracy of the extracted rules is less than that of the underlying network. It should be noted, however, that while the accuracy figures quoted for RULEX are worse than those of the LC network, they are comparable to those obtained from C4.5.

ii) Fidelity

Fidelity is closely related to accuracy, and the factors that affect accuracy, viz. interaction between units and approximation of hyper-ellipsoids by hyper-rectangles, also affect the fidelity of the rule sets. In general, the rule sets


extracted by RULEX display an extremely high degree of fidelity with the LC networks from which they were drawn.

iii) Consistency

Rule extraction algorithms that generate rules by querying the trained neural network with patterns drawn randomly from the problem domain [16][20] have the potential to generate a variety of different rule sets from any given training run of the neural network. Such algorithms have the potential for low consistency. RULEX, on the other hand, is a deterministic algorithm that always generates the same rule set from any given training run of the LC network. Hence RULEX always exhibits 100% consistency.

iv) Comprehensibility

In general, comprehensibility is inversely related to the number of rules and to the number of antecedents per rule. The LC network is based on a greedy, covering algorithm. Hence its solutions are achieved with relatively small numbers of training iterations and are typically compact, i.e. the trained network contains only a small number of local cluster units. Given that RULEX converts each local cluster unit into a single rule, the extracted rule set contains, at most, the same number of rules as there are local cluster units in the trained network. The rule simplification procedures built into RULEX potentially reduce the size of the rule set and ensure that only significant antecedent conditions are included in the final rule set. This leads to extracted rules with as high comprehensibility as is possible.

c) Translucency

RULEX is distinctly decompositional in that rules are extracted at the level of the hidden layer units. Each local cluster unit is treated in isolation, with the local cluster weights being converted directly into a rule.

d) Algorithmic Complexity

Golea [18][19] showed that, in many cases, the computational complexity of extracting rules from trained ANNs and the complexity of extracting the rules directly from the data are both NP-hard. Hence the combination of ANN learning and ANN rule-extraction potentially involves significant additional computational cost over direct rule-learning techniques. Table 4 in appendix C gives an outline of the RULEX algorithm. Table 5 in appendix C expands the individual modules of the algorithm. From these descriptions it is clear that the majority of the modules are linear in the number of local clusters (or rules) and the number of input dimensions, O(lc × n). The modules associated with rule simplification are, at worst, polynomial in the number of rules, O(lc²). RULEX is therefore computationally efficient and has some significant advantages over rule extraction algorithms that rely on a (potentially exponential) 'search and test' strategy [10][17]. Thus the use of RULEX to provide an explanation facility adds little in the way of overhead to the neural network learning phase.

e) Portability

RULEX is non-portable, having been specifically designed to work with local cluster (LC) neural networks. This means that it cannot be used as a general purpose device for


providing an explanation component for existing, trained neural networks. However, as has been shown in the results presented in section 4, the LC network is applicable to a broad range of problem domains (including continuous valued and discrete valued domains, and domains which include missing values). Hence RULEX is also potentially applicable to a broad variety of problem domains.

Conclusion

This paper has described the restricted form of the LC local cluster neural network and the associated RULEX algorithm that can be used to provide an explanation facility for the trained network. Results were given for the LC network which show that the network is applicable across a broad spectrum of problem domains and produces results that are at least comparable to, and in many cases better than, C4.5, an accepted benchmark standard in machine learning. The RULEX algorithm has been evaluated in terms of the guidelines laid out for rule extraction techniques. RULEX has been shown to be a decompositional technique capable of extracting accurate and comprehensible propositional rules. Further, rule sets produced by RULEX show high fidelity with the network from which they were extracted. RULEX has been shown to be computationally efficient. Its main drawback is that it is not a portable technique, as it has been designed specifically to work with trained LC networks.

Bibliography

[1] S. Geva, K. Malmstrom & J. Sitte, Local Cluster Neural Net: Architecture, Training and Applications, Neurocomputing 20 (1998) 35-56.

[2] R. Andrews, A.B. Tickle & J. Diederich, A Survey and Critique of Techniques for Extracting Rules From Trained Artificial Neural Networks, Knowledge Based Systems 8 (1995) 373-389.

[3] A.B. Tickle, R. Andrews, M. Golea & J. Diederich, The Truth Will Come to Light: Directions and Challenges in Extracting the Knowledge Embedded Within Trained Artificial Neural Networks, IEEE Trans. on Neural Networks 9:6 (1998) 1057-1068.

[4] V. Tresp, J. Hollatz & S. Ahmad, Network Structuring and Training Using Rule-Based Knowledge, Advances in Neural Information Processing Systems (NIPS*6) (1993) 871-878.

[5] M. Berthold & K. Huber, From Radial to Rectangular Basis Functions: A New Approach for Rule Learning from Large Datasets, Technical Report 15-95, University of Karlsruhe (1995).

[6] M. Berthold & K. Huber, Building Precise Classifiers with Automatic Rule Extraction, Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia (1995), Vol 3, 1263-1268.


[7] S. Abe & M.S. Lan, A Method for Fuzzy Rules Extraction Directly From Numerical Data and its Application to Pattern Classification, IEEE Trans. on Fuzzy Systems, Vol 3 No 1 (Feb 1995), 18-28.

[8] W. Duch, R. Adamczak & K. Grabczewski, Neural Optimisation of Linguistic Variables and Membership Functions, Proceedings of the 6th International Conference on Neural Information Processing (ICONIP'99), Perth, Australia (1999), Vol II, 616-621.

[9] G.A. Carpenter & A.W. Tan, Rule Extraction: From Neural Architecture to Symbolic Representation, Connection Science Vol 7 No 1 (1995), 3-27.

[10] R. Krishnan, A Systematic Method for Decompositional Rule Extraction From Neural Networks, Proceedings of the NIPS*97 Rule Extraction From Trained Artificial Neural Networks Workshop, Queensland University of Technology (1996), 38-45.

[11] F. Maire, A Partial Order for the M-of-N Rule Extraction Algorithm, IEEE Transactions on Neural Networks Vol 8 No 6 (1997), 1542-1544.

[12] Y. Hayashi, A Neural Expert System with Automated Extraction of Fuzzy If-Then Rules and its Application to Medical Diagnosis, Advances in Neural Information Processing Systems (NIPS*3) (1990), 578-584.

[13] S. Horikawa, T. Furuhashi & Y. Uchikawa, On Fuzzy Modeling Using Fuzzy Neural Networks with the Back-propagation Algorithm, IEEE Transactions on Neural Networks 3:5 (September 1992), 801-806.

[14] K. Saito & R. Nakano, Law Discovery Using Neural Networks, Proceedings of the NIPS*96 Rule Extraction From Trained Artificial Neural Networks Workshop, Queensland University of Technology (1996), 62-69.

[15] L. Giles & C. Omlin, Rule Revision with Recurrent Networks, IEEE Transactions on Knowledge and Data Engineering Vol 8 No 1 (1996), 183-197.

[16] M. Craven, Extracting Comprehensible Models From Trained Neural Networks, PhD Thesis, University of Wisconsin, Madison, Wisconsin (1996).

[17] G. Towell & J. Shavlik, The Extraction of Refined Rules From Knowledge Based Neural Networks, Machine Learning, Vol 13 (1993), 71-101.

[18] M. Golea, On The Complexity Of Rule Extraction From Neural Networks And Network Querying, Proceedings of the Rule Extraction From Trained Artificial Neural Networks Workshop, Society for the Study of Artificial Intelligence and Simulation of Behaviour Workshop Series (AISB'96), University of Sussex, Brighton, UK (April 1996), 51-59.


[19] M. Golea, On The Complexity Of Extracting Simple Rules From Trained Neural Nets, to appear (1997).

[20] S. Thrun, Extracting Provably Correct Rules From Artificial Neural Networks, Technical Report IAI-TR-93-5, Institut für Informatik III, Universität Bonn, Germany (1994).

Appendix A – Derivation of Minimum Ridge Activation

Let $L = l(w_i, r_i, x_i)$.

$$O_T = \sigma(k_2, (n-1)\max(L) + \min(L) - d) \quad \ldots(16)$$

Now $d = n\,\max(L)$, so

$$O_T = \sigma(k_2, \min(L) - \max(L)) \quad \ldots(17)$$

$$O_T = \frac{1}{1 + e^{-k_2(\min(L) - \max(L))}} \quad \ldots(18)$$

$$\frac{1}{O_T} - 1 = \frac{e^{-k_2 \min(L)}}{e^{-k_2 \max(L)}} \quad \ldots(19)$$

$$\left(\frac{1}{O_T} - 1\right) e^{-k_2 \max(L)} = e^{-k_2 \min(L)} \quad \ldots(20)$$

$$\ln\!\left(\frac{1}{O_T} - 1\right) - k_2 \max(L) = -k_2 \min(L) \quad \ldots(21)$$

$$\min(L) = \max(L) - \ln\!\left(\frac{1}{O_T} - 1\right) \Big/ k_2 \quad \ldots(22)$$
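As a quick numerical sanity check of equation (22) (ours; the constants are arbitrary), substituting min(L) back into equation (17) should recover the threshold O_T:

```python
import math

def sigma(k, h):
    return 1.0 / (1.0 + math.exp(-k * h))

k1, k2, o_t = 8.0, 4.0, 0.4
max_l = sigma(k1, 1.0) - sigma(k1, -1.0)          # maximum ridge activation
min_l = max_l - math.log(1.0 / o_t - 1.0) / k2    # equation (22)

# Equation (17): with all other ridges at max_l and this ridge at min_l, the cluster sits at O_T.
print(sigma(k2, min_l - max_l))                    # prints approximately 0.4
```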

Appendix B – Derivation of Range of Activation for the ith Ridge Function

Let $\alpha = \min(l(w_i, r_i, x_i))$, $m = e^{-(x_i - r_i)k_i}$, and $n = e^{-b_i k_i}$. From equations (10) and (11) we have

$$\alpha = \sigma(k_i, mn) - \sigma(k_i, m/n) \quad \ldots(23)$$

$$\alpha = \frac{1}{1 + mn} - \frac{1}{1 + m/n} \quad \ldots(24)$$

$$\alpha = \frac{(1 + m/n) - (1 + mn)}{(1 + mn)(1 + m/n)} \quad \ldots(25)$$

$$\alpha(1 + mn)(1 + m/n) = (1 + m/n) - (1 + mn) \quad \ldots(26)$$

$$[\alpha(1 + m/n) + 1](1 + mn) = (1 + m/n) \quad \ldots(27)$$

$$\left(\alpha + \frac{\alpha m}{n} + 1\right)(1 + mn) = (1 + m/n) \quad \ldots(28)$$

$$\alpha m^2 + (\alpha + 1)mn + (\alpha - 1)m/n + \alpha = 0 \quad \ldots(29)$$

$$\alpha m^2 + \left[(\alpha + 1)n + \frac{\alpha - 1}{n}\right]m + \alpha = 0 \quad \ldots(30)$$

Let $a = \alpha$, $b = [(\alpha + 1)n + (\alpha - 1)/n]$, and $c = \alpha$. Solving for m gives roots at

$$m = \frac{(1 - \alpha)/n - (\alpha + 1)n \pm \sqrt{(\alpha + 1)^2 n^2 + 2(\alpha + 1)(\alpha - 1) + ((\alpha - 1)/n)^2 - 4\alpha^2}}{2\alpha} \quad \ldots(31)$$

$$m = \frac{(1 - \alpha)/n - (\alpha + 1)n \pm \sqrt{((\alpha - 1)/n)^2 + (\alpha + 1)^2 n^2 - 2(\alpha^2 + 1)}}{2\alpha} \quad \ldots(32)$$

Now $n = e^{-b_i k_i}$. Substituting for n into (32) gives

$$m = \frac{(1 - \alpha)e^{b_i k_i} - (\alpha + 1)e^{-b_i k_i} \pm \sqrt{(\alpha - 1)^2 e^{2 b_i k_i} + (\alpha + 1)^2 e^{-2 b_i k_i} - 2(\alpha^2 + 1)}}{2\alpha} \quad \ldots(33)$$

Let $p = (1 - \alpha)e^{b_i k_i}$ and $q = (\alpha + 1)e^{-b_i k_i}$. Then

$$m = \frac{p - q \pm \sqrt{p^2 + q^2 - 2(\alpha^2 + 1)}}{2\alpha} \quad \ldots(34)$$

Now $m = e^{-(x_i - r_i)k_i}$. This gives the following as an expression for $x_i$:

$$x_i = r_i - \frac{\ln(m)}{k_i} \quad \ldots(35)$$
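A brief numerical check of the derivation (ours; the constants are arbitrary): each root of equation (34), substituted back into equation (24), should reproduce α:

```python
import math

alpha, b_i, k_i = 0.3, 0.5, 6.0
n = math.exp(-b_i * k_i)
p = (1.0 - alpha) * math.exp(b_i * k_i)
q = (alpha + 1.0) * math.exp(-b_i * k_i)
disc = math.sqrt(p**2 + q**2 - 2.0 * (alpha**2 + 1.0))

for m in ((p - q + disc) / (2.0 * alpha), (p - q - disc) / (2.0 * alpha)):
    # Each root should satisfy equation (24): alpha = 1/(1+mn) - 1/(1+m/n).
    print(1.0 / (1.0 + m * n) - 1.0 / (1.0 + m / n))   # both print 0.3 (up to rounding)
```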

Appendix C – Algorithmic Complexity of RULEX

rulex()
{
    create_data_structures();
    create_domain_description();
    for each local cluster
        for each ridge function
            calculate_ridge_limits();
    while redundancies remain
    {
        remove_redundant_rules();
        remove_redundant_antecedents();
        merge_antecedents();
    } endwhile;
    feed_forward_test_set();
    display_rule_set();
} //end rulex

Table 4 – The RULEX algorithm


remove_redundant_rules()
{
    for each rule a
        for each other rule b
        {
            OKtoremove = false;
            for each dimension i
                if ([x_bi_lower, x_bi_upper] ⊆ [x_ai_lower, x_ai_upper]) OR
                   ([X_i_lower, X_i_upper] ∩ [x_bi_lower, x_bi_upper] = ∅) then
                    OKtoremove = true;
            if OKtoremove then remove_rule_b();
        }
} //end remove_redundant_rules

remove_redundant_antecedents()
{
    for each rule
        for each dimension i
            if [X_i_lower, X_i_upper] ⊆ [x_i_lower, x_i_upper] then
                remove_antecedent_i();
} //end remove_redundant_antecedents

merge_antecedents()
{
    for each rule a
        for each other rule b
            for each dimension j
            {
                OKtomerge = true;
                for each other dimension i (i ≠ j)
                    if NOT ([x_ai_lower, x_ai_upper] = [x_bi_lower, x_bi_upper]) then
                        OKtomerge = false;
                if OKtomerge then
                {
                    [x_aj_lower, x_aj_upper] = [x_aj_lower, x_aj_upper] ∪ [x_bj_lower, x_bj_upper];
                    remove_rule_b();
                }
            }
} //end merge_antecedents


feed_forward_test_set()
{
    errors = 0; correct = 0;
    for each pattern in the test set
        for each rule
        {
            classified = true;
            for each dimension i
                if test_i ∉ [x_i_lower, x_i_upper] then classified = false;
            if classified ∧ (test_target = rule_class_label) then ++correct;
            else ++errors;
        }
} //end feed_forward_test_set

display_rule_set()
{
    for each rule
    {
        for each dimension i
            write([x_i_lower, x_i_upper]);
        write(rule_class_label);
    }
} //end display_rule_set

Table 5 – Modules of the RULEX algorithm

