Inductive knowledge acquisition using feedforward neural networks with rule-extraction

Richi Nayak

Joachim Diederich

Frederic Maire

Machine Learning Research Centre, Queensland University of Technology, Brisbane Qld 4001, Australia. [email protected]

Abstract. At the present state of the art, generating a rule base is one of the main challenges in the area of knowledge-based systems. The present work attempts to automate parts of the process of knowledge acquisition by using neural networks with rule extraction techniques. This paper presents a methodology composed of four phases to generate a representation formalism based on quantified rules of n-ary predicates, facts and a type hierarchy. Predicate rules extracted from neural networks have been used successfully to initialise the SHRUTI reasoning system. The automated knowledge base enables greater explanatory capabilities by allowing user interaction. Moreover, empirical results demonstrate that these predicate rules extracted from neural networks have high accuracy.

1 Introduction
The exercise of knowledge acquisition, where a domain expert's knowledge is extracted and encoded into an expert system, has developed into the new field of knowledge engineering. The knowledge of an expert system is often represented as rules and implemented via knowledge engineering methods. The task of inducing rules from classified examples, so that it can be mastered by computers instead of human experts, is in progress within the machine learning field. Neural networks are a powerful general-purpose tool applied to machine learning tasks such as classification, prediction and clustering, and have been successfully applied to many decision-support applications [9]. Their ability to learn and generalise from data, which mimics the human capability to learn from experience, makes neural networks useful for automating the process of knowledge acquisition. A recognised drawback of neural networks is their inability to explain their decision process in a comprehensible form. One of the most promising approaches to overcoming this problem is to translate the incomprehensible form of numerical weights (the knowledge stored in the network) into symbolic rules. Rule extraction from neural networks can help to explain their behaviour and also facilitates the transfer of learning. Several effective algorithms have been developed in the past few years to extract rules from a trained neural network [1].

Most of these algorithms have concentrated on representing the extracted knowledge in a propositional logic format. Previous studies have shown that predicate rules (with variables) have more expressive power and better understandability than propositional (variable-free) rules [6]. Not much work has been done on predicate rule representation using a connectionist approach. Neural networks with rule extraction techniques can improve the process of knowledge acquisition by generating a rule base (the core of an expert system). In this paper, we present a methodology composed of four phases to extract predicate rules suitable for SHRUTI [8]. The process starts with the construction and training of a feed-forward neural network using the cascade algorithm [2] for the inductive acquisition of concepts from examples. A clustering-based pruning algorithm is developed and applied to the network after training. In the second phase, the pruning process removes irrelevant nodes and redundant links from the network. The third phase extracts rules from the pruned network into an equivalent set of DNF (disjunctive normal form) expressions using two existing techniques: LAP, at the level of individual hidden and output units [4], and RuleVI, which characterises the target concept directly in terms of the inputs [3]. In the last phase, the DNF expressions are generalised and translated into predicate rules, which form a knowledge base suited to the SHRUTI representation formalism. The automated knowledge base is used for forward and backward reasoning in the knowledge-based system SHRUTI and enables greater explanatory capabilities by allowing user interaction. The First-monk [10] problem domain is used to demonstrate the whole process in detail, and then several real-world domains are considered to demonstrate the effectiveness of the methodology. The empirical results show that pruning a trained network helps to generate a compact rule set by considering only a low-dimensional search space. The results also show that predicate rules can be obtained by a connectionist approach with high accuracy, and that a knowledge base can be successfully automated to reason with an inference tool.

2 A Methodology to generate predicate rules
The four phases to learn and generate rules with variables and n-ary predicates from a neural network are summarised in Table 1 and discussed in the following sections.
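As a rough orientation, the four phases can be viewed as the pipeline sketched below. This is an illustrative Python sketch only; the phase implementations are passed in as callables and all names are placeholders, not the authors' code.

```python
from typing import Callable, Sequence

def knowledge_acquisition_pipeline(
    train: Callable, prune: Callable, lap: Callable, rulevi: Callable,
    generalise: Callable, train_data: Sequence, query_instances: Sequence):
    """Sketch of the four-phase methodology of Table 1 (placeholder signatures)."""
    network = train(train_data)                            # Phase 1: cascade training
    pruned = prune(network, train_data)                    # Phase 2: remove redundant links/nodes
    dnf = lap(pruned) + rulevi(pruned, query_instances)    # Phase 3: per-unit and whole-network DNF
    return generalise(dnf)                                 # Phase 4: predicate rules, facts, type hierarchy
```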

2.1 Phase 1: Neural network training
The present approach is, in general, independent of the underlying feedforward network architecture; however, we utilise the cascade correlation algorithm [2] to build a network for a given problem. The initial network starts with an input and an output layer and no hidden units. Cascade correlation constructs a network by initially training the output unit to approximate the target function. When the training stagnates, a pool of candidate units is trained, with connections from all inputs and the previously inserted candidate units, to predict

Table 1. A methodology to generate a formalism based on predicate rules.
Phase 1. Select and train a network until it reaches the minimum validation error.
Phase 2. Prune the network to remove redundant links and nodes.
Phase 3. Extract the knowledge from the network into a set of DNF expressions:
  3.1 apply LAP at the level of each individual hidden and output unit;
  3.2 apply RuleVI to characterise the target concept directly in terms of the inputs.
Phase 4. Translate the propositional rules into predicate rules, and generate a representation consisting of a type hierarchy, facts and the predicate rules.

the network error. When the training of the candidate units stagnates, the best unit, i.e. the one that reduces the error the most, is inserted into the network. A connection is added from the inserted unit to the output unit. The weights coming from the inputs to the inserted unit are then frozen, and the training of the output unit, with connections to the input units and all the inserted hidden units, is repeated. This process continues until an acceptable overall network is achieved. To encourage the pruning of nodes and links, a few soft constraints are imposed upon the network during training. The idea is to prune out the small weights during training without much effect on the large weights. This is done by adding a penalty term to the error function during training, i.e. associating a cost with each connection in the network. The modified cost function [12] is the sum of two terms:

E(S, w) = \sum_{p \in S} \left( \mathrm{target}_p - \mathrm{actual\ output}_p \right)^2 + \lambda \sum_{i,j} \frac{w_{ij}^2}{1 + w_{ij}^2}

The first term is the standard sum squared error over the set of examples S. The second term describes a cost for each weight in the network and acts as a complexity term. The cost is small when w_{ij} is close to zero and approaches unity (times λ) as the weight grows. Initially λ is set to zero and is gradually increased in small steps. The learning rule then updates the weights according to the gradient of the modified cost function with respect to the weights. The updated weights are:

w_{ij} = w_{ij} + \Delta w_{ij} - \mathrm{decay\ term}_{ij}, \qquad \mathrm{decay\ term}_{ij} = \lambda \, \frac{2 w_{ij}}{(1 + w_{ij}^2)^2}

The decay term allows the smaller weights to decay faster than the larger weights. In addition, all weights except the bias weights are constrained to two digits after the decimal point. As in cascade correlation (with the sigmoid activation function), the unit output is constrained to a small interval. These constraints ensure that small weights are already driven to zero during training and limit the search space.
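As a rough illustration of the penalty and decay-term update described above (a sketch only, not the authors' implementation; the use of NumPy arrays and the uniform rounding are assumptions):

```python
import numpy as np

def weight_elimination_cost(targets, outputs, W, lam):
    """Sum-squared error plus the weight-elimination penalty of [12]."""
    sse = np.sum((targets - outputs) ** 2)
    penalty = np.sum(W ** 2 / (1.0 + W ** 2))
    return sse + lam * penalty

def update_weights(W, delta_W, lam):
    """Gradient step plus the decay term for each weight w_ij."""
    decay = lam * 2.0 * W / (1.0 + W ** 2) ** 2
    W_new = W + delta_W - decay
    # The paper keeps non-bias weights to two decimal places during training;
    # here everything is rounded for simplicity.
    return np.round(W_new, 2)
```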

2.2 Phase 2: Neural network pruning
In order to efficiently extract rules from a neural network, heuristics are needed to guide the extraction process. An important consideration in guiding the search is the quality of the attributes. When a large number of attributes is involved in the problem domain, some relevant attributes may become redundant in the presence of other attributes. Reducing the number of attributes not only speeds up the extraction process, but also prevents the generation of an inferior rule set caused by the presence of many irrelevant attributes.

Table 2. The pruning algorithm. A metric-distance-based clustering method is applied to find the best n-way partition such that the metric difference among all the elements in a cluster is less than the distance measure provided by the user.
1. For each non-input neuron in the network:
   1.1 Group the network's links of similar weight into clusters.
   1.2 For each cluster:
       1.2.1 Set the weight of each element to the average weight of that cluster.
       1.2.2 Test the cluster's magnitude against the bias weight; if the bias is larger than the cluster's magnitude, mark the cluster as unnecessary.
   1.3 Sequentially present all the training examples to the network. For each cluster:
       - set all the relevant weights to zero;
       - label the cluster as unnecessary if there is no qualitative change in the network's prediction and the accuracy of the network remains high;
       - restore all the relevant weights to their pre-set values.
2. For each non-output neuron n_i in the network:
   2.1 Delete all the links labelled as irrelevant.
   2.2 If all the connections of n_i are labelled as irrelevant, delete n_i.
   2.3 Remove n_i if it is a hidden node with no outgoing links.
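A minimal, hypothetical sketch of the per-neuron part of this procedure is given below. The weight-dictionary representation, the `predict` callback and the clustering-by-gap heuristic are assumptions for illustration, not the authors' code.

```python
import numpy as np

def prune_neuron_links(weights, bias, examples, predict, max_gap=0.1):
    """Sketch of the per-neuron pruning of Table 2 (illustrative names only).

    weights  : dict mapping link ids to weight values for one non-input neuron
    bias     : the neuron's bias weight
    examples : training examples used to check the network's predictions
    predict  : callback that re-runs the whole network with the given weights
    max_gap  : user-supplied distance measure for clustering similar weights
    """
    # Step 1.1: greedily group links of similar weight into clusters.
    links = sorted(weights, key=lambda l: weights[l])
    clusters, current = [], [links[0]]
    for link in links[1:]:
        if abs(weights[link] - weights[current[-1]]) <= max_gap:
            current.append(link)
        else:
            clusters.append(current)
            current = [link]
    clusters.append(current)

    baseline = predict(weights, examples)
    unnecessary = []
    for cluster in clusters:
        # Step 1.2.1: replace each weight by the cluster average.
        avg = float(np.mean([weights[l] for l in cluster]))
        for l in cluster:
            weights[l] = avg
        # Step 1.2.2: a cluster dominated by the bias cannot decide the activation.
        if abs(avg) * len(cluster) < abs(bias):
            unnecessary.append(cluster)
            continue
        # Step 1.3: zero the cluster and check for qualitative changes.
        for l in cluster:
            weights[l] = 0.0
        if np.array_equal(predict(weights, examples), baseline):
            unnecessary.append(cluster)
        for l in cluster:
            weights[l] = avg  # restore the pre-set (averaged) values
    return unnecessary
```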

In order to obtain a compact set of rules, the dimension of the input space is reduced whenever possible by pruning, using a heuristically guided decompositional rule extraction technique similar to the MofN algorithm [11]. The pruning algorithm removes network connections according to the magnitude of the weights. Links with sufficiently low weights (less than the threshold) are deemed inconsequential, as they are not decisive for the neuron's activation state. Consequently, low-weighted links are not retained for the symbolic rules and can be removed from the network. The precise steps to prune the irrelevant links and redundant nodes in a network are summarised in Table 2. Only the small weights that do not affect the network performance are removed during pruning. The network also maintains high accuracy during the elimination of each individual cluster. As a result, only a few epochs are needed to retrain the remaining links. The pruned network is further trained using quickprop [2]. Finally, the set of trained weights, a subset of

the training examples (consisting of the remaining attributes) and another set of instances (query instances) are recorded for the next phase.

2.3 Phase 3: Propositional rule extraction
Andrews, Diederich & Tickle [1] developed an overall taxonomy for categorising techniques to extract rules from neural networks and proposed a total of five primary criteria. The second classification criterion, the translucency of the view taken within the rule extraction technique, reflects the relationship between the extracted rules and the internal architecture of the underlying neural network. This criterion comprises two basic categories, decompositional and pedagogical rule extraction techniques, and a third, labelled eclectic, which combines elements of the two basic categories. The presented methodology considers both the decompositional and the pedagogical category to extract propositional rules.

Decompositional rule extraction technique - LAP. To obtain a set of DNF expressions equivalent to each individual node in a network, we use a decompositional rule extraction method called LAP. The distinguishing feature of decompositional approaches is the maximum level of granularity, i.e. the network is viewed as a set of discrete hidden and output nodes. The aim is to extract rules at the level of each individual hidden and output node; these are then aggregated to form a composite rule base that describes the network. LAP uses a recursive function that tests the sum of the largest weights from each set of attributes against the bias of each non-input unit. If the sum of the weights is larger than the bias, LAP constructs a separate case of weights for each set of attributes, excluding the largest weight from a single set at a time. The process is repeated until the node fails to fire (the bias is larger than the sum of the weights). The inputs corresponding to the weights that cause the node to have an output higher than the bias then form the basis for a DNF expression.

Pedagogical rule extraction technique - RuleVI. To obtain a set of DNF expressions equivalent to the whole network, we use a pedagogical rule extraction method called RuleVI. In a pedagogical approach, the network is treated as a black box. The rule extraction task is viewed as a learning task where the target concept is the function computed by the network and the input features are simply the network's input features. The objective is to find rules that map inputs directly into outputs. RuleVI generates a rule set by repeatedly querying a trained neural network and examining the network's responses. RuleVI utilises VI analysis to pose partially specified queries that test a proposed rule.

2.4 Phase 4: Rule generalisation
Our motivation for considering predicate rules over propositional (variable-free) rules is the greater expressiveness of the former. Predicate rules allow us to

learn general rules as well as the internal relationships among the variables. The difference can be illustrated by the following example. Suppose the task is to learn the target concept wife(x,y), defined over pairs of people x and y. From one positive example, (Name = Mary, Married_to = John, Sex = Female, Wife = True), a propositional rule learner will learn the specific rule:
If (Name = Mary) ∧ (Married_to = John) ∧ (Sex = Female) Then (Wife = True).
A program that allows quantification in rules will instead learn the general rule:
If Married(x,y) ∧ Female(x) Then Wife(x,y),
where x and y are variables that can be bound to any person, such as binding x to Mary and y to John in the above case. The process of converting specific rules into general rules (consisting of variables and n-ary predicates) is explained below.
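Before the detailed worked example that follows, the core generalisation step can be sketched as below. The representation (an ordered attribute-to-value mapping and string-built predicates) is purely illustrative and not the authors' procedure.

```python
def generalise_fact(predicate_name, attribute_values):
    """Turn one instantiated fact into a generic (quantified) predicate.

    attribute_values: ordered mapping from attribute name to the concrete
    value that appears in a DNF expression.  Names are illustrative only.
    """
    fact = f"{predicate_name}({', '.join(attribute_values.values())})"
    variables = [f"X{i + 1}" for i in range(len(attribute_values))]
    generic = f"{predicate_name}({', '.join(variables)})"
    return fact, generic

# Each DNF conjunct contributes one value per attribute; the fact keeps the
# values, while the generic predicate replaces them by variables (written
# X, Y, Z in the text below).
fact, generic = generalise_fact(
    "hidden1_predicate1",
    {"Head_shape": "square", "Body_shape": "octagon", "Jacket_color": "not-red"})
print(fact)     # hidden1_predicate1(square, octagon, not-red)
print(generic)  # hidden1_predicate1(X1, X2, X3)
```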

An example. In this section, the rule generalisation process is demonstrated in detail with the first monk problem [10]. The monk1 problem is relatively simple (linearly separable) but will suffice to illustrate the methodology. Each pattern is described by six attributes with their corresponding values:
Head_shape ∈ {round, square, octagonal}
Body_shape ∈ {round, square, octagonal}
Is_smiling ∈ {yes, no}
Is_holding ∈ {sword, balloon, flag}
Jacket_color ∈ {red, yellow, green, blue}
Has_tie ∈ {yes, no}
The target is to identify the robots whose head shape is the same as their body shape or that are wearing a red jacket. The training set consists of 123 selected patterns; the remaining 309 patterns were used for testing. We use LAP to illustrate the generalisation approach, as it extracts rules for each unit. The approach for RuleVI is very similar, except that the process is applied to only one unit, i.e. the output unit. LAP is applied to the network trained on the monk1 data set, and a set of DNF expressions is extracted for a single hidden unit and an output unit. The DNF expressions obtained by LAP for the hidden unit having a high output were:
1. Head_shape ∈ {round, square} ∧ Body_shape = octagon ∧ Jacket_color = not-red
2. Head_shape = square ∧ Body_shape ∈ {round, octagon} ∧ Jacket_color = not-red
Let X denote the set of head shapes, Y the set of body shapes and Z the set of jacket colors. The ancillary predicates inferring a high output for the hidden unit are composed of the same attributes and can be written as hidden1_predicate1(round, octagon, not-red) and hidden1_predicate1(square, octagon, not-red) for the first expression, and hidden1_predicate2(square, round, not-red) for the second expression. Since no duplication of facts is allowed, the fact (square, octagon, not-red) is not written for the second expression. Each instantiated predicate (fact) contains only one value per attribute and as many attributes as the DNF expression involves. These facts can be turned into generic predicates, based on the determinations introduced by Russell in his thesis [7], to quantify the knowledge. The generic predicate for the first expression is hidden1_predicate1(X,Y,Z), and hidden1_predicate2(X,Y,Z) for the second expression. The DNF expressions produced by LAP for the hidden unit having a low output were:
3. Jacket_color = red
4. Body_shape = square
5. Head_shape ∈ {round, octagon} ∧ Body_shape ∈ {round, square}
6. Head_shape = octagon
In a similar fashion, ancillary concepts are generated from these DNF expressions to infer that the hidden unit has a low output. The concepts are represented as hidden1_predicate3(Z), hidden1_predicate4(Y), hidden1_predicate5(X,Y) and hidden1_predicate6(X), with respective facts hidden1_predicate3(red), hidden1_predicate4(square), hidden1_predicate5(round or octagon, round or square) and hidden1_predicate6(octagon). A concept definition for the hidden unit is formed by collecting the dependencies among the arguments, and the rules are written as:
1. ∀ X,Y,Z hidden1_predicate1(X,Y,Z) ⇒ hidden_1(X,Y,Z)
2. ∀ X,Y,Z hidden1_predicate2(X,Y,Z) ⇒ hidden_1(X,Y,Z)
3. ∀ X,Y,Z hidden1_predicate3(Z) ⇒ ¬ hidden_1(X,Y,Z)
4. ∀ X,Y,Z hidden1_predicate4(Y) ⇒ ¬ hidden_1(X,Y,Z)
5. ∀ X,Y,Z hidden1_predicate5(X,Y) ⇒ ¬ hidden_1(X,Y,Z)
6. ∀ X,Y,Z hidden1_predicate6(X) ⇒ ¬ hidden_1(X,Y,Z)
In cascade networks, an output unit depends not only on the input units but also on the outputs of the hidden units. When the decompositional rule extraction process is applied to the output unit, the resulting DNF expressions contain an additional attribute corresponding to the hidden unit, with possible values in {false, true}. The set of DNF expressions for the output unit having a high output (named monk) was:
7. Jacket_color = red ∧ hidden_1 = false
8. Body_shape = octagon ∧ hidden_1 = false
9. Head_shape ∈ {round, square} ∧ Body_shape ∈ {round, octagon} ∧ hidden_1 = false
10. Head_shape ∈ {round, square} ∧ Body_shape ∈ {round, octagon} ∧ Jacket_color = red
11. Head_shape = square ∧ hidden_1 = false
For brevity, we omit the expressions for the case where the robot is not a monk. A general predicate for the goal concept of a monk can be expressed as monk(X,Y,Z). We can further identify the ancillary concepts monk_1(Z), monk_2(Y), monk_3(X,Y), monk_4(X,Y,Z) and monk_5(X), with the respective facts monk_1(red), monk_2(octagon), monk_3(round or square, round or octagon), monk_4(round or square, round or octagon, red) and monk_5(square). The definition of a monk is completed by utilising the definition of the hidden unit and the ancillary monk concepts, resulting in the following rules:

7. ∀ X,Y,Z monk_1(Z) ∧ ¬ hidden_1(X,Y,Z) ⇒ monk(X,Y,Z)
8. ∀ X,Y,Z monk_2(Y) ∧ ¬ hidden_1(X,Y,Z) ⇒ monk(X,Y,Z)
9. ∀ X,Y,Z monk_3(X,Y) ∧ ¬ hidden_1(X,Y,Z) ⇒ monk(X,Y,Z)
10. ∀ X,Y,Z monk_4(X,Y,Z) ⇒ monk(X,Y,Z)
11. ∀ X,Y,Z monk_5(X) ∧ ¬ hidden_1(X,Y,Z) ⇒ monk(X,Y,Z)
Such a knowledge base now allows queries to be posed and hence provides an explanation of why a classification arose. For example, if the query monk(square, square, not-red) is posed to the knowledge base, SHRUTI initiates and executes the appropriate rules for the given situation and returns the query as true, with the explanation:
monk(square, square, not-red) ⇐ monk_5(square) ∧ ¬ hidden_1(square, square, not-red)
¬ hidden_1(square, square, not-red) ⇐ hidden1_predicate4(square)
In a similar fashion, the system returned true whenever the query had the same value for Head_shape and Body_shape, or a red Jacket_color. The system returned false for instances where this condition is not met; for example, monk(round, square, not-red) was returned as false. SHRUTI does not assume negation as failure; it returns "do not know" or "not enough information about this instance" if there are no facts to support or refute the query.
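To make the query behaviour concrete, the sketch below evaluates the extracted monk1 rules for fully instantiated queries. This is a plain Python transcription for illustration, not SHRUTI itself; the hidden unit is simplified to fire exactly on the three facts listed for expressions 1 and 2.

```python
# Minimal, illustrative evaluation of the extracted monk1 rules (not SHRUTI).
# X = head shape, Y = body shape, Z = jacket colour.

HIDDEN1_HIGH_FACTS = {("round", "octagon", "not-red"),
                      ("square", "octagon", "not-red"),
                      ("square", "round", "not-red")}

def hidden_1(x, y, z):
    """True when one of the 'high output' facts (rules 1-2) fires."""
    return (x, y, z) in HIDDEN1_HIGH_FACTS

def monk(x, y, z):
    """Rules 7-11 for the output unit."""
    not_h = not hidden_1(x, y, z)
    return ((z == "red" and not_h)                                                  # rule 7
            or (y == "octagon" and not_h)                                           # rule 8
            or (x in {"round", "square"} and y in {"round", "octagon"} and not_h)   # rule 9
            or (x in {"round", "square"} and y in {"round", "octagon"} and z == "red")  # rule 10
            or (x == "square" and not_h))                                           # rule 11

print(monk("square", "square", "not-red"))  # True, via rule 11 (monk_5)
print(monk("round", "square", "not-red"))   # False, as in the text
```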

Table 3. Network performance. Inputs to the networks are sparsely coded; the number of nodes in the input layer is the total number of values over all attributes in the data set. The network size is given as the number of input, hidden and output units. Incorrectly classified instances are given as a ratio to the total number of instances (training or testing). As attributes/values that do not contribute to the network's prediction are removed from the data sets, each data set is reduced in dimension after the pruning process. The size of the compressed data sets is further reduced after the removal of duplicate instances.

                                   Cascade                              Pruned-Cascade
Data set (attributes)              size      err-train   err-test       size      err-train   err-test
Monk1 (6)                          17:1:1    0/123       0/309          7:1:1     0/19        0/18
Mushroom (22)                      117:0:1   0/6100      0/2024         15:0:1    0/30        0/30
Remote sensing hydrology (5)       15:0:1    0/69        2/16           14:0:1    0/69        2/16
Remote sensing forest (5)          15:1:1    0/69        0/16           15:1:1    0/69        0/16
Voting (16)                        48:0:1    1/259       8/170          15:0:1    2/144       2/60
Moral (23)                         48:0:1    0/162       0/40           26:0:1    0/155       0/40

3 Experimental results
The above methodology has been applied to several real-world data sets. For data sets with over 1000 instances, three-fold cross validation was used; five-fold cross validation was used for data sets of size over 200 and under 1000; and for data sets with fewer than 200 instances, ten-fold cross validation was carried out. Each network started with an input layer and an output node. Once the training of a network reached an optimum solution, the network with the highest accuracy and lowest error was chosen and the pruning process was initiated. The performance of the networks before and after the pruning process is shown in Table 3. The networks classified the patterns correctly for all data sets except voting and remote-sensing hydrology. The performance of the networks for the voting and remote-sensing hydrology data sets did not improve even with the addition of hidden units. After pruning, the networks were reduced in size for all data sets except the network trained for the remote-sensing forest data set. The pruned network for the voting data set even performed better than the original network.
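The fold-selection rule just described can be stated compactly; a trivial sketch follows, with the behaviour at exactly 200 or 1000 instances assumed, since the text does not specify it.

```python
def choose_folds(n_instances):
    """Cross-validation scheme used above: 3-fold for large data sets,
    5-fold for medium ones, 10-fold for small ones.  Boundary cases at
    exactly 200 or 1000 instances are an assumption here."""
    if n_instances > 1000:
        return 3
    if n_instances >= 200:
        return 5
    return 10
```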

Table 4. Accuracy and fidelity of the extracted propositional rule-sets. Incorrectly classified instances are given as a ratio to the total number of instances (training or testing). If an extracted rule-set outperforms the network, the fidelity is reported as positive, otherwise as negative. (A dash indicates that LAP could not extract a rule-set for the unpruned network; see the text below.)

                         Cascade                                 Pruned-Cascade
Data set                 acc-train   acc-test   fidelity         acc-train   acc-test   fidelity
Monk1      LAP           0/123       0/309      100              0/19        0/18       100
           RuleVI        0/123       12/309     -97.22           0/19        0/18       100
Mushroom   LAP           -           -          -                0/30        0/30       100
           RuleVI        0/6100      0/2024     100              0/30        0/30       100
Hydrology  LAP           0/69        2/16       100              0/69        0/16       +97.65
           RuleVI        0/69        2/16       100              0/69        1/16       +98.82
Forest     LAP           0/69        0/16       100              0/69        0/16       100
           RuleVI        0/69        7/16       -91.76           0/69        7/16       -91.76
Voting     LAP           -           -          -                2/144       2/60       100
           RuleVI        1/259       23/170     -96.5            2/144       5/60       -98.53
Moral      LAP           -           -          -                0/155       0/40       100
           RuleVI        2/162       1/40       -98.5            0/155       1/40       -99.5

After pruning, the networks were trained for a few more epochs. Once a network was finalised, the rule extraction methods LAP and RuleVI were used to extract propositional rule-sets. The accuracy of a rule extraction method is a measure of the generalisation ability of the rule-sets on unseen data (how well the data is classified by an extracted rule-set). The fidelity of a rule extraction method is a measure of the agreement between the network and an extracted rule-set (how well the rule-set can mimic the behaviour of the network). The accuracy and fidelity of the rule extraction methods on the networks before and after the pruning process are reported in Table 4. For the monk1 data set, the RuleVI-extracted rule-set failed to classify some of the patterns containing the attribute Has_tie. Since the nodes for the Has_tie attribute were deleted during pruning, both the accuracy and the fidelity of the rule-set extracted by RuleVI from the pruned cascade network increased. As the dimension of the search space in LAP is exponential in the number of values over all attributes, LAP failed to extract rule-sets for the mushroom, voting and moral data sets. However, LAP succeeded on the pruned networks, extracting rules with 100% accuracy and fidelity for these data sets. The rule-sets extracted by LAP and RuleVI for the hydrology data set on the pruned network outperformed those from the unpruned network. The comprehensibility of an extracted rule-set is measured by the number of rules, antecedents, consequents and actual attributes appearing in the rule-set. The comprehensibility of the propositional rule-sets for the different data sets is summarised in Table 5. For all data sets, the numbers of rules and antecedents are significantly lower in the rule-sets extracted from the pruned networks than in those extracted before pruning. Based on the results shown in Tables 4 and 5, it can be said that pruning the networks gives an overall better result in rule extraction, especially for data sets involving a large number of attributes/values.
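For clarity, accuracy and fidelity as used here can be computed as follows; this is a hedged sketch, and the exact error counting and signed-fidelity convention of Table 4 may differ.

```python
def accuracy(rule_predictions, true_labels):
    """Fraction of instances the extracted rule-set classifies correctly."""
    correct = sum(r == t for r, t in zip(rule_predictions, true_labels))
    return correct / len(true_labels)

def fidelity(rule_predictions, network_predictions):
    """Fraction of instances on which the rule-set agrees with the network."""
    agree = sum(r == n for r, n in zip(rule_predictions, network_predictions))
    return agree / len(network_predictions)
```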

Table 5. Comprehensibility of the extracted propositional rule-sets. The number of consequents includes the negated instances of the target goals as well. (A dash indicates that LAP could not extract a rule-set for the unpruned network.)

                         Cascade                            Pruned-Cascade
Data set                 ante's   cons's   rules            ante's   cons's   rules
Monk1      LAP           63       4        18               45       4        16
           RuleVI        61       2        22               24       2        12
Mushroom   LAP           -        -        -                220      2        41
           RuleVI        1038     2        481              52       2        12
Hydrology  LAP           55       2        11               37       2        8
           RuleVI        40       2        20               28       2        14
Forest     LAP           171      4        30               171      4        30
           RuleVI        103      2        40               103      2        40
Voting     LAP           -        -        -                648      2        259
           RuleVI        357      2        67               142      2        29
Moral      LAP           -        -        -                1652     2        254
           RuleVI        255      2        43               253      2        49

The SHRUTI knowledge base generated by this process is a transformation of a neural network into an inference network: the input units of the network correspond to the arguments of the predicates in the SHRUTI knowledge base, the hidden units are mapped into intermediate predicates, and the output units are mapped into consequent predicates at the highest level of the hierarchy. The input space of the neural network also determines the total number of entities in the knowledge base. Due to the relative flatness in inferential depth of the input space of neural networks, much more stringent queries are allowed to test the rule quality and the inferential mechanism of the SHRUTI model. Some of the comprehensibility measurements are reported in Table 6.
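A rough sketch of this unit-to-predicate mapping is shown below; the names and the string representation are illustrative assumptions, not the authors' SHRUTI encoder.

```python
def network_to_predicates(input_attributes, hidden_units, output_units):
    """Map pruned-network units to predicate skeletons, as described above.
    Input attributes surviving pruning become the predicate arguments."""
    args = ", ".join(f"X{i + 1}" for i in range(len(input_attributes)))
    intermediate = [f"{h}({args})" for h in hidden_units]   # hidden units -> intermediate predicates
    consequent = [f"{o}({args})" for o in output_units]     # output units -> consequent predicates
    return intermediate, consequent

inter, cons = network_to_predicates(
    ["Head_shape", "Body_shape", "Jacket_color"], ["hidden_1"], ["monk"])
# inter == ['hidden_1(X1, X2, X3)'], cons == ['monk(X1, X2, X3)']
```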

Table 6. Comprehensibility of the SHRUTI knowledge base. Comprehensibility of the SHRUTI knowledge representation formalism is measured by the number of total concepts, sub-concepts, facts, rules, predicates, arguments per consequent, arguments per antecedent, and distinct entities used in the instantiated predicates. (A dash indicates that LAP could not extract a rule-set for the unpruned network.)

                         Cascade                              Pruned-Cascade
Data set                 concepts   facts   rules             concepts   facts   rules
Monk1      LAP           17         54      18                8          27      16
           RuleVI        17         22      22                7          12      12
Mushroom   LAP           -          -       -                 21         134     41
           RuleVI        117        470     230               15         18      9
Hydrology  LAP           15         60      11                14         45      8
           RuleVI        15         20      20                14         14      14
Forest     LAP           15         127     23                15         127     23
           RuleVI        15         40      40                15         40      40
Voting     LAP           -          -       -                 30         259     259
           RuleVI        48         67      67                15         33      29
Moral      LAP           -          -       -                 26         415     254
           RuleVI        48         43      43                26         49      49

4 Discussion
The reason for choosing neural networks as a base tool is that connectionist networks have the ability to represent extremely large data sets and are universal approximators [5]: for any given function there is a connectionist network capable of approximating the function arbitrarily closely. The generalisation capability of neural networks makes them a strong alternative to symbolic methods. The cascade correlation architecture is chosen because it incrementally builds a network with a localised hidden unit representation and allows for lateral connections among hidden units. These lateral connections can form part of an inference path in the resulting knowledge base. The constructive algorithm also eases the burden of specifying a network architecture a priori. The rules extracted by LAP are of high fidelity because their classification ability is equivalent to that of the network from which they are extracted. A shortcoming of LAP and other decompositional techniques is that these methods become cumbersome when the search space (number of attributes) is too large. As the number of dimensions increases, the number of possible combinations of attributes grows exponentially. Pruning allows the rule extraction methods to consider only the parts of the search space whose elements are the input attributes involved in the conjunction of target concepts. Since in LAP the dimension of the search space is exponential in the number of values over all attributes, the gain is significant when not all concepts are involved. For instance, the search space is reduced by a factor of roughly 25 in the simple first-monk data set. Rule-sets extracted by RuleVI and other pedagogical methods only completely 'cover' the set of instances used to obtain them. Removing attributes that have no impact on the target concept therefore increases the accuracy of such methods. A projection to a space of lower dimension translates a neural network into a compact rule-set, a task that seems intractable in the original high dimension. The proposed system also opens up the possibility of interacting with the data sets by allowing queries through the interface of a reasoning system like SHRUTI. SHRUTI is a connectionist architecture that can encode millions of facts and rules involving n-ary predicates and variables and performs a class of inferences in a few milliseconds. SHRUTI encodes knowledge as first-order, function-free Horn clauses with the added restriction that any variable occurring in multiple argument positions in the antecedent of a rule must also appear in the consequent. By using SHRUTI, the proposed system has the following advantages: (1) inference time independent of the size of the knowledge base; (2) inferencing complexity linear in the length of the inference path; (3) support for parallel inferencing and mapping of a knowledge base onto a parallel architecture. SHRUTI does not require the user to be a domain expert, since it allows partially instantiated queries to be posed. Such queries can further be used to generate partially instantiated instances in a supervised learning task. Furthermore, the method is able to represent the explicit negation of predicates in describing the goal concepts. The knowledge representation system developed by this project offers the user a choice of more than one language for the expression of domain knowledge: the weights and architecture of a trained neural network, a propositional rule set, and a predicate rule set.
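The variable restriction mentioned above can be checked mechanically; a small, hedged sketch follows, with the rule representation and the upper-case variable convention being assumptions rather than SHRUTI's actual encoding.

```python
from collections import Counter

def satisfies_shruti_restriction(antecedent_args, consequent_args):
    """Check that every variable occurring in more than one argument position
    in the antecedent also appears in the consequent.  Variables are assumed
    to be strings starting with an upper-case letter (a simplification)."""
    counts = Counter(a for args in antecedent_args for a in args if a[:1].isupper())
    repeated = {v for v, c in counts.items() if c > 1}
    return repeated.issubset(set(consequent_args))

# Rule 7 above: monk_1(Z) AND NOT hidden_1(X,Y,Z) => monk(X,Y,Z)
print(satisfies_shruti_restriction([("Z",), ("X", "Y", "Z")], ("X", "Y", "Z")))  # True
```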

5 Conclusion and further work
The present work integrates neural network training techniques with a reasoning system and helps to automate the knowledge acquisition bottleneck. Neural networks with rule extraction techniques learn the operating domain knowledge in the form of rules, which are then used by SHRUTI as the final component of the system. The proposed work addresses the problem of understanding why a trained neural network, or a data set, makes a given decision and enables the user to understand its behaviour. There are a number of issues regarding our approach that we plan to pursue. The most important issue is scalability: we need to carry out experiments with more complex and larger problem domains. A second area we plan to investigate is employing this approach for networks classifying multiple concepts or for prediction problems. Other algorithms that construct cascade-type networks, such as the tower algorithm, also need to be investigated, as they are fundamentally different and may yield quite different rule sets.

References
1. R. Andrews, J. Diederich, and A. Tickle. A survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Systems, 8:373-389, 1995.
2. S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems. Morgan Kaufmann, 1990.
3. R. Hayward, C. Ho-Stuart, and J. Diederich. Neural networks as oracles for rule extraction. In Connectionist Systems for Knowledge Representation and Deduction. Queensland University of Technology, Brisbane, 1997.
4. R. Hayward, A. Tickle, and J. Diederich. Extracting rules for grammar recognition from cascade-2 networks. In Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing, pages 48-60. Springer-Verlag, Berlin, 1996.
5. K. Hornik, M. Stinchcombe, and H. White. Multi-layer feedforward neural networks are universal approximators. Neural Networks, 2:359-366, 1989.
6. T. M. Mitchell. Machine Learning. The McGraw-Hill Companies, Inc., 1997.
7. S. Russell. Analogical and inductive reasoning. PhD thesis, Stanford University, 1986.
8. L. Shastri and V. Ajjanagadde. From associations to systematic reasoning: A connectionist representation of rules, variables and dynamic bindings. Behavioral and Brain Sciences, 16:417-494, 1993.
9. J. W. Shavlik, R. J. Mooney, and G. G. Towell. Symbolic and neural learning algorithms: an experimental comparison. Machine Learning, 6:111-143, 1991.
10. S. Thrun et al. The monk's problems: a performance comparison of different learning algorithms. Technical Report CMU-CS-91-197, Carnegie Mellon University, 1991.
11. G. G. Towell and J. W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71-101, 1993.
12. A. S. Weigend, B. A. Huberman, and D. E. Rumelhart. Predicting the future: A connectionist approach. Technical Report P-90-00022, System Science Laboratory, Palo Alto Research Centre, California, 1990.