Dynamic Adaptation of Recurrent Neural Network Architectures Guided by Symbolic Knowledge

C.W. Omlin (a), C.L. Giles (b,c)

(a) Computer Science Department, University of Stellenbosch, 7600 Stellenbosch, SOUTH AFRICA
(b) NEC Research Institute, 4 Independence Way, Princeton, NJ 08540, USA
(c) UMIACS, University of Maryland, College Park, MD 20742, USA


Abstract

The success and the time needed to train neural networks with a given learning algorithm depend on the learning task, the initial conditions, and the network architecture. In particular, the number of hidden units in feedforward and recurrent neural networks is an important factor. We propose a novel method for dynamically adapting the architecture of recurrent neural networks trained to behave like deterministic finite-state automata (DFAs). It differs from other constructive approaches in that our method relies on the continuous extraction and insertion of symbolic knowledge in the form of DFAs. The network architecture (number of neurons and weight configuration) changes during training based on the symbolic information extracted from undertrained networks. We successfully trained recurrent networks to recognize strings of a regular language accepted by a non-trivial, randomly generated deterministic finite-state automaton. Our empirical results indicate that symbolically-driven network adaptation results in networks that train faster than networks trained with standard network growing methods, with comparable generalization performance. Furthermore, they generalize better than networks whose architectures remain unchanged during training.

Key words: Knowledge-based recurrent neural networks, dynamic network adaptation, symbolic knowledge extraction, symbolic knowledge insertion, knowledge refinement.

1 Introduction

Incremental learning is an important paradigm for training neural networks. The term incremental can refer to the way complex concepts are learned starting with simple instances of the unknown concept (data-incremental learning), or it can refer to network training procedures where the size of the architecture (number of hidden layers, neurons, or weights) increases dynamically during training (architecture-incremental learning). This paper shows how data-incremental learning can be synergistically combined with symbolic knowledge extraction and insertion to dynamically change the architecture of recurrent neural networks during training. The result is a heuristic for knowledge-based, architecture-incremental training of recurrent networks.

Choosing an appropriate architecture for a learning task is an important issue in training neural networks. Because general methods for determining a `good' network architecture prior to training are generally lacking, algorithms that adapt the network architecture during training have been developed. These algorithms can be classified according to their different objectives. Destructive algorithms remove neurons or weights from large, trained networks and retrain the reduced networks in an attempt to improve their generalization performance. In essence, they rely on the rule of thumb that larger networks tend to learn tasks more readily than smaller networks (larger number of parameters), whereas smaller networks tend to generalize better (no overfitting). Removing neurons or weights is equivalent to reducing the number of parameters. The method proposed for feedforward networks in [30] estimates the error induced by removing a neuron by multiplying the output of each neuron with a parameter $\alpha_i$. Those neurons for which the exponentially decaying average of the values $\partial S / \partial \alpha_i$ is smallest are removed. Higher-order pruning methods include `Optimal Brain Damage' (OBD) [28] and `Optimal Brain Surgeon' (OBS) [24]. Both methods rank the weights according to their saliency and remove those weights with low saliency. However, OBD estimates the weight saliency as the direct change in training error without retraining, whereas OBS performs `retraining' using a local quadratic approximation. Both methods require the computation (or approximation) of the Hessian matrix $\partial^2 E / \partial w^2$, which is computationally very expensive.

A simple heuristic for pruning recurrent neural networks has been shown to be very effective at improving both a trained network's generalization performance and the quality of the extracted rules [21]: after successful training, the state neuron $S_i$ with the smallest incoming weight vector (i.e. $\sum_{j,k} W_{ijk}$) is removed and the network is retrained using the same training set. This process is repeated until either a network with satisfactory generalization performance is obtained or the retraining fails to converge within a certain number of epochs. When the current network fails to converge, the network trained in the previous prune/retrain cycle is the solution network. While these methods are successful at improving the networks' generalization performance (and the accuracy of extracted symbolic knowledge), they do not solve the problem of choosing an initial network architecture that permits successful training.
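As a concrete sketch of the prune/retrain selection step above, the following Python fragment picks the pruning candidate from a second-order weight tensor; comparing absolute weight sums is our assumption (a signed sum would let large positive and negative weights cancel), not necessarily the exact criterion of [21].

```python
import numpy as np

def weakest_state_neuron(W):
    """Rank state neurons by the magnitude of their incoming second-order
    weights, sum over j,k of |W[i, j, k]|, and return the index of the
    weakest neuron as the pruning candidate (assumed absolute-value variant)."""
    return int(np.argmin(np.abs(W).sum(axis=(1, 2))))
```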

Constructive algorithms eliminate the need to guess an appropriate initial network architecture prior to training (e.g. [11, 12, 6]). Starting with small networks - usually a single neuron - they train until some criterion is met, indicating that further training will not lead to satisfactory performance on the training data set. A neuron along with its weights is added and training is resumed until either the network's performance on the training set is satisfactory or the criterion is met again. The cascade-correlation learning algorithm for recurrent neural networks [12] freezes the weights when a new neuron along with its weights is added to the network. A detailed discussion of the shortcomings of this approach when training recurrent neural networks to behave like deterministic finite-state automata can be found in [6, 26]. Essentially, the results show that there exist temporal learning tasks which cannot be represented by such networks. In order to overcome these representational limits, all weights must remain adaptable throughout the training process. It is important to note that these methods do not guarantee that an optimal architecture with respect to any criterion is found. Simple constructive methods just keep adding parameters (neurons and weights) until a solution to the learning task is found. Approaches for training recurrent neural networks with dynamic architectures based on genetic algorithms have also been proposed (see e.g. [47]).

The methods presented above for adapting the network architecture during training are complementary. Constructive algorithms can be used to find a solution while destructive methods can be applied to improve the performance of already trained networks. None of the methods uses any symbolic domain knowledge to guide the network expansion, i.e. they consider insufficient network capacity as measured by the number of hidden neurons to be the sole reason why a network may fail to converge.

The remainder of this paper is organized as follows: In Section 2, we present a framework for combining symbolic and neural learning first proposed in [39]. We introduce deterministic finite-state automata (DFAs) as a model for temporal processes and the recurrent network architecture trained to behave like DFAs in Sections 3 and 4, respectively. The heuristics we use for symbolic knowledge extraction and insertion are briefly discussed in Sections 5 and 6, respectively. We discuss the constraints we impose on how the extracted knowledge is used to initialize a new network and present simulation results in Section 7. A summary and potential future work conclude this paper.

2 Knowledge-Based Neural Networks

Recently, there has been a lot of interest in knowledge-based neural networks, i.e. in combining symbolic with neural learning ([4, 14, 20, 29, 40, 44, 41, 43, 39]). There are different ways in which neural and symbolic learning can be combined to solve a given learning task.


Figure 1: A Framework for Combining Symbolic and Neural Learning: The use of a neural network for knowledge refinement usually consists of three steps: (1) insertion of prior symbolic knowledge (initial domain theory) into a neural network, (2) refinement of knowledge through training a neural network on examples, and (3) extraction of learned symbolic knowledge (refined domain theory) from a trained network. The objective in this paper is to explore a possible fourth step: (4) use symbolic knowledge extracted during the training process to dynamically adapt the network architecture.

An excellent collection of a variety of approaches can be found in [1]. In the following, we will limit our discussion to the paradigm illustrated in Figure 1. The traditional approach to using neural networks is shown in the lower part (`connectionist representation'). A network's adaptable weights are initialized with random values drawn according to some distribution. Using numerical optimization methods (e.g. gradient descent techniques, simulated annealing), the network is trained on some known data to perform a certain task (e.g. pattern classification) until some training criterion is met. After successful training, a network can take advantage of its generalization capabilities to perform the intended task on arbitrary data. Notice that during the entire process, the knowledge remains `hidden' in a network's adaptable connections; hence the name `connectionist representation'.

The above training paradigm can be enriched with symbolic knowledge in the following way (`symbolic representation'): Prior knowledge about a task (initial domain knowledge) is used to initialize a network prior to training. This requires a translation of the information from a symbolic into a connectionist representation.

The particular method for converting the symbolic representation of knowledge into its equivalent connectionist representation depends on the kind of symbolic knowledge, the learning task, and the network model used for learning. To date, most efforts are directed towards encoding prior knowledge by programming some network weights to specified values instead of choosing small random values. The programmed weights define a starting point for the search for a solution in weight space. The premise is that a better solution will be found faster compared to starting the search from a random point in weight space. Examples of this approach include prestructuring of feedforward networks with Boolean concepts (see e.g. [16, 42]) and imposing rotation invariance in neural networks for image recognition (e.g. [2]). We should point out that other types of prior knowledge encoding are possible. For instance, rotation invariance can also be achieved through training, i.e. by presenting examples of rotated objects as inputs to a network. The choice of a network architecture itself represents an implicit use of prior knowledge about an application.

Once a network has succeeded in learning a task as measured by its performance on the training data, it may be useful to extract the learned knowledge. The question arises whether it is possible to extract an adequate symbolic representation of the knowledge learned by a network, i.e. a representation that captures the essence of the learned knowledge. In many cases, the extracted knowledge may only approximate a network's true knowledge; however, it is also possible for the extracted symbolic representation to exceed the accuracy of the knowledge stored in a trained network [31]. In some instances, networks trained with prior knowledge also facilitate the extraction of symbolic knowledge from trained networks. For feedforward networks, propositional rules have been the main paradigm for symbolic knowledge; for recurrent networks, deterministic finite-state automata have been the main vehicle for addressing the issues mentioned above. While good empirical results have been achieved using the framework which combines neural and symbolic learning described above, the merits underlying the symbolic/connectionist approach are not yet well understood. Gaining that insight remains an important open research problem.

Recently, a novel approach for the expansion of feedforward neural networks during training was proposed which is guided by the training data and the domain knowledge stored in a knowledge-based network [36]. The goal is to find a network topology which best refines an initial domain theory expressed in propositional logic. Based on a symbolic representation of a network undergoing training, the method identifies nodes which are the primary contributors to network classification errors; then, nodes are added in a fashion similar to adding rules and conjuncts to a rule base. The success of this approach relies on the assumption that, in knowledge-based networks, sigmoidal discriminants can be approximated by hard-limiting threshold functions; consequently, neurons can be interpreted as Boolean operators AND and OR.

The algorithm maintains a population of a fixed number of knowledge-based networks where each network has been expanded by one antecedent neuron. Each network is retrained and ranked according to its performance on a network validation set. The best network is chosen for further possible expansion. The proposed heuristic for expanding networks tends to create networks with many sparse hidden layers because of its selective expansion strategy. Empirical results conclusively demonstrate that networks whose architecture is expanded in this fashion show a generalization performance that is superior to the performance of knowledge-based networks that are not expanded during training, and also superior to a naive network growing method which indiscriminately adds hidden neurons that are connected to all input and output neurons until network training converges.

For the case where recurrent neural networks are trained to behave like a deterministic finite-state automaton (DFA), we have shown how second-order recurrent networks can be initialized with partial (symbolic) knowledge of a DFA, leading to faster convergence ([20, 34]). It is also possible to extract a symbolic representation of the learned language in the form of DFAs from trained networks ([17, 18, 19]). The extracted rules always outperform the trained network they were extracted from [35, 32]. The focus of this work is on how symbolic knowledge can guide recurrent network learning. In particular, we are interested in using symbolic knowledge extracted during training to guide the growth of an adaptive network architecture. The use of symbolic knowledge to adapt a recurrent network architecture during training is a novel idea. Our method applied in a greedy manner could be considered a combination of constructive and destructive approaches to neural network training, since the network is allowed to grow and shrink based on the information extracted from the network undergoing training. However, because of some constraints we are imposing, we classify our algorithm as a constructive method.

3 Regular Languages

Regular languages represent the smallest class of formal languages in the Chomsky hierarchy [25]. They are generated by regular grammars. A regular grammar G is a 4-tuple $G = \langle S, N, T, P \rangle$ where S is the start symbol, N and T are non-terminal and terminal symbols, respectively, and P is a set of productions of the form $A \rightarrow a$ or $A \rightarrow aB$ where $A, B \in N$ and $a \in T$. The regular language generated by G is denoted L(G). Associated with each regular language L is a deterministic finite-state automaton (DFA) M which is an acceptor for the language L(G), i.e. L(G) = L(M). DFA M accepts only strings which are members of the regular language L(G).


Figure 2: Random 10-State DFA: State 1 is the start state; solid and dashed arcs denote state transitions on input symbols '0' and '1', respectively; accepting states are represented as double circles.

Formally, a DFA M is a 5-tuple $M = \langle \Sigma, Q, R, F, \delta \rangle$ where $\Sigma = \{a_1, \ldots, a_k\}$ is the alphabet of the language L, $Q = \{s_1, \ldots, s_M\}$ is a set of states, $R \in Q$ is the start state, $F \subseteq Q$ is a set of accepting states, and $\delta: Q \times \Sigma \rightarrow Q$ defines the state transitions in M. A string x is accepted by the DFA M, and hence is a member of the regular language L(M), if an accepting state is reached after the string x has been read by M. Alternatively, a DFA M can also be considered a generator which generates the regular language L(M). An example of a DFA with 10 states and two input symbols is shown in Figure 2.

Training recurrent neural networks on positive and negative example strings, along with a heuristic for extracting symbolic knowledge from trained networks, allows us to perform grammatical inference, i.e. to derive the grammatical rules that generate the string classifications. While recurrent neural networks are not yet competitive with symbolic algorithms tailored to the task of grammatical inference, they provide an excellent testbed for addressing fundamental issues, e.g. learning algorithms and symbolic knowledge representation. Furthermore, automata and their languages can model a large class of dynamical processes, and no feature extraction is generally needed for learning.
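To make the acceptance criterion concrete, here is a minimal Python sketch of DFA string processing; the two-state transition table is a hypothetical example, not the automaton of Figure 2.

```python
def dfa_accepts(delta, start, accepting, string):
    """Run a DFA on a string; delta maps (state, symbol) -> state."""
    state = start
    for symbol in string:
        state = delta[(state, symbol)]
    return state in accepting

# Example: a 2-state DFA over {'0', '1'} accepting strings with an odd number of '1's.
delta = {(1, '0'): 1, (1, '1'): 2, (2, '0'): 2, (2, '1'): 1}
print(dfa_accepts(delta, start=1, accepting={2}, string='10110'))  # True
```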

4 Recurrent Neural Network

We use discrete-time recurrent networks with weights $W_{ijk}$ to learn grammars ([7, 9, 18, 37, 38, 45, 48]). A network accepts a time-ordered sequence of inputs and evolves with dynamics defined by the following equations:

$$S_i^{(t+1)} = g(\Xi_i + b_i), \qquad \Xi_i = \sum_{j,k} W_{ijk}\, S_j^{(t)} I_k^{(t)},$$

where g is a sigmoid discriminant function and $b_i$ is the bias associated with hidden recurrent state neuron $S_i$. The weights $W_{ijk}$ are updated according to a second-order form of the real-time recurrent learning algorithm based on gradient descent ([46]). For more details see [18].
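The dynamics above can be written compactly with numpy; this is an illustrative sketch (the dimensions, the initialization interval, and the one-hot input encoding are our assumptions), not the training code of [18].

```python
import numpy as np

def rnn_step(W, b, S, I):
    """One step of the second-order dynamics:
    S_i(t+1) = g(b_i + sum_{j,k} W_ijk * S_j(t) * I_k(t)), g = sigmoid."""
    net = b + np.einsum('ijk,j,k->i', W, S, I)  # contract state and input indices
    return 1.0 / (1.0 + np.exp(-net))

N, K = 4, 2                                 # state neurons, input symbols
rng = np.random.default_rng(0)
W = rng.uniform(-0.1, 0.1, size=(N, N, K))  # small random weights
b = np.zeros(N)
S = np.array([1.0, 0.0, 0.0, 0.0])          # initial network state
for symbol in [0, 1, 1]:                    # process the string "011"
    I = np.eye(K)[symbol]                   # one-hot input vector
    S = rnn_step(W, b, S, I)
```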

5 Knowledge Extraction

We extract symbolic rules about the learned grammar in the form of DFAs. The extraction algorithm is based on the hypothesis that the outputs of the recurrent state neurons tend to cluster when the network is well-trained and that these clusters correspond to the states of the learned DFA ([18]). Thus, rule extraction is reduced to finding clusters in the output space of state neurons and transitions between clusters. Our algorithm employs a dynamical state space exploration along with pruning heuristics to make the extraction computationally feasible. We choose an extraction algorithm based on a partitioning of the output space of recurrent neurons [31]; other algorithms have been proposed (e.g. [8]). The extraction algorithm divides the output of each of the N state neurons into r intervals of equal size, yielding $r^N$ partitions in the space of outputs of the state neurons (the theoretical foundations for this DFA extraction algorithm have been discussed in [5]). We also refer to r as the quantization level. Each partition is an N-dimensional cube with edge length $1/r$. From some defined initial network state, a network's trained weights will cause it to follow a trajectory in state space as symbols are presented as input. The algorithm considers all strings in alphabetical order starting with length 1. This procedure defines a search tree with the initial state as its root; the number of successors of each node is equal to the number of symbols in the input alphabet. Links between nodes correspond to transitions between DFA states. The search is performed in breadth-first order. When a transition from a partition reaches another (not necessarily different) partition, then all paths from that new network state are followed for subsequent input symbols, creating new states in the DFA for each new input symbol, subject to the following three conditions: (1) When a previously visited partition is reached, then only the new transition is defined between the previous and the current partition; no new state is created. This corresponds to pruning the search tree at that node.


(2) In general, many different network states belong to the same partition. We use the network state which first reached a partition as the new network state for the subsequent input symbols $\{a_1, a_2, \ldots, a_K\}$. (This condition only applies when two or more transitions from a partition to another partition are extracted.) (3) When a transition leads from a partition immediately to the same partition, then a loop is created and the search tree is also pruned at that node. The algorithm terminates when no new states are created and all possible transitions from all states have been extracted. Extracted states are labeled rejecting or accepting depending on whether the value of the network output neuron is less than or greater than 0.5, respectively. We apply a standard automaton minimization algorithm to each extracted automaton [25]. Clearly, the extracted automaton depends on the quantization level r chosen, i.e., in general, different automata can be extracted for different values of r. We solve the model selection problem by choosing the first automaton that correctly classifies the training set [31]. Furthermore, because of condition (2), different automata may be extracted depending on the order in which the successors of a node in the search tree are visited. In general, however, the order in which input symbols are presented is not significant because of the subsequent minimization of the extracted automaton. The specific issues of the extraction algorithm and the quality of the extracted rules are discussed in [18, 31]. It is important to note that DFA extraction can be performed on an initialized network prior to training, during training, and after training. The extracted DFA then represents the initial knowledge, the partial knowledge the network is gaining during training, and the learned DFA after training, respectively.
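A compact sketch of the breadth-first extraction under the three conditions above; `step` stands for one application of the trained network dynamics and is an assumed interface, and minimization of the resulting automaton is omitted.

```python
from collections import deque

def extract_transitions(step, s0, symbols, r):
    """Breadth-first search over quantized network states (sketch).
    step(S, a) -> next state vector; s0: initial state; r: quantization level."""
    part = lambda S: tuple(min(int(x * r), r - 1) for x in S)  # partition index of a state
    start = part(s0)
    reps = {start: s0}   # condition (2): first network state reaching each partition
    delta = {}           # extracted transitions: (partition, symbol) -> partition
    queue = deque([start])
    while queue:
        p = queue.popleft()
        for a in symbols:
            q_vec = step(reps[p], a)
            q = part(q_vec)
            if q not in reps:      # unseen partition: new DFA state, keep searching
                reps[q] = q_vec
                queue.append(q)
            delta[(p, a)] = q      # conditions (1) and (3): only record the transition
    return delta, start, reps
```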

6 Knowledge Insertion

The algorithm used here to insert rules into a second-order network is discussed in detail in [20, 34]. For a method for inserting prior knowledge into first-order networks see [13, 14, 15]. Given some partial information about the DFA (states and transitions), rules are inserted into a second-order network which (partially) define a nearly orthogonal internal representation of the DFA states. Given knowledge about n states and m input symbols, a network with n+1 recurrent state neurons and m input neurons is constructed. The algorithm consists of two parts: we program the weights of the network to reflect known state transitions $\delta(q_j, a_k) = q_i$, and the weights that determine the output of the response neuron for each DFA state. Neurons $S_j$ and $S_i$ correspond to DFA states $q_j$ and $q_i$, respectively. The weights $W_{jjk}$, $W_{ijk}$, $W_{0jk}$, and biases $b_i$ are programmed with weight strength H as follows:

$$W_{ijk} = \begin{cases} +H & \text{if } \delta(q_j, a_k) = q_i \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

$$W_{jjk} = \begin{cases} +H & \text{if } \delta(q_j, a_k) = q_j \\ -H & \text{otherwise} \end{cases} \qquad (2)$$

$$W_{0jk} = \begin{cases} +H & \text{if } \delta(q_j, a_k) \in F \\ -H & \text{otherwise} \end{cases} \qquad (3)$$

$$b_i = -H/2, \qquad 0 \le i \le n \qquad (4)$$

For each known state transition, at most three weights of the network have to be programmed. This encoding is similar to the encoding scheme for second-order networks with hard-limiting threshold functions discussed in [23]. The initial state of the network is $S^{(0)} = (S_0^{(0)}, 1, 0, 0, \ldots, 0)$. The initial value of the response neuron $S_0^{(0)}$ is 1 if the DFA's initial state $q_0$ is an accepting state and 0 otherwise. The network rejects a given string if the value of the output neuron $S_0^{(t)}$ at the end of the string is less than or equal to 0.5; otherwise, the network accepts the string. Such prior knowledge defines an inductive bias for learning; it presumably defines a good starting point from which to search for a solution in weight space. Previously, we have shown how insertion and extraction of symbolic knowledge can be applied to knowledge revision, where prior knowledge about (some) grammatical rules is inserted into a network and then compared with the DFA extracted after training ([22, 33]). We have shown that recurrent networks are suitable for knowledge revision in that they preserve genuine rules (i.e. rules which actually exist) and are able to correct wrong rules. For this application of insertion of DFAs into networks, we assume that the number of recurrent state neurons (not including the special response neuron) is determined by the DFA extracted during training.
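A sketch of the weight programming in Eqs. (1)-(4); the rule strength H = 6.0 and the small random initialization of unprogrammed weights (cf. Section 7.3) are illustrative choices, not prescriptions from the algorithm itself.

```python
import numpy as np

def program_weights(n, m, transitions, accepting, H=6.0):
    """Encode partial DFA knowledge into a second-order weight tensor.
    transitions: dict (j, k) -> i meaning delta(q_j, a_k) = q_i, with DFA
    states mapped to neurons 1..n; neuron 0 is the response neuron.
    accepting: set of accepting-state indices."""
    W = np.random.uniform(-0.1, 0.1, size=(n + 1, n + 1, m))  # unprogrammed weights stay small
    b = np.full(n + 1, -H / 2.0)                              # Eq. (4)
    for (j, k), i in transitions.items():
        W[i, j, k] = +H                                       # Eq. (1): drive the target state high
        W[j, j, k] = +H if i == j else -H                     # Eq. (2): self-loop vs. reset
        W[0, j, k] = +H if i in accepting else -H             # Eq. (3): response neuron output
    return W, b
```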

7 Training Networks

7.1 Data-Incremental Learning

Data-incremental learning is an important strategy for learning temporal patterns ([10]). Long example strings represent long-term dependencies; because the gradient information vanishes for long strings, learning becomes increasingly inefficient as the temporal span of the dependencies increases [3]. Thus, for learning regular languages, we assume that the sample strings are ordered according to their length.

Our strategy is to learn in cycles, where each cycle consists of three distinct phases: (1) Train the network on a small subset of the training data set (the "working set"). This initial data set will contain the shortest strings only. These strings represent short-term dependencies which are sufficient to infer similar, but longer-term dependencies. (2) The training cycle ends when the network performs satisfactorily on the working set or when a maximum number of epochs in this cycle has expired. (3) The network is then tested on the entire available training set. If the network does not perform satisfactorily, then a fixed number of strings are added to the working set and a new cycle starts. Otherwise, the training algorithm terminates. A sketch of this cycle appears below.
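This is a minimal sketch of the three-phase cycle; `fit`, `accuracy`, and `adapt_architecture` are assumed interfaces, not an actual API, and the update-after-failure policy of Section 7.2 is folded into the loop for brevity.

```python
def train_in_cycles(network, training_set, step_size=10, max_epochs=300):
    """Data-incremental training loop (sketch, assumed interfaces)."""
    working = list(training_set[:step_size])
    while True:
        converged = network.fit(working, max_epochs=max_epochs)  # phases (1) and (2)
        if network.accuracy(training_set) == 1.0:                # phase (3): test on all data
            return network
        if not converged:                                        # update-after-failure policy
            network = adapt_architecture(network, working)
        working = list(training_set[:len(working) + step_size])  # grow the working set
```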

7.2 Architecture-Incremental Learning

7.2.1 Generic Algorithm

Often, when a network fails to converge to a solution, it is because the number of neurons and weights is insufficient. In these cases, the number of available parameters is increased by simply adding a neuron ([6, 12]). We can identify two different criteria to decide when to update the network architecture: the periodic updating heuristic changes the network after a fixed number of training epochs, regardless of the learning progress made, or after convergence on the current working set; the update-after-failure heuristic only changes the architecture after the network has failed to learn the current training set. The latter is the prevalent heuristic used to dynamically adapt a network architecture during training; we will use that heuristic for the remainder of this paper. Our method differs from other heuristics in the way the network architecture is adapted. Instead of adding a single neuron to the current network architecture, we extract symbolic knowledge in the form of a DFA from the current network, and use this knowledge to initialize a new network whose size is determined by the size of the extracted DFA. Thus, we construct a sequence of recurrent networks $N_1, N_2, \ldots, N_p$ where each $N_i$ is initialized with symbolic knowledge $M_{i-1}$ extracted from $N_{i-1}$. The network $N_1$ is a network of minimal size if no prior knowledge about the DFA is given; otherwise, $N_1$ is initialized with the prior knowledge, which also determines the size of the initial network architecture. The network $N_p$ is the final network whose performance is consistent with the entire training set; $M_p$ is the DFA extracted from $N_p$ that also correctly classifies the entire training set.

7.2.2 Greedy Algorithm

The greedy approach to architecture adaptation during training is straightforward. A DFA $M_i$ that is consistent with the current working set $T_i$ is extracted from the network $N_i$, and a new network $N_{i+1}$ is initialized with $M_i$. Thus, the size of the new network is equal to the size of the extracted DFA (not including the special response neuron). The time of DFA extraction and network initialization is chosen according to the update schedule for adapting the network architecture. A sketch of one such adaptation step follows.
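The sketch below strings together the earlier extraction and insertion sketches into one greedy adaptation step; `minimize_dfa` and `Network` are assumed helpers (standard DFA minimization and a network constructor), not part of the paper's algorithm as stated.

```python
def adapt_architecture(network, working_set):
    """Greedy adaptation step (sketch, reusing extract_transitions and
    program_weights above; minimize_dfa and Network are assumed helpers)."""
    delta, start, reps = extract_transitions(network.step, network.s0,
                                             network.symbols, r=network.r)
    transitions, accepting, n_states = minimize_dfa(delta, start)  # standard minimization [25]
    W, b = program_weights(n=n_states, m=len(network.symbols),
                           transitions=transitions, accepting=accepting)
    return Network(W, b)  # one state neuron per DFA state, plus the response neuron
```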

While this greedy approach is very simple, it suffers from two disadvantages. (1) Some of the production rules in the DFA extracted during training, particularly during the early stages of training, are generally not the ones in the final DFA; i.e. there is the potential that networks are initialized with wrong rules which the networks have to unlearn; this is clearly undesirable. The more serious flaw of the greedy approach is the following: (2) During training, the size of the extracted DFA tends to be much larger than the size of the DFA extracted after successful training. For the purpose of illustration, we have trained a fixed 11-neuron network on a data set generated by the DFA shown in Figure 2. The training set consisted of the first 500 positive and the first 500 negative example strings arranged in alphabetical order. A typical progression of the number of DFA states extracted from a network undergoing training after each cycle is shown in Figure 3. For DFA extraction, we chose the smallest quantization level r for which the DFA was consistent with the current working set. We make the following observations: (1) The number of states in the extracted DFA initially increases rapidly with the size of the working set. After further training, the size of the extracted DFA decreases. (2) As the working set increases, the number of additional training epochs necessary to learn the new strings decreases. This serves as empirical evidence that, in the late stages of training, networks can rely on the dynamics induced in the early stages of training when strings were short; training amounts to nothing more than knowledge refinement. (3) Knowledge refinement affects DFA states and transitions which are furthest from the start state.

There are two main reasons for the large number of states in DFAs extracted during the early stages of training: (1) The internal representation of DFA states is not well defined, i.e. network states show poor clustering. Therefore, the number of DFA states will be larger for some time during the training. As training progresses, the network learns to correctly classify more and longer strings while simultaneously the clustering of internal DFA representations improves. (2) Many algorithms for grammatical inference are based on merging of states of a canonical DFA. More states can be merged when a sufficient amount of training data is available, yielding smaller DFAs with better generalization performance. A similar behavior has been recently observed for a symbolic learning algorithm ([27]).

The combination of our knowledge encoding algorithm with the greedy approach would result in a rapid increase in the number of DFA states, and thus the network size. The computational cost of training large networks is prohibitive without the use of efficient massively parallel implementations of network training algorithms. The only alternative is to construct a denser knowledge encoding; however, the resulting network has fewer degrees of freedom. This could seriously jeopardize a network's ability to revise incorrect symbolic knowledge.


Figure 3: DFA Extraction during Training: After each cycle, DFAs that were consistent with the current working set were extracted from a network with 10 recurrent state neurons during training. The number of training strings in each working set equals 10 times the cycle number. This incremental training heuristic results in a progression of extracted DFAs which incrementally approaches the correct DFA, i.e. states and transitions close to the start state are learned early in the training process; the short training times at later stages illustrate that longer strings are primarily used to refine the knowledge about `distant' states and transitions. It is typical for the size of extracted, minimized DFAs to collapse in this process.


Thus, we adopt the following objectives for adapting the network architecture during training: (1) Adapt the network architecture according to the learned knowledge. (2) The previously learned knowledge is to be inherited by the newly initialized network with minimal loss of information. (3) Only knowledge consistent with the data should be inherited. (4) The size of the network during training should not grow beyond bounds (e.g. the number of states in the final DFA). These objectives are met by the method described below.

7.2.3 DFA Pruning Algorithm

Using the symbolic knowledge insertion and extraction paradigm for adapting the size of a recurrent network architecture, the number of recurrent state neurons is determined solely by the number of DFA states that are encoded in a new network; the number of transitions is irrelevant. Following the discussion of how the number of extracted DFA states tends to increase during training, the objective of the DFA pruning algorithm is to eliminate states and transitions which are `unlikely' to be present in the DFA that generated the training set, and thus limit the network to a manageable size. We employ three different heuristics for keeping the network size from exploding.

Ideally, we want to extract only DFAs that are consistent with the current working set, i.e. that correctly classify all strings that the network is currently trained on. However, this may be either impossible or undesirable: It is impossible to extract a consistent DFA when the current network has failed to learn to correctly classify all strings of the working set, regardless of how large the quantization level r is chosen. In the early stages of training, the complexity of the DFA that correctly classifies the current working set often increases rapidly, and the current network size may be insufficient to accommodate that increased complexity. However, even a network with imperfect classification performance has learned something from the current working set; the objective is to transfer some of the imperfect knowledge to a new, larger network. In such a situation, we extract DFAs with the smallest quantization level such that the number of strings in the current working set that are misclassified remains constant, i.e. increasing the quantization level does not result in improved DFA classification performance on the working set.

In situations where a network has successfully learned all strings of the current working set, a consistent DFA may only be extracted for a larger quantization level r. However, a large quantization level generally also implies a larger number of extracted DFA states. For illustration, consider the number of partitions that potentially correspond to DFA states in a network with 10 recurrent state neurons for quantization levels r = 2 and r = 3: the number of partitions increases from $2^{10} \approx 1{,}000$ to $3^{10} \approx 60{,}000$. Even with DFA minimization, it is plausible that the number of DFA states increases dramatically.


Figure 4: Sequence of Extracted/Inserted Knowledge: Shown is the progression from no prior knowledge to successively extracted, pruned, partial DFAs that were used to initialize the next network. A new network was initialized only when the current network failed to learn the current working set; no network changes were made from cycle 5 through 14. The number of known states determines the number of recurrent state neurons for each new network. See the text for details of the DFA pruning algorithm and how the weights were programmed.


We know from experience that inconsistent DFAs extracted with smaller quantization levels are generally very good approximations to the refined knowledge of DFAs extracted with larger quantization levels, i.e. only a very small number of strings of the current working set may be misclassified. Thus, by choosing to initialize the next network with a smaller, `weakly consistent' DFA, we trade accuracy for size. This next network, which is generally larger than previous networks, is easily capable of refining that approximate knowledge through training.

The above DFA pruning heuristic only considers consistency with the current working set. The second heuristic eliminates from the extracted DFA all states and transitions which are inconsistent with the entire training set. This step simultaneously further reduces the network size and limits the amount of wrong knowledge that is passed on to subsequent networks. We determine the amount of symbolic knowledge to be passed on to the next network generation as follows: Let $G(M_i)$ denote the graph underlying a DFA $M_i$. Given an extracted DFA $M_i$, a spanning tree $G_{ST}(M_i)$ is constructed from $M_i$ using a breadth-first traversal strategy. This ensures that state transitions close to the start state will belong to $G_{ST}(M_i)$. (We know from experience that state transitions in a DFA extracted during training that are close to the start state also tend to be present in the DFA extracted from a trained network.) This graph operation leaves a partially defined DFA $M_i'$ with underlying graph $G_{ST}(M_i)$. In order to limit the size of the next generation network, $M_i'$ is checked against all strings of the entire training set that use only production rules contained in $M_i'$ ("rule strings"). Whenever a rule string is misclassified by $M_i'$, the last state transition is removed from $M_i'$. A subset of the states of $M_i'$ is left unreachable; these unreachable states are removed by traversing the pruned spanning tree $G_{ST}(M_i)$.

In order to see that the spanning tree is necessary, consider performing the pruning operation in the original extracted DFA $M_i$. In any DFA with loops, pruning the original DFA may result in valid state transitions being removed. The underlying premise that state transitions far from the start state are more likely to be invalid than state transitions close to the start state is no longer supported by the pruning operation, because transitions close to the start state may be discarded while transitions with greater depth are retained. The construction of $G_{ST}(M_i)$ may have removed some correct state transitions, notably loops. State transitions whose source and target states were not pruned from $G_{ST}(M_i)$ are provisionally reinserted one at a time. If the reinserted state transition does not result in a misclassification of any strings in the training set, then that reinsertion becomes permanent; otherwise, the transition is discarded. Finally, we discard any transitions for which there exist no strings in the current working set, i.e. transitions which are incidental based on the network dynamics induced through learning.

This pruning procedure leaves a partially defined pruned DFA $M_{i+1}$, i.e. state transitions may not be defined for all states and all possible input symbols. Prior to the next cycle, $M_{i+1}$ is encoded into a new network $N_{i+1}$ and a new working set $T_{i+1}$ is obtained by adding a fixed number of strings to $T_i$. The rule strength H is chosen such that the rules in the pruned DFA are also extracted from the newly initialized network, i.e. the smallest such value of H is chosen. The objective is to set H to a value large enough to guarantee knowledge inheritance without biasing the network excessively; setting H too large can lead to longer training times. The value of H increases with increased complexity of the (partial) DFA $M_{i+1}$.
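The breadth-first spanning-tree step described above can be sketched as follows; the consistency check against rule strings and the provisional reinsertion of loops are omitted from this fragment.

```python
from collections import deque

def bfs_spanning_tree(delta, start):
    """Keep only breadth-first tree transitions of an extracted DFA, so that
    transitions close to the start state are retained preferentially.
    delta: dict (state, symbol) -> state."""
    tree = {}
    visited = {start}
    queue = deque([start])
    while queue:
        p = queue.popleft()
        for (q, a), r in delta.items():
            if q == p and r not in visited:  # first (shallowest) path to r wins
                tree[(q, a)] = r
                visited.add(r)
                queue.append(r)
    return tree                               # non-tree edges (e.g. loops) are dropped
```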

7.3 Simulation Results

We trained 10 different networks using (1) static networks whose topology remained unchanged during training, (2) a standard network growing technique, and (3) our proposed network adaptation based on symbolic knowledge extracted during training. Weights that were not programmed to +H or -H were initialized to small random values drawn from the interval [-0.1, 0.1]. The training set consisted of the first 1,000 strings, sorted in alphabetical order, whose labels were generated by the random 10-state DFA shown in Figure 2. Training of newly created networks was restarted with the initial working set $T_1$ and expanded to $T_1, \ldots, T_C$ where $T_C$ is the current working set. The purpose of this training strategy was to facilitate relearning of correct knowledge from previous working sets $T_1, \ldots, T_C$ that may have been lost during an extraction/insertion sequence. The initial working set consisted of the first 10 strings of the training set; its size was increased by 10 prior to each cycle. We arbitrarily chose 3 as the maximum number of strings in the current working set that an extracted DFA $M_i$ may misclassify in order to be considered `weakly consistent'.

The results of our simulations are summarized in Table 1. We were able to extract the perfect 10-state DFA for all runs. For the standard network growing method, the size of the final network varied between 9 and 12. For the symbolic method, the size varied between 9 and 14. We tested the trained networks on all strings up to length 15; we observe that the generalization performances are comparable across all three methods. However, training for the standard network growing method is significantly worse compared to the other two methods. Since only one neuron at a time is added, these networks often fail to learn the current working set within 300 epochs. Furthermore, they often also fail to converge when retrained on previous working sets $T_1, \ldots, T_{C-1}$ even after a neuron has been added. Training with the symbolic method, however, often consists of either merely refining the newly encoded knowledge or stabilizing the already induced network dynamics for longer strings. In both cases, the training effort needed to learn the current working set is small. Clearly, symbolic knowledge extracted during the learning process can be used effectively to guide the dynamic growth of recurrent neural networks.


run   static network (n=11)   standard growing technique   symbolic guidance
      epochs  gen. error      epochs  gen. error           epochs  gen. error
 1      654    2.79%           4,461   0.89%                1,337   1.32%
 2      831    1.91%           3,852   1.61%                1,491   0.72%
 3      804    3.62%           5,746   2.09%                1,274   3.40%
 4    1,073    2.83%           4,607   1.64%                1,642   1.74%
 5      962    4.96%           6,043   3.53%                1,838   2.82%
 6      791    0.93%           4,371   2.81%                1,048   0.96%
 7    1,306    2.81%           3,786   3.42%                1,559   1.77%
 8      633    3.73%           4,483   0.38%                1,628   2.82%
 9      879    0.91%           5,227   1.98%                1,781   1.95%
10    1,441    1.07%           4,973   0.85%                1,309   2.24%

Table 1: Performance Comparison: Shown are the training and generalization performance of networks trained without dynamic network adaptation, networks which were trained by adding a single neuron any time training failed to converge on a working set, and networks whose architecture was determined by extracted symbolic knowledge.

8 Conclusions

We have presented a heuristic which allows the architecture of recurrent neural networks to change dynamically during training. It is a novel approach to adapting a network architecture during training, as it is based on algorithms for the extraction and insertion of symbolic knowledge in recurrent networks during training. Information is transferred between source and target networks of different sizes during training in the form of deterministic finite-state automata (DFAs). Our knowledge insertion algorithm requires a one-to-one correspondence between DFA states and recurrent state neurons; the size of the target networks is thus determined by the size of the DFAs extracted from the source networks. The success of our heuristics depends on the availability of short training strings. However, short strings are also required for successful training with standard gradient descent algorithms due to the long-term dependency problem [3]. Essentially, gradient information vanishes for long strings, i.e. the network becomes unable to distinguish states across long input strings. Thus, data-incremental learning synergistically combines with architecture-incremental learning for an effective training procedure for recurrent neural networks. Network architecture updates were performed according to an update-after-failure policy, i.e. the network architecture was changed after it failed to converge on the current working set within a fixed number of epochs;

other criteria are possible. The transfer between networks is achieved such that no information considered valid is lost, i.e. the target networks contain all the information from the source networks that is considered valid. In order to minimize the inheritance of wrong information and simultaneously prevent excessive growth of the network size, we pruned transitions deemed invalid from the extracted DFA by testing them on the entire training set. We have shown that our approach of guiding the growth of the network with symbolic knowledge is successful and reduces the training times compared to networks with fixed architecture and standard network growing methods which add a single neuron at a time to a network. We do not claim that our constructive algorithm finds an optimal solution with respect to any criterion (e.g. smallest possible network size). It would be interesting to test our method on training sets that are sparse, i.e. on cases where not all strings up to a certain length are available, and on larger DFAs.

References

[1] University of Florida and American Association for Artificial Intelligence, Proceedings of the International Symposium on Integrating Knowledge and Neural Heuristics, (Pensacola, Florida), May 9-10, 1994.
[2] E. Barnard and D. Casasent, "Invariance and neural nets," IEEE Transactions on Neural Networks, vol. 2, pp. 498-508, 1991.
[3] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, pp. 157-166, 1994. Special Issue on Recurrent Neural Networks.
[4] H. Berenji, "Refinement of approximate reasoning-based controllers by reinforcement learning," in Machine Learning, Proceedings of the Eighth International Workshop (L. Birnbaum and G. Collins, eds.), (San Mateo, CA), pp. 475-479, Morgan Kaufmann Publishers, 1991.
[5] M. Casey, "The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction," Neural Computation, vol. 8, no. 6, pp. 1135-1178, 1996.
[6] D. Chen, C. Giles, G. Sun, H. Chen, Y. Lee, and M. Goudreau, "Constructive learning of recurrent neural networks," in Computational Learning Theory and Natural Learning Systems III (T. Petsche, S. Judd, and S. Hanson, eds.), (Cambridge, MA), MIT Press, 1993. To appear.
[7] A. Cleeremans, D. Servan-Schreiber, and J. McClelland, "Finite state automata and simple recurrent networks," Neural Computation, vol. 1, no. 3, pp. 372-381, 1989.

[8] S. Das and M. Mozer, "A unified gradient-descent/clustering architecture for finite state machine induction," in Advances in Neural Information Processing Systems 6 (J. Cowan, G. Tesauro, and J. Alspector, eds.), (San Francisco, CA), pp. 19-26, Morgan Kaufmann, 1994.
[9] J. Elman, "Finding structure in time," Cognitive Science, vol. 14, pp. 179-211, 1990.
[10] J. Elman, "Incremental learning, or the importance of starting small," Tech. Rep. CRL 9101, Center for Research in Language, University of California at San Diego, La Jolla, CA, 1991.
[11] S. Fahlman, "The cascade-correlation learning architecture," in Advances in Neural Information Processing Systems 2 (D. Touretzky, ed.), (San Mateo, CA), pp. 524-532, Morgan Kaufmann Publishers, 1990.
[12] S. Fahlman, "The recurrent cascade-correlation architecture," in Advances in Neural Information Processing Systems 3 (R. Lippmann, J. Moody, and D. Touretzky, eds.), (San Mateo, CA), pp. 190-196, Morgan Kaufmann Publishers, 1991.
[13] P. Frasconi, M. Gori, M. Maggini, and G. Soda, "A unified approach for integrating explicit knowledge and learning by example in recurrent networks," in Proceedings of the International Joint Conference on Neural Networks, vol. 1, p. 811, IEEE 91CH3049-4, 1991.
[14] P. Frasconi, M. Gori, M. Maggini, and G. Soda, "Unified integration of explicit rules and learning by example in recurrent networks," IEEE Transactions on Knowledge and Data Engineering, vol. 7, no. 2, pp. 340-346, 1995.
[15] P. Frasconi, M. Gori, and G. Soda, "Injecting nondeterministic finite state automata into recurrent networks," tech. rep., Dipartimento di Sistemi e Informatica, Università di Firenze, Florence, Italy, 1993.
[16] L. Fu and L. Fu, "Mapping rule-based systems into neural architecture," Knowledge-Based Systems, vol. 3, no. 1, pp. 48-56, 1990.
[17] C. Giles, D. Chen, C. Miller, H. Chen, G. Sun, and Y. Lee, "Second-order recurrent neural networks for grammatical inference," in Proceedings of the International Joint Conference on Neural Networks 1991, vol. II, pp. 273-281, July 1991.
[18] C. Giles, C. Miller, D. Chen, H. Chen, G. Sun, and Y. Lee, "Learning and extracting finite state automata with second-order recurrent neural networks," Neural Computation, vol. 4, no. 3, p. 380, 1992.
[19] C. Giles, C. Miller, D. Chen, G. Sun, H. Chen, and Y. Lee, "Extracting and learning an unknown grammar with recurrent neural networks," in Advances in Neural Information Processing Systems 4 (J. Moody, S. Hanson, and R. Lippmann, eds.), (San Mateo, CA), pp. 317-324, Morgan Kaufmann Publishers, 1992.

[20] C. Giles and C. Omlin, "Inserting rules into recurrent neural networks," in Neural Networks for Signal Processing II, Proceedings of The 1992 IEEE Workshop (S. Kung, F. Fallside, J. A. Sorenson, and C. Kamm, eds.), pp. 13-22, IEEE Press, 1992.
[21] C. Giles and C. Omlin, "Pruning recurrent neural networks for improved generalization performance," IEEE Transactions on Neural Networks, vol. 5, no. 5, pp. 848-851, 1994.
[22] C. Giles and C. Omlin, "Rule refinement with recurrent neural networks," in Proceedings IEEE International Conference on Neural Networks (ICNN'93), vol. II, pp. 801-806, 1993.
[23] M. Goudreau, C. L. Giles, S. Chakradhar, and D. Chen, "First-order vs. second-order single-layer recurrent neural networks," IEEE Transactions on Neural Networks, vol. 5, no. 3, pp. 511-513, 1994.
[24] B. Hassibi and D. Stork, "Second order derivatives for network pruning: optimal brain surgeon," in Advances in Neural Information Processing Systems 5 (S. Hanson, J. Cowan, and C. Giles, eds.), (San Mateo, CA), pp. 164-171, Morgan Kaufmann, 1993.
[25] J. Hopcroft and J. Ullman, Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison-Wesley Publishing Company, Inc., 1979.
[26] S. Kremer, "Comments on 'Constructive learning of recurrent neural networks: ...': cascading the proof describing limitations of recurrent cascade correlation," IEEE Transactions on Neural Networks, 1996. In press.
[27] K. Lang, "Random DFA's can be approximately learned from sparse uniform examples," in Proceedings of the Fifth ACM Workshop on Computational Learning Theory, (Pittsburgh, PA), July 1992.
[28] Y. LeCun, J. Denker, and S. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems 2 (D. Touretzky, ed.), (San Mateo, CA), Morgan Kaufmann Publishers, 1990.
[29] R. Maclin and J. Shavlik, "Refining algorithms with knowledge-based neural networks: Improving the Chou-Fasman algorithm for protein folding," in Computational Learning Theory and Natural Learning Systems (S. Hanson, G. Drastal, and R. Rivest, eds.), MIT Press, 1992.
[30] M. Mozer and P. Smolensky, "Skeletonization: A technique for trimming the fat from a network via relevance assessment," in Advances in Neural Information Processing Systems 1 (D. Touretzky, ed.), (San Mateo, CA), pp. 107-115, Morgan Kaufmann Publishers, 1989.
[31] C. Omlin and C. Giles, "Extraction of rules from discrete-time recurrent neural networks," Neural Networks, vol. 9, no. 1, pp. 41-52, 1996.

[32] C. Omlin and C. Giles, "Extraction of rules from discrete-time recurrent neural networks," Neural Networks, 1994. Accepted for publication.
[33] C. Omlin and C. Giles, "Rule revision with recurrent neural networks," IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 1, pp. 183-188, 1996.
[34] C. Omlin and C. Giles, "Training second-order recurrent neural networks using hints," in Proceedings of the Ninth International Conference on Machine Learning (D. Sleeman and P. Edwards, eds.), (San Mateo, CA), pp. 363-368, Morgan Kaufmann Publishers, 1992.
[35] C. Omlin, C. Giles, and C. Miller, "Heuristics for the extraction of rules from discrete-time recurrent neural networks," in Proceedings International Joint Conference on Neural Networks 1992, vol. I, pp. 33-38, June 1992.
[36] D. Opitz and J. Shavlik, "Dynamically adding symbolically meaningful nodes to knowledge-based neural networks," Knowledge-Based Systems, pp. 301-311, 1996.
[37] J. Pollack, "The induction of dynamical recognizers," Machine Learning, vol. 7, pp. 227-252, 1991.
[38] D. Servan-Schreiber, A. Cleeremans, and J. McClelland, "Graded state machine: The representation of temporal contingencies in simple recurrent networks," Machine Learning, vol. 7, p. 161, 1991.
[39] J. Shavlik, "Combining symbolic and neural learning," Machine Learning, vol. 14, pp. 321-331, 1994. Extended abstract of an invited talk given at the Eighth International Machine Learning Conference (ML'92).
[40] S. Suddarth and A. Holden, "Symbolic neural systems and the use of hints for developing complex systems," International Journal of Man-Machine Studies, vol. 34, p. 291, 1991.
[41] G. Towell, M. Craven, and J. Shavlik, "Constructive induction using knowledge-based neural networks," in Eighth International Machine Learning Workshop (L. Birnbaum and G. Collins, eds.), (San Mateo, CA), p. 213, Morgan Kaufmann Publishers, 1990.
[42] G. Towell and J. Shavlik, "Knowledge-based artificial neural networks," Artificial Intelligence, vol. 70, 1994. To appear.
[43] G. Towell, J. Shavlik, and M. Noordewier, "Refinement of approximately correct domain theories by knowledge-based neural networks," in Proceedings of the Eighth National Conference on Artificial Intelligence, (San Mateo, CA), p. 861, Morgan Kaufmann Publishers, 1990.
[44] V. Tresp, J. Hollatz, and S. Ahmad, "Network structuring and training using rule-based knowledge," in Advances in Neural Information Processing Systems 4 (C. Giles, S. Hanson, and J. Cowan, eds.), (San Mateo, CA), Morgan Kaufmann Publishers, 1993.

[45] R. Watrous and G. Kuhn, "Induction of finite-state languages using second-order recurrent networks," Neural Computation, vol. 4, no. 3, p. 406, 1992.
[46] R. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, vol. 1, pp. 270-280, 1989.
[47] X. Yao, "Evolutionary artificial neural networks," International Journal of Neural Systems, vol. 4, no. 3, pp. 203-222, 1993.
[48] Z. Zeng, R. Goodman, and P. Smyth, "Learning finite state machines with self-clustering recurrent networks," Neural Computation, vol. 5, no. 6, pp. 976-990, 1993.
