Inductive Bias in Context-Free Language Learning



Brad Tonkes, Alan Blair and Janet Wiles†
Department of Computer Science and Electrical Engineering
†School of Psychology
University of Queensland
{btonkes,blair,[email protected]

ABSTRACT

Recurrent neural networks are capable of learning context-free tasks; however, learning performance is unsatisfactory. We investigate the effect of biasing learning towards finding a solution to a context-free prediction task. The first series of simulations fixes various sets of weights of the network to values found in a successful network, limiting the search space of the backpropagation through time learning algorithm. We find that fixing similar sets of weights can have very different effects on learning performance. The second series of simulations employs an evolutionary hill-climbing algorithm with an error measure that more closely resembles the performance measure. We find that under these conditions, the network finds different solutions to those found by backpropagation, and is even biased towards finding these solutions. An unexpected result is that the hill-climbing algorithm is capable of generalisation. The two simulations serve to highlight that seemingly similar biases can have opposite effects on learning.

1. Introduction

While a large body of research has focused on the relationship between recurrent neural networks (RNNs) and deterministic finite automata (DFAs) (see for example [2, 4]), recent results have shown RNNs capable of learning context-free languages [5, 7] and have suggested that RNNs may have a bias towards inducing non-regular languages [1]. This result gives us a timely warning that when we consider language induction, we should remember the types of biases that are inherent in the learning situation. The induced language is related not only to the training language but also to the induction biases of the system.

Sources of inductive bias can be broadly categorised into three classes: data, architecture and the learning algorithm. There is also some bias as a result of the interaction between these sources. Adding one source of bias can affect what other sorts of biases may be added. For example, in the second series of simulations, changing the error measure used for learning affects the type of learning algorithms that can be applied.

A way of adding bias to language induction in RNNs is the choice of training data. Training data may be manipulated by changing the distribution or order of strings in the training set. A number of results indicate that it is advisable to bias RNNs by using shorter strings early in training [3]. Ideally, adding biases to a system should improve

the performance of learning. Improvement may be realised by an increase in efficiency: the amount of training required to find a solution; reliability: the regularity with which a successful network is induced; or consistency: whether or not continued training improves performance on the task. This study investigates the effect of adding biases to a network learning a context-free task: prediction of the next letter in a sequence from the language a^n b^n. The solution to this task is well understood [5, 7]. Learning this task with backpropagation through time appears to be inefficient, unreliable and particularly inconsistent [6]. With this previous result in mind we consider two ways of biasing our system and examine the resulting effects on learning performance.
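To make the task concrete, here is a minimal sketch in Python of how such a training stream and its prediction targets might be generated. The function names and the 1/n length distribution are our own illustration; the paper specifies only that shorter strings are more frequent in the training data.

    import random

    def anbn(n):
        """One string from the language a^n b^n."""
        return "a" * n + "b" * n

    def targets(s):
        """Next-symbol targets for the prediction task: each symbol's
        successor, with 'a' (the start of the next string) after the
        final 'b'. Predictions made inside the block of a's are
        non-deterministic, since the network cannot know n in advance."""
        return list(s[1:]) + ["a"]

    def training_stream(num_strings, max_n):
        """Sample strings with shorter strings more frequent
        (illustrative choice: weight 1/n for length n)."""
        lengths = range(1, max_n + 1)
        weights = [1.0 / n for n in lengths]
        for _ in range(num_strings):
            yield anbn(random.choices(lengths, weights=weights)[0])

    for s in training_stream(5, 10):
        print(s, "->", "".join(targets(s)))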

2. Backpropagation Through Time

2.1. Issues

The standard backpropagation through time algorithm (BPTT) [8] is not particularly effective at inducing a network that performs the a^n b^n prediction task. It typically fails to find a solution in a reasonable amount of time [7], and any solution that is found is lost with further training [6]. Consequently, we consider mechanisms to bias learning towards finding desirable solutions.

One possible way of adding bias to learning is to simplify the network so that BPTT does not need to fit as many parameters. Since the architecture we are using is of close to minimal size, we cannot simply remove weights. An alternative is to fix some of the weights by setting them to values found in a successful network and not modifying them during training. Rather than fixing random collections of weights, we choose specific groups that contribute to some property of the dynamics. By fixing collections of weights, we reduce the size of the space that BPTT searches. Ideally, there should be some set of weights to fix that restricts BPTT to searching along smooth directions of the error surface, improving the consistency and efficiency of learning. Additionally, restricting the search space of BPTT should not prevent it from finding a solution; reliability should also be improved.
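One simple way to realise this kind of weight-fixing (a sketch, not the authors' implementation) is to zero the update for frozen parameters after each BPTT backward pass; the learning rate and momentum below match those reported in section 2.2.

    import numpy as np

    def masked_update(weights, grads, trainable, velocity,
                      lr=0.05, momentum=0.3):
        """Gradient step with momentum that leaves fixed weights
        untouched. `trainable` is a 0/1 array of the same shape as
        `weights`: 1 for weights BPTT may adjust, 0 for weights held
        at the values taken from the successful network."""
        velocity[:] = momentum * velocity - lr * grads
        velocity *= trainable              # no movement for fixed weights
        weights += velocity
        return weights

    # Example: hold the recurrent weights (case 4) fixed, assuming a
    # hypothetical flat layout where the first 4 entries are recurrent.
    w = np.random.uniform(-1.0, 1.0, 10)   # free weights start in [-1, 1]
    v = np.zeros_like(w)
    mask = np.ones(10)
    mask[:4] = 0.0
    w = masked_update(w, np.random.randn(10), mask, v)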

2.2. Simulation

One network was trained with BPTT on the a^n b^n prediction task, with n varying from 1 to 10. At a point where the network successfully performed the task for all of the training strings and generalised to n = 11 and n = 12, the weights of the network were saved. Subsequent networks had some of their weights fixed to these values, shown in figure 1. This network displayed dynamics similar to those shown in figure 2 (bottom).

Twelve different sets of weights were selected to be fixed during training. We report the seven most interesting cases. These seven cases set the following weights to values found in a successful network, and fix them at these values during learning.

1. The weights into and between the hidden units, and hidden unit biases.
2. The hidden to output weights, and output biases
   (a) only;
   (b) and recurrent weights;
   (c) and mutual-recurrent weights (the weights from the first hidden unit to the second, and vice versa);
   (d) and self-recurrent weights.
3. The recurrent weights.
4. The non-recurrent weights, and all biases.

For each of these seven conditions, 25 or 50 networks were trained with a sequence of 10000 strings from the language a^n b^n, with more shorter strings than longer strings. Training strings were between 2 (n = 1) and 20 (n = 10) characters long. After each string in the sequence had been presented, the weights of the network were saved and the performance of the network tested for n varying from 1 to 12. This process yields 10000 test points during training.

[Figure 1: diagram of the network, with weight values shown on the connections; not reproduced here.]

Fig. 1: The weights of the successful network and the training condition where the output unit weights and biases and the recurrent weights are fixed.

The weights of the network that were not fixed were initialised to random values between -1 and 1. A learning rate of 0.05 and a momentum term of 0.3 were applied during learning. Nine copies of the hidden units were maintained by the algorithm. These parameters are identical to those used in [6].

The consistency of learning is characterised by the number of these tests on which the network is able to predict all 12 test strings. Similarly, we measure the efficiency of learning by the number of strings presented before the first successful test. The number of networks that ever give a positive test result indicates the reliability of learning.
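In code, the three measures reduce to simple statistics over the per-test success flags. A minimal sketch, assuming results[i][t] records whether network i passed test t (all 12 strings correct); the variable names are our own.

    def learning_measures(results):
        """results[i][t]: True if network i predicted all 12 test
        strings correctly at test point t (one test per training
        string presented)."""
        ever = [any(r) for r in results]
        reliability = 100.0 * sum(ever) / len(results)   # % of nets that ever succeed
        consistency = max(sum(r) for r in results)       # most successful tests by any net
        firsts = [r.index(True) + 1 for r in results if any(r)]
        efficiency = min(firsts) if firsts else None     # fewest strings before success
        return reliability, consistency, efficiency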

2.3. Results

Table 1 shows how successful BPTT was under each of the fixed weight conditions, as well as the results reported in [6] when no weights were fixed.

    Fixed Weights    Nets   % Success       Correct      Fastest
                            (Reliability)   (Consist.)   (Effic.)
    None             100    13              20           ≈5000
    Hidden Unit       50    44              44           70
    Output Unit       50    52              85           2092
      & Rec.          50     0              -            -
      & Mut-Rec.      25     0              -            -
      & Self-Rec.     25    96              23           540
    Non-Rec.          25    48              94           5440
    Recurrent         50     0              -            -

Table 1: Results of simulations. Listed are the fixed weights, the number of networks trained, the percentage of the trained networks that were successful, the maximum number of times a network tested correctly, and the minimum number of strings required before a network tested correctly.

Not surprisingly, the case where all hidden unit weights were fixed (case 1) found solutions very quickly (70 compared to 5000 strings). In this situation only the output hyperplanes are learned. However, the network was unable to maintain the solution, with the hyperplane often misclassifying the longer strings. We believe that this result is a consequence of the distribution of strings, with the greater proportion of shorter strings tending to pull the hyperplane closer to the "b" saddle point.

The 'dual' to this network is the one with only the weights into the output units fixed (case 2a). The networks trained under these conditions were more successful and still significantly faster than the "standard" case. Learning with these biases was comparably reliable and more consistent than the 'dual' condition, though not as efficient.

Given this success, we expected that fixing the output hyperplanes as well as some of the recurrent weights would have improved the performance of BPTT further. However, when in addition to the output hyperplanes, all of the recurrent weights (case 2b) or the mutual-recurrent weights (case 2c) were fixed, BPTT completely failed to find a correct network. Incongruously, fixing the output unit weights and the self-recurrent weights (case 2d) resulted in a 96% success rate, as well as significantly boosting the efficiency of learning, although reducing consistency. Fixing the non-recurrent weights (case 3) of the network significantly improved reliability, but learning was neither consistent nor more efficient than the standard case. No network with all recurrent weights fixed (case 4) found a solution.

3. Evolutionary Hill Climbing


3.1. Issues

The first simulation focused on biasing learning by restricting the search space. In a second series of simulations, we consider an alternative method of adding bias by changing the error measure used during learning. BPTT performs pseudo-gradient descent on the mean squared error (MSE) of the network, whereas the performance of the network is evaluated by the length of strings that are correctly predicted based on a 0.5 output threshold. Consequently, the next series of simulations involves choosing an error measure that is commensurate with our method of evaluation, based on the number of correct strings (NCS). The NCS for a network is the largest N such that the network can successfully predict all strings a^n b^n from n = 1 to n = N.

We used a hill-climbing evolutionary algorithm (EA) with a two-tier error measure to learn the task. The first discrimination made between two networks is the NCS measure, with MSE used only as a secondary discrimination between networks. The advantage of this algorithm is improved consistency. Since the EA hill-climbs on NCS, it also hill-climbs on our performance measure. One possible drawback to using an evolutionary algorithm is the efficiency of the approach. Whereas BPTT performs a directed search of the space based on error, an EA searches in a randomised fashion. For this reason we expected the EA to require a considerable amount of time to find solutions.
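A sketch of how NCS might be computed. The predict helper is hypothetical, and only the deterministic portion of each string (from the first b onwards) is scored, since predictions made during the a-block cannot be correct in general.

    def deterministic_targets(n):
        """Predictable targets for a^n b^n: after the first b appears,
        the remaining n-1 b's, then the 'a' starting the next string."""
        return ["b"] * (n - 1) + ["a"]

    def ncs(predict, max_n=100):
        """Number of correct strings: the largest N such that the
        network correctly predicts every string a^n b^n for n = 1..N.
        `predict(s)` is a hypothetical helper returning one thresholded
        output symbol per input symbol of s."""
        for n in range(1, max_n + 1):
            outputs = predict("a" * n + "b" * n)
            if outputs[n:] != deterministic_targets(n):
                return n - 1
        return max_n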


[Figure 2: three panels plotting HU2 activation against HU1 activation, with the "a" and "b" regions defined by the output hyperplane marked.]

Fig. 2: Hidden unit trajectories and output hyperplane for a^8 b^8 for the three solution classes. Top: monotonic solution with signature (2,0); middle: exotic oscillating solution with signature (1,1); bottom: typical oscillating solution with signature (0,2).


3.2. Simulation

The network architecture was slightly altered for this simulation. In order to reduce redundant weights in the network that would have slowed the algorithm, only one output unit was used. The initial network is generated with random weights chosen from a gaussian distribution with standard deviation 0.1. At each generation, a new network is generated from the current network by adding to each weight a gaussian random variable with standard deviation 0.01. Suppose the current network can correctly predict (excluding the non-deterministic step) all strings a^n b^n for 1 ≤ n ≤ N, but fails for n = N+1. If the new network can correctly predict a^n b^n for 1 ≤ n ≤ M, with M > N, then replace the current network with the new one. If the new network can correctly predict a^n b^n for 1 ≤ n ≤ N, calculate the mean squared error it accrues while processing a^{N+1} b^{N+1}. If this error is lower for the new network than for the current network, replace the current network with the new one.

In preliminary training, we ran this algorithm three times, and found that in each case it was able to find a solution achieving depth greater than 30. Once the depth reaches about 15, it typically begins to jump by 2 or 3 at a time. This phenomenon can be considered as a form of generalisation: although the network is being evolved to the next longest string, it is generalising to even longer strings. One of these networks found an oscillating solution (figure 2, bottom) of the kind commonly produced by backpropagation [7]. Another found a monotonic solution which counts by using two attractors rather than an attractor and a saddle point (figure 2, top). Solutions of this kind had been found by backpropagation, but typically were unable to generalise to longer strings [5]. It is interesting, therefore, that this monotonic network demonstrated considerable capacity for generalisation. The third run (actually a slight variant of the above algorithm) produced an unusual solution of a kind that had not previously been observed (figure 2, middle). In this network, "counting" of both the a's and b's is performed by the same hidden unit, with the other unit used to "switch mode".
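The acceptance rule described at the start of this section reduces to a few lines. In this sketch, evaluate_ncs and mse_on_next are hypothetical stand-ins for running the network; only the acceptance logic follows the paper.

    import numpy as np

    def hill_climb(evaluate_ncs, mse_on_next, num_weights,
                   generations=100_000, init_sd=0.1, step_sd=0.01):
        """Evolutionary hill-climbing on NCS, with MSE on the next
        longest string a^{N+1} b^{N+1} as the tie-breaker.
        evaluate_ncs(w) -> depth N reached by weight vector w;
        mse_on_next(w, N) -> MSE accrued on a^{N+1} b^{N+1}."""
        rng = np.random.default_rng()
        current = rng.normal(0.0, init_sd, num_weights)
        depth = evaluate_ncs(current)
        for _ in range(generations):
            child = current + rng.normal(0.0, step_sd, num_weights)
            child_depth = evaluate_ncs(child)
            if child_depth > depth:                     # deeper: accept
                current, depth = child, child_depth
            elif child_depth == depth and (
                    mse_on_next(child, depth) <
                    mse_on_next(current, depth)):       # tie: lower MSE wins
                current = child
        return current, depth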

3.2.1. Actual Trials

To test the reliability of this algorithm, we then performed 50 more runs for 100,000 generations each. (This number of generations was selected to be comparable to the number of weight updates made by BPTT on 10000 strings.) The average number of strings tested per generation was 5.7, and the average length of each string tested was 9.3. A run was defined as "successful" if it achieved depth 12 within 100,000 generations. The weight sets produced by this algorithm can be classified according to the signature (p, q) of the recurrent weights, where p and q are the number of positive and negative self-weights, respectively. Networks with signature (0,2) always exhibited oscillating behaviour, while those with signature (2,0) were monotonic in nature. Of those with signature (1,1), about half were oscillating and half monotonic. The exotic solution of figure 2 (middle) also had signature (1,1).
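Classifying a network by its signature is then a simple count over its self-recurrent weights (a sketch with an assumed two-element weight list):

    def signature(self_weights):
        """(p, q): numbers of positive and negative self-recurrent
        weights, e.g. signature([1.4, -0.4]) == (1, 1)."""
        p = sum(1 for w in self_weights if w > 0)
        q = sum(1 for w in self_weights if w < 0)
        return (p, q)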

3.3. Results

Eighteen of fifty runs found networks with NCS ≥ 12. Of the 50 final networks, 5 failed to predict any strings correctly, consequently did not produce any output and were not classified. The performance of the EA is summarised in table 2.

    signature      unsuccessful   successful   total
    (2,0)                 2            11         13
    (0,2)                15             6         21
    (1,1)                10             1         11
    unclassified          5             0          5
    total                32            18         50

Table 2: Success of the evolutionary algorithm at inducing correct networks. The algorithm is judged to be successful if it induces a network that can correctly perform the prediction task for n varying from 1 to 12. For the purposes of this table, the signature of a network is taken from the network at the final generation.

As table 2 shows, networks with signature (2,0) were most likely to be successful, those with signature (1,1) least likely. Figure 3 plots the average depth as a function of generation, grouped by signature.


[Figure 3: average depth (0 to 20) against generation (0 to 100,000) for the monotonic (2,0), typical oscillating (0,2), and exotic (1,1) classes.]

Fig. 3: Efficiency of the algorithm at inducing correct networks for each of the three solution classes. Note that the plots are not always increasing because the signature occasionally changes during the course of a run.

4. Discussion

The two groups of simulations explored the effects of adding bias on learning performance on the a^n b^n prediction task. The first series of simulations biased learning by restricting the search space in different ways and found that there was no simple relationship between the fixed weights and learning performance. When the output unit weights and biases were fixed, BPTT found a solution around 50% of the time (compared with 13% in the unrestricted case). Additionally fixing the self-recurrent weights resulted in near-perfect reliability, but when all recurrent weights were fixed, BPTT never found a solution. The dramatic change in performance may be a result of preventing BPTT from searching in necessary regions of the space. Overall, the performance of BPTT proved to be very sensitive to restrictions on the search space.

The performance of the evolutionary hill-climbing algorithm was somewhat surprising. Not only did the algorithm find solutions within a number of weight updates comparable to BPTT, but the best trials found solutions that could perform the task for very long strings; indeed, the best solutions we have found yet (up to n = 53). Furthermore, the evolutionary algorithm found not only the monotonic solution that BPTT does not find very easily [5], but also a solution that BPTT has never found. In contrast to BPTT, the evolutionary algorithm seems to have a preference for monotonic solutions, inducing more successful networks of this type than successful oscillating networks.

The most surprising finding of the simulations was the way in which the evolutionary algorithm generalised. Networks with both oscillating and monotonic dynamics often increased their depth of processing by more than one string in a single generation. In one case a single update improved the performance of a monotonic network from n = 33 to n = 45. Since the error measure used by the algorithm is aimed specifically at only the next longest string, there seems to be some bias inherent in the network towards inducing longer strings.

5. Conclusions

Not surprisingly, our simulations have shown that adding bias to a system has an impact on learning. However, the extent of this impact is somewhat surprising. The first series of simulations demonstrated that applying two seemingly similar biases could result in near-perfect reliability or failure to find any solutions. From this we gather that learning is particularly sensitive to biases.

Another unexpected result was that the hill-climbing algorithm induced successful monotonic networks, unlike BPTT. Indeed, the EA actually appeared biased towards finding this solution, with more successful networks displaying monotonic rather than oscillating dynamics. In light of this result, one can see how RNNs may be biased towards inducing non-DFA-like dynamics even when trained on regular languages [1].

This possibility is given even more credence by the finding that the evolutionary hill-climbing algorithm was capable of generalisation. The way that generalisation occurs, in a context-free manner, suggests that RNNs may be biased towards finding solutions that are not analogous to DFA computation.

The simulations detailed in this paper are part of work in progress. One of the fixed-weight conditions significantly improved reliability, and others improved efficiency. Furthermore, the EA was able to learn consistently, unlike BPTT. Although no single bias improved the reliability, efficiency and consistency of learning, the individual biases gave significant clues as to what makes an effective bias. Based on these clues we seek biases that improve all three measures of learning performance.

Acknowledgements

This work was supported by an APA to Brad Tonkes, a UQ Postdoctoral Fellowship to Alan Blair, an ARC grant to Janet Wiles, and by the CSEE department.

References

[1] A. Blair and J. Pollack. Analysis of dynamical recognizers. Neural Computation, 9(5):1127-1142, 1997.

[2] M. Casey. The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6):1135-1178, 1996.

[3] J. Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48:71-99, 1993.

[4] C.L. Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun, and Y.C. Lee. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3):393, 1992.

[5] P. Rodriguez, J. Wiles, and J. Elman. A dynamical systems analysis of a recurrent neural network that learns to represent the structure of a simple context-free language. Under review.

[6] B. Tonkes and J. Wiles. Learning a context-free task with a recurrent neural network: An analysis of stability. To appear: Proceedings of the Fourth Biennial Conference of the Australasian Cognitive Science Society, 1997.

[7] J. Wiles and J. Elman. Learning to count without a counter: A case study of dynamics and activation landscapes in recurrent networks. In Proceedings of the Seventeenth Annual Conference of the Cognitive Science Society, pages 482-487, Cambridge, MA, 1995. MIT Press.

[8] D. Zipser. Subgrouping reduces complexity and speeds up learning in recurrent networks. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems II, pages 638-641. Morgan Kaufmann Publishers, 1990.