Learning stochastic regular grammars with recurrent neural networks

Rafael C. Carrasco and Mikel L. Forcada

Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, E-03071 Alicante, Spain. E-mail: {carrasco, [email protected]}

Abstract

Recent work has shown that the extraction of symbolic rules improves the generalization power of recurrent neural networks trained with complete samples of regular languages. This paper explores the possibility of learning rules when the network is trained with stochastic data. For this purpose, a network with two layers is used. If an automaton is extracted from the network after training and its transition probabilities are estimated from the sample, the relative entropy with respect to the sample is reduced.

1 Introduction

A number of papers (Giles et al. 1992a,b; Watrous and Kuhn 1992a,b) have explored the ability of second-order recurrent neural networks to learn simple regular grammars from complete (positive and negative) training samples. The generalization performance of a finite-state automaton extracted from the trained network is better than that of the network itself (Giles et al. 1992a,b; Omlin and Giles 1996). In most cases, they find that once the network has learned to classify all the words in the training set, the states of the hidden neurons visit only small regions (clusters) of the available configuration space during recognition. These clusters can be identified as the states of the inferred deterministic finite automaton (DFA), and the transitions occurring among these clusters when symbols are read determine the transition function of the DFA. Although automaton extraction remains a controversial issue (Kolen 1994), Casey (1996) has recently shown that a recurrent neural network can robustly model a DFA if its state space is partitioned into a number of disjoint sets equal to the number of states in the minimum equivalent automaton. An automaton extraction algorithm can be found in Omlin and Giles (1996). The extraction is based on partitioning the

configuration hypercube [0.0, 1.0]^N into smaller hypercubes, such that at most one of the clusters falls into each defined region. An improved algorithm, which does not restrict clusters to point-like form, is described in Blair and Pollack (1996). However, complete samples, which include both examples and counterexamples, are scarce and, in most cases, only a source of random (or noisy) examples is available. In such situations, learning the underlying grammar will clearly improve the estimation of the probabilities, above all for those strings which do not appear in the sample. In this paper, the possibility of extracting symbolic rules from a network trained with stochastic samples is explored. The architecture of the network is described in the next section; it can be viewed as an Elman (1990) net with a second-order next-state function, as used in Blair and Pollack (1996). The related work of Castaño et al. (1993) uses an Elman architecture to predict the probability of the next symbol, but it does not address the question of what symbolic representation is learned by the net. An algorithmic approach to similar questions can be found in Carrasco and Oncina (1994).
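As a rough illustration of the partition step used in such extraction algorithms, the following Python sketch maps a hidden-state vector to the index of the small hypercube that contains it; the helper name cell_index and the choice of q divisions per dimension are illustrative assumptions, not details taken from Omlin and Giles (1996). Transitions observed between visited cells while the network reads strings then define the transition function of the candidate DFA.

```python
import numpy as np

def cell_index(state, q=4):
    """Map a hidden-state vector in [0.0, 1.0]^N to the small hypercube
    (cell) it falls into, using q divisions per dimension.  Visited cells
    play the role of candidate DFA states during extraction."""
    s = np.asarray(state, dtype=float)
    return tuple(np.minimum((s * q).astype(int), q - 1))
```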

2 Network architecture

Assume that the input alphabet has L symbols, including the end-of-string symbol. Then, the second-order recurrent neural network consists of N hidden neurons plus L input and L output neurons, whose states are labeled S_k, I_k and O_k respectively (see Fig. 1). The network reads one character per cycle (with t = 0 for the first cycle), and the input vector is I_k = δ_{kl} when a_l (the l-th symbol in the alphabet) is read, where the Kronecker delta δ_{kl} is 1 if k = l and zero otherwise. Its output O_k^{[t+1]} represents the probability that the string continues with symbol a_k. The dynamics is given by the equations:

O_k^{[t]} = g\left( \sum_{j=1}^{N} V_{kj} S_j^{[t]} \right)    (1)

S_i^{[t]} = g\left( \sum_{j=1}^{N} \sum_{k=1}^{L} W_{ijk} S_j^{[t-1]} I_k^{[t-1]} \right)    (2)
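For concreteness, a minimal numpy sketch of one cycle of these dynamics is given below; the array shapes, the one-hot input encoding and the helper name step are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(S_prev, l, W, V):
    """One cycle of the second-order network (eqs. 1-2).

    S_prev : previous hidden state S^{[t-1]}, shape (N,)
    l      : index of the symbol read at time t-1
    W      : next-state weights W_ijk, shape (N, N, L)
    V      : output weights V_kj, shape (L, N)
    """
    L = W.shape[2]
    I_prev = np.eye(L)[l]                            # one-hot input, I_k = delta_{kl}
    xi = np.einsum('ijk,j,k->i', W, S_prev, I_prev)  # net inputs xi_i^{[t]}
    S = sigmoid(xi)                                  # eq. (2)
    O = sigmoid(V @ S)                               # eq. (1): next-symbol probabilities
    return S, O
```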

where g(x) = 1/(1 + e^{-x}). The network is trained with a version of Williams and Zipser's (1989) Real Time Recurrent Learning (RTRL) algorithm. The contribution of every string w in the sample to the energy (error function) is given by

E_w = \sum_{t=0}^{|w|} E_w^{[t]}    (3)

Figure 1: Second-order recurrent neural network with N = 3 and L = 2 (output, state and symbol units, with weights V_kj and W_ijk).


where

E_w^{[t]} = \frac{1}{2} \sum_{k=1}^{L} \left( O_k^{[t]} - I_k^{[t]} \right)^2    (4)
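A sketch of the per-string energy of eqs. (3)-(4) follows, reusing sigmoid and step from the sketch above; the function name and the convention that the word already carries the end-of-string symbol as its last element are assumptions.

```python
def string_energy(word, S0, W, V):
    """Energy E_w of one string (eqs. 3-4).

    word : sequence of symbol indices, with the end-of-string symbol last
    S0   : initial hidden state S^{[0]}
    """
    L = V.shape[0]
    E, S = 0.0, S0
    for l in word:
        O = sigmoid(V @ S)                    # prediction for the symbol read at this cycle
        target = np.eye(L)[l]                 # observed input I^{[t]}
        E += 0.5 * np.sum((O - target) ** 2)  # eq. (4)
        S, _ = step(S, l, W, V)               # advance the state with eq. (2)
    return E
```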

It is not difficult to prove that this energy has a minimum when O_k^{[t]} is the conditional probability of getting symbol a_k after reading the first t symbols in w. Because a probability of one (or zero) is difficult to learn, a contractive mapping f(x) = ε + (1 - 2ε)x, with ε a small positive number, is used. This mapping does not affect the formalism presented here.

Usually, the weights are set to random values and then learned, while the initial states of the hidden neurons S_i^{[0]} are kept fixed during training. As remarked in Forcada and Carrasco (1995), there is no reason to force the initial states to remain fixed, and the behavior of the network improves if they are also learned. This is even more important here, because the predicted probability O_k^{[t]} depends on the states S_i^{[t]}, while in a mere classification task it suffices that the accepting states lie in the acceptance region and the rest in the non-acceptance region. In order to keep S_i^{[t]} within the range [0.0, 1.0], one rather defines the net inputs ξ_i^{[t]} such that S_i^{[t]} = g(ξ_i^{[t]}) and takes ξ_i^{[0]} as the parameters to be optimized. Therefore, all parameters V_{kj}, W_{ijk} and ξ_i^{[0]} are randomly initialized before training, and then learned together. Every parameter P is updated according to gradient descent:

\Delta P = -\alpha \frac{\partial E}{\partial P} + \mu \Delta' P    (5)

with a learning rate α and a momentum term μ. The value Δ'P represents the variation of parameter P at the previous iteration. The contribution of every symbol in w to the energy derivative is:

\frac{\partial E_w^{[t]}}{\partial P} = \sum_{a=1}^{L} \left( O_a^{[t]} - I_a^{[t]} \right) O_a^{[t]} \left( 1 - O_a^{[t]} \right) \sum_{b=1}^{N} \left( \frac{\partial V_{ab}}{\partial P} S_b^{[t]} + V_{ab} S_b^{[t]} \left( 1 - S_b^{[t]} \right) \frac{\partial \xi_b^{[t]}}{\partial P} \right)    (6)

where the derivative of the sigmoid function, g'(x) = g(x)(1 - g(x)), was used. The derivatives with respect to V_{kj} may be evaluated in a single step:

\frac{\partial E_w^{[t]}}{\partial V_{kj}} = \left( O_k^{[t]} - I_k^{[t]} \right) O_k^{[t]} \left( 1 - O_k^{[t]} \right) S_j^{[t]}    (7)
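In code, eq. (7) is a simple outer product; the fragment below builds on the arrays of the earlier sketches, and its function name is an assumption.

```python
def dE_dV(O, target, S):
    """Derivative of E_w^{[t]} with respect to V (eq. 7): the (k, j)
    entry is (O_k - I_k) O_k (1 - O_k) S_j."""
    delta = (O - target) * O * (1.0 - O)  # output error times sigmoid derivative
    return np.outer(delta, S)             # shape (L, N), same as V
```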

However, the derivatives with respect to ξ_i^{[0]} and W_{ijk} are computed in a recurrent way, using the following equations:

\frac{\partial \xi_m^{[0]}}{\partial W_{ijk}} = 0    (8)

\frac{\partial \xi_i^{[0]}}{\partial \xi_j^{[0]}} = \delta_{ij}    (9)

\frac{\partial \xi_i^{[t]}}{\partial \xi_j^{[0]}} = \sum_{a=1}^{N} S_a^{[t-1]} \left( 1 - S_a^{[t-1]} \right) \frac{\partial \xi_a^{[t-1]}}{\partial \xi_j^{[0]}} \sum_{b=1}^{L} W_{iab} I_b^{[t-1]}    (10)

\frac{\partial \xi_m^{[t]}}{\partial W_{ijk}} = \delta_{im} S_j^{[t-1]} I_k^{[t-1]} + \sum_{a=1}^{N} S_a^{[t-1]} \left( 1 - S_a^{[t-1]} \right) \frac{\partial \xi_a^{[t-1]}}{\partial W_{ijk}} \sum_{b=1}^{L} W_{mab} I_b^{[t-1]}    (11)
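The sketch below shows how the recurrences (8)-(11) can be carried forward along a string to accumulate the gradients used in the update (5); it reuses sigmoid from the earlier sketch, and the function name, argument order and array layout are assumptions made for illustration.

```python
def rtrl_string_gradients(word, xi0, W, V):
    """Accumulate dE_w/dV, dE_w/dW and dE_w/dxi^{[0]} for one string by
    propagating the sensitivities of eqs. (8)-(11) forward in time."""
    N, _, L = W.shape
    dxi_dW = np.zeros((N, N, N, L))       # d xi_m / d W_ijk, eq. (8): zero at t = 0
    dxi_dxi0 = np.eye(N)                  # d xi_i / d xi_j^{[0]}, eq. (9)
    gradV, gradW, gradxi0 = np.zeros_like(V), np.zeros_like(W), np.zeros(N)
    xi = xi0.copy()
    for l in word:                        # word includes the end-of-string symbol
        S = sigmoid(xi)
        O = sigmoid(V @ S)
        I = np.eye(L)[l]                  # observed symbol I^{[t]}
        delta_out = (O - I) * O * (1.0 - O)
        gradV += np.outer(delta_out, S)                       # eq. (7)
        dE_dxi = (V.T @ delta_out) * S * (1.0 - S)            # chain rule inside eq. (6)
        gradW += np.einsum('m,mijk->ijk', dE_dxi, dxi_dW)
        gradxi0 += dE_dxi @ dxi_dxi0
        # propagate the sensitivities one time step (eqs. 10-11)
        A = np.einsum('mab,b->ma', W, I)                      # sum_b W_{mab} I_b^{[t]}
        gate = S * (1.0 - S)
        dxi_dxi0 = A @ (gate[:, None] * dxi_dxi0)             # eq. (10)
        dxi_dW = np.einsum('ma,a,aijk->mijk', A, gate, dxi_dW)
        dxi_dW += np.einsum('mi,j,k->mijk', np.eye(N), S, I)  # delta_{im} S_j I_k term of eq. (11)
        xi = np.einsum('ijk,j,k->i', W, S, I)                 # eq. (2): next net inputs
    return gradV, gradW, gradxi0
```

With these gradients, eq. (5) would then move each parameter, e.g. ΔW = -α gradW + μ Δ'W, and similarly for V and ξ^{[0]}.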

3 Results and discussion

The behavior of the network has been checked using the following stochastic deterministic regular grammar:

S → 0S (0.2)    S → 1A (0.8)
A → 0B (0.7)    A → 1S (0.3)    (12)
B → 0A (0.4)    B → 1B (0.1)

where the numbers in parentheses indicate the probabilities of the rules. The end-of-string probability for a given variable is the difference between 1 and the sum of the probabilities of all rules with that variable on the left. The number of input/output neurons is L = 3 (corresponding to 0, 1 and end of string, respectively), while the number of hidden neurons in the network was set to N = 4. The network was trained with 800 random examples (the sample contained an average of 200 different strings), and all of them were used at every iteration. The learning rate was α = 0.004 and the momentum term μ = 0.2. The training process for a given sample involved 1000 iterations.

Figure 2 shows a typical section of the configuration space [0.0, 1.0]^N for a trained net. Every point represents a vector S^{[t]}, generated for all words of length up to 10. As seen in the figure, only some regions are visited, and three clusters appear, corresponding to the three variables in the grammar. Clusters were observed in all runs, although in 2 out of 10 the number of clusters was larger, leading to equivalent non-minimized automata. In Fig. 3, the projection onto the output space (O_1, O_2, O_3) is plotted. The clusters map into the regions around (0.2, 0.8, 0.0), (0.7, 0.3, 0.0) and (0.4, 0.1, 0.5), in agreement with the true probabilities. Note that, even if not required by the model, the sum of all three components is 1. Note also that, due to the contractive mapping (we take ε = 0.1), one gets some values slightly below zero, which should be treated as rounding errors. As observed in the picture, the network is able to predict the next-symbol probability with good accuracy.
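To make the experimental setup concrete, a possible way of drawing such random examples from grammar (12) is sketched below; the data structure and function name are illustrative assumptions, since the paper does not describe its string generator.

```python
import random

# Rules of grammar (12): probability, emitted symbol, next variable.
# The probability mass not covered by the rules of a variable is its
# end-of-string probability (0.5 for B, 0 for S and A).
RULES = {
    'S': [(0.2, '0', 'S'), (0.8, '1', 'A')],
    'A': [(0.7, '0', 'B'), (0.3, '1', 'S')],
    'B': [(0.4, '0', 'A'), (0.1, '1', 'B')],
}

def sample_string(start='S', rng=random):
    """Draw one string from the stochastic regular grammar (12)."""
    out, variable = [], start
    while True:
        r, acc, chosen = rng.random(), 0.0, None
        for p, symbol, nxt in RULES[variable]:
            acc += p
            if r < acc:
                chosen = (symbol, nxt)
                break
        if chosen is None:          # leftover probability: end of string
            return ''.join(out)
        out.append(chosen[0])
        variable = chosen[1]

sample = [sample_string() for _ in range(800)]  # 800 random examples, as in the experiments
```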



Figure 2: Configuration space (axes s_1, s_2, s_3) after training with 800 examples.

A measure of the similarity between the probability distribution q(w) estimated with the trained network and the probabilities p(w) of the strings in the sample is given by the relative entropy or Kullback-Leibler distance (Cover and Thomas 1991):

d(p, q) = \sum_{w} p(w) \log_2 \frac{p(w)}{q(w)}    (13)

which gave an average value of 2 bits in the experiments. However, when the automaton was extracted from the network and its transition probabilities were estimated using the sample, the distribution q(w) generated by this stochastic automaton reduced the average distance d(p, q) to 1.2 bits. As the sample had 5.2 bits of entropy on average, it is clear that the automaton extracted from the network, with experimentally estimated transition probabilities, outperforms the network in modeling the stochastic language.
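A direct way to evaluate eq. (13) on a finite sample is sketched below; taking the empirical p(w) as relative frequencies in the sample and passing the model as a probability function are assumptions about the interface, not details from the paper.

```python
from math import log2

def relative_entropy(sample_counts, q):
    """Kullback-Leibler distance of eq. (13).

    sample_counts : dict mapping each sampled string to its count
    q             : function giving the model probability of a string
                    (e.g. the trained network or the extracted
                    stochastic automaton)
    """
    total = sum(sample_counts.values())
    d = 0.0
    for w, c in sample_counts.items():
        p_w = c / total
        d += p_w * log2(p_w / q(w))
    return d
```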

4 Conclusions

The generalization power of a second-order recurrent neural network trained with stochastic samples of regular languages has been studied. It was found that clusters appear in the internal state space, corresponding to the states of the finite automaton generating the language. If the probabilities are estimated with the extracted automaton, the distance between the experimental probability distribution and the predicted one is substantially reduced. Therefore, second-order recurrent neural networks are suitable candidates for modeling stochastic processes when grammatical inference algorithms are not yet developed.

Acknowledgments

We wish to thank the Dirección General de Investigación Científica y Técnica of the Government of Spain for support through project CICYT/TIC93-0633-C02.

References

- Blair, A.D. and Pollack, J.B. (1996) "Precise Analysis of Dynamical Recognizers", Tech. Rep. CS-95-181, Computer Science Department, Brandeis University.
- Carrasco, R.C. and Oncina, J. (1994) "Learning Stochastic Regular Grammars by Means of a State Merging Method", in Grammatical Inference and Applications (R.C. Carrasco and J. Oncina, eds.), Lecture Notes in Artificial Intelligence 862, Springer-Verlag, Berlin, 139-152.
- Casey, M. (1996) "The Dynamics of Discrete-Time Computation with Application to Recurrent Neural Networks and Finite State Machine Extraction", Neural Computation, to appear.
- Castaño, M.A., Casacuberta, F. and Vidal, E. (1993) "Simulation of Stochastic Regular Grammars through Simple Recurrent Networks", in New Trends in Neural Computation (J. Mira, J. Cabestany and A. Prieto, eds.), Lecture Notes in Computer Science 686, Springer-Verlag, 210-215.
- Cover, T.M. and Thomas, J.A. (1991) Elements of Information Theory, John Wiley and Sons, New York.
- Elman, J. (1990) "Finding Structure in Time", Cognitive Science 14, 179-211.
- Forcada, M.L. and Carrasco, R.C. (1995) "Learning the Initial State of a Second-Order Recurrent Neural Network during Regular-Language Inference", Neural Computation 7, 923-930.
- Giles, C.L., Miller, C.B., Chen, D., Chen, H.H., Sun, G.Z., and Lee, Y.C. (1992a) "Learning and Extracting Finite State Automata with Second-Order Recurrent Neural Networks", Neural Computation 4, 393-405.
- Giles, C.L., Miller, C.B., Chen, D., Sun, G.Z., Chen, H.H., and Lee, Y.C. (1992b) "Extracting and Learning an Unknown Grammar with Recurrent Neural Networks", in Advances in Neural Information Processing Systems 4 (J. Moody et al., eds.), Morgan Kaufmann, San Mateo, CA, 317-324.
- Kolen, J.F. (1994) "Fool's Gold: Extracting Finite-State Automata from Recurrent Network Dynamics", in Advances in Neural Information Processing Systems 6 (C.L. Giles, S.J. Hanson and J.D. Cowan, eds.), Morgan Kaufmann, San Mateo, CA, 501-508.
- Omlin, C.W. and Giles, C.L. (1996) "Extraction of Rules from Discrete-Time Recurrent Neural Networks", Neural Networks 9(1), 41-52.
- Watrous, R.L. and Kuhn, G.M. (1992a) "Induction of Finite-State Automata Using Second-Order Recurrent Networks", in Advances in Neural Information Processing Systems 4 (J. Moody et al., eds.), Morgan Kaufmann, San Mateo, CA, 306-316.
- Watrous, R.L. and Kuhn, G.M. (1992b) "Induction of Finite-State Languages Using Second-Order Recurrent Networks", Neural Computation 4, 406-414.
- Williams, R.J. and Zipser, D. (1989) "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks", Neural Computation 1, 270-280.


Figure 3: Output probabilities after training (projections O_2 versus O_1 and O_3 versus O_1).
