Context Free Grammar Representation in Neural Networks

Whitney Tabor
Department of Psychology, Cornell University, Ithaca, NY 14853
Email: [email protected]

Category: Theory (Cognitive Science)
Preference: Oral presentation

Abstract

Neural network learning of context free languages has been applied only to very simple languages and has often made use of an external stack. Learning complex context free languages with a homogeneous neural mechanism looks like a much harder problem. The current paper takes a step toward solving this problem by analyzing context free grammar computation (without addressing learning) in a class of analog computers called Dynamical Automata, which are naturally implemented in neural networks. The result is a widely applicable method of using fractal sets to organize infinite state computations in a bounded state space. This method leads to a map of the locations of various context free grammars in the parameter space of one dynamical automaton/neural net. The map provides a global view of the parameterization problem which complements the local view of gradient descent methods.

1. Introduction

A number of researchers have studied the induction of context free grammars by neural networks. Many have used an external stack and negative evidence ((Giles et al., 1990), (Sun et al., 1990), (Das et al., 1992), (Das et al., 1993), (Mozer and Das, 1993), (Zheng et al., 1994)). Some have used more standard architectures and only positive evidence ((Wiles and Elman, 1995), (Rodriguez et al., ta)). In all cases, only very simple context free languages have been learned.

Table 1: Grammar 1.

S → A B C D    S → ε
A → a S        A → a
B → b S        B → b
C → c S        C → c
D → d S        D → d

It is desirable to be able to handle more complex languages. It is desirable to avoid the problem of choosing ungrammatical examples (negative evidence) in an unbiased way (see (Mozer and Das, 1993)). And it is desirable not to use an external stack: such a stack is not biologically motivated; it makes it harder to see the relationship between language learning nets and other, more homogeneous neural architectures; and it absolves the neural network from responsibility for the challenging part of the task (keeping track of an unbounded memory), thus making its accomplishment fairly similar to another well-studied case, the learning of finite state languages (e.g., (Zheng et al., 1994)).

This paper takes a step toward addressing these issues by providing a representational analysis of neurally implementable devices called Dynamical Automata, or DAs, which can recognize all context free languages as well as many other languages. The approach is less ambitious in the sense that learning is not attempted. On the other hand, it reveals the structural principles governing the computations of the DAs, and of the corresponding networks, for a wide range of languages. The essential principle, consistent with (Pollack, 1991)'s experiments, is that fractal sets provide a method of organizing recursive computations in a bounded state space. The networks are recurrent, use linear and threshold activation functions and gating units, but have no external stack. An analysis of the parameter space of one simple DA shows a mingling of languages from different complexity classes which is unlike anything that arises by adjusting parameters in a symbolic model, and which is more consistent with the observed range of complexities of human languages (Shieber, 1985). Moreover, this global view of the structure of the parameter space presents a contrast to the local view provided by gradient descent methods and may be useful in learning.

To be sure, previous analyses have shown how analog devices can simulate Turing machines ((Pollack, 1987), (Siegelmann and Sontag, 1991)) and even recognize non-recursively-enumerable languages ((Siegelmann, 1996), (Moore, 1996)); thus, mere proofs of computational capability at the lower, context free, level are not revealing. However, these prior analyses have focused on complexity classification and have not explored representational implications. The current work complements them by providing a parameter-space map and by probing the relevance to neural network learning.

2. An example dynamical automaton

A fractal is a set of points which is self-similar at arbitrarily small scales. Figure 1a shows a diagram of the fractal called the Sierpinski Triangle (the letter labels in the diagram will be explained presently). The Sierpinski triangle, a kind of Cantor set, is the limit of the process of successively removing the "middle quarter" of a triangle to produce three new triangles. The grammar shown in Table 1 is a context free grammar. This grammar generates strings in the standard manner ((Hopcroft and Ullman, 1979); ε denotes the empty string). Examples of strings generated by Grammar 1 are "a b c d", "a b c d a b c d", and "a b c a a b c d b c d d". The last case illustrates center-embedding. A pushdown automaton for the language of Grammar 1 would need to keep track of each "abcd" string that has been started but not completed.

Figure 1: a. An indexing scheme for selected points on the Sierpinski triangle. The points are the analogues of stack states in a pushdown automaton. By convention, each label lists more-recently-added symbols to the left of less-recently-added symbols. b. A sample trajectory of the DA described in Table 2.

For this purpose it could store a symbol corresponding to the last letter of any partially completed string on a pushdown stack. For example, if it stored the symbol "A" whenever an embedding occurred under "a", "B" for an embedding under "b", and "C" for an embedding under "c", the stack states would be members of {A, B, C}*.[1]

We can use the Sierpinski Triangle to keep track of the stack states for Grammar 1. Consider the labeled triangle in Figure 1a. Note that all the labels are at the midpoints of hypotenuses of subtriangles (e.g., the label "CB" corresponds to the point (0.125, 0.625)). The labeling scheme is organized so that each member of {A, B, C}* is the label of some midpoint (only stacks of cardinality ≤ 3 are shown). We define a DA (called "DA 1") that recognizes the language of Grammar 1 by the Input Map shown in Table 2. The essence of the DA is a two-element vector, ~z, corresponding to a position on the Sierpinski triangle. The DA functions as follows: when ~z is in the subset of the plane specified in the "Compartment" column, the possible inputs are those shown in the "Input" column. Given a compartment and a legal input for that compartment, the change in ~z that results from reading the input is shown in the "State Change" column.

[1] For Σ a set of symbols, Σ* denotes the set of all finite strings of symbols drawn from Σ.
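As a quick check on this labeling scheme, here is a short Python sketch (mine, not the paper's) that locates a stack label on the triangle by composing one affine map per symbol, starting from (1/2, 1/2). The assignment of maps to the symbols A, B, and C is inferred from Table 2 below and from the pooling functions listed at the end of Section 3; it reproduces, for example, the point (0.125, 0.625) for the label "CB".

```python
# Place a stack label on the Sierpinski triangle by composing one affine map
# per symbol (assignments inferred from Table 2 / Section 3, not quoted code).
def push_A(z): return ((z[0] + 1) / 2, z[1] / 2)   # shrink toward vertex (1, 0)
def push_B(z): return (z[0] / 2, z[1] / 2)         # shrink toward the origin
def push_C(z): return (z[0] / 2, (z[1] + 1) / 2)   # shrink toward vertex (0, 1)

PUSH = {"A": push_A, "B": push_B, "C": push_C}

def point_for(label, start=(0.5, 0.5)):
    """Labels list the most recently added symbol first, so push right to left."""
    z = start
    for symbol in reversed(label):
        z = PUSH[symbol](z)
    return z

print(point_for("CB"))   # (0.125, 0.625), the point labeled "CB" in Figure 1a
print(point_for("A"))    # (0.75, 0.25)
```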

Table 2: Dynamical Automaton (DA 1).

Compartment               Input   State Change
z1 > 1/2 and z2 < 1/2     b       ~z → ~z - (1/2, 0)
z1 < 1/2 and z2 < 1/2     c       ~z → ~z + (0, 1/2)
z1 < 1/2 and z2 > 1/2     d       ~z → 2(~z - (0, 1/2))
Any                       a       ~z → (1/2)(~z + (1, 0))

If we specify that the DA must start with ~z = (1/2, 1/2), make state changes according to the rules in Table 2 as symbols are read from an input string, and return to ~z = (1/2, 1/2) (the Final Region) when the last symbol is read, then the computer functions as a recognizer for the language of Grammar 1. To see this intuitively, note that any subsequence of the form "a b c d" invokes the identity map on ~z. Thus DA 1 is equivalent to the nested finite-state machine version of Grammar 1. For illustration, the trajectory corresponding to the string "a b c a a b c d b c d d" is shown in Figure 1b (1. a is the position after the first symbol, an a, has been processed; 2. b is the position after the second symbol, a b, has been processed, etc.). One can construct a wide variety of computing devices which organize their computations around fractals. At the heart of each fractal computer is a set of iterating functions (Tabor, sub) which have associated stable states and can be analyzed using the tools of dynamical systems theory (Barnsley, 1993). Hence the name, Dynamical Automaton.
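The input map in Table 2 translates directly into a small recognizer. The following Python sketch (mine, not the paper's) applies the guarded affine updates to ~z and accepts exactly when the state returns to (1/2, 1/2); because each "a b c d" subsequence composes to the identity map, the nested example strings from above are accepted.

```python
START = (0.5, 0.5)

def step(z, symbol):
    """Apply one Table 2 state change, or return None if the symbol is not
    legal in the current compartment."""
    z1, z2 = z
    if symbol == "a":                                 # compartment: Any
        return ((z1 + 1) / 2, z2 / 2)
    if symbol == "b" and z1 > 0.5 and z2 < 0.5:
        return (z1 - 0.5, z2)
    if symbol == "c" and z1 < 0.5 and z2 < 0.5:
        return (z1, z2 + 0.5)
    if symbol == "d" and z1 < 0.5 and z2 > 0.5:
        return (2 * z1, 2 * (z2 - 0.5))
    return None

def accepts(string):
    """DA 1 accepts iff every symbol is read legally and ~z ends at (1/2, 1/2)."""
    z = START
    for symbol in string:
        z = step(z, symbol)
        if z is None:
            return False
    return z == START                                 # the Final Region

assert accepts("abcd")
assert accepts("abcdabcd")
assert accepts("abcaabcdbcdd")                        # the center-embedded example
assert not accepts("abc") and not accepts("abdc")
```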

3. The general case and neural implementation

The method of Section 2 can be extended to languages requiring any finite number of stack alphabet symbols ((Moore, 1996), (Tabor, sub)). For an alphabet of N symbols, σ1, σ2, ..., σN, consider the functions pi : R^N → R^N defined by pi(~z) = (1/2)(~z + ~ei), where ~ei is the vector with a 1 in the ith position and 0's elsewhere. Let the starting state of the automaton be the vector in R^N with every element equal to 1/2. Then an application of pi to ~z corresponds to pushing symbol σi onto the stack of a pushdown automaton, and an application of pi^(-1) corresponds to popping σi off the stack. The compartments used in the input map are the sets pi(S), where S is the open polygon in R^N with vertices at {~e1, ..., ~eN}. To make sure the DA never tries to pop a symbol it hasn't pushed, the input map must be defined so that all moves out of compartment pi(S) always begin with an application of pi^(-1). (Tabor, sub) shows that if the pi are pooling functions on S (i.e., pi(S) ∩ pj(S) = ∅ for i ≠ j, and ∪i pi(S) ⊆ S), then every stack state corresponds to a unique point in S, provided the start state is outside of ∪i pi(S). Since the current example satisfies this condition, this fractal "memory" never confuses its histories. DAs that obey these conditions and thus emulate pushdown automata are called pushdown DAs (or PDDAs). The pooling functions for the previous example are ~z → (1/2)(~z + ~e1), ~z → (1/2)(~z + ~e2), and ~z → (1/2)~z on the open triangle with vertices at the origin, ~e1, and ~e2 in R^2.
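The push/pop correspondence can be made concrete with a short sketch (again mine, not the paper's). It uses the maps pi(~z) = (1/2)(~z + ~ei) and their inverses with the all-1/2 start state; with this start state every coordinate stays strictly between 0 and 1, so after any push exactly one coordinate exceeds 1/2, and that coordinate identifies the top of the stack (this is the role the compartments play). The whole stack can therefore be read back off the point:

```python
import numpy as np

def push(z, i):
    """p_i: push symbol i by mapping ~z to (~z + ~e_i) / 2."""
    e = np.zeros(len(z))
    e[i] = 1.0
    return (z + e) / 2.0

def pop(z, i):
    """p_i^(-1): pop symbol i by mapping ~z to 2~z - ~e_i."""
    e = np.zeros(len(z))
    e[i] = 1.0
    return 2.0 * z - e

def decode(z):
    """Recover the whole stack from the point: the unique coordinate above 1/2
    names the top symbol; pop it and repeat until the start state reappears."""
    stack = []
    while (z > 0.5).any():
        top = int(np.argmax(z))
        stack.append(top)
        z = pop(z, top)
    return stack[::-1]                  # bottom-first order

N = 4                                   # a four-symbol stack alphabet: 0, 1, 2, 3
z = np.full(N, 0.5)                     # start state: every element equal to 1/2
history = [2, 0, 0, 3, 1]               # an arbitrary sequence of pushes
for symbol in history:
    z = push(z, symbol)

assert decode(z) == history             # the point encodes the stack exactly
print(np.round(z, 5), decode(z))
```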

Table 3: Parameterized Dynamical Automaton M(mL, mR).

Compartment              Input   State Change
z1 ∈ (0, 1] and z2 = 0   l       ~z → (mL z1, z2)
z1 ∈ (0, 1) and z2 = 0   r       ~z → (mR z1, z2 + 1)
z1 ∈ (0, 1) and z2 = 1   r       ~z → (mR z1, z2)

Figure 2: M(1/2, 17/8) accepting l^3 r^3. [Trajectory plot in the (z1, z2) plane; corners at (0, 0), (1, 0), and (0, 1).]

Dynamical Automata can be implemented in neural networks by using a combination of signaling units and gating units. By a signaling unit, I mean the standard sort of unit which sends out a signal reflecting its activation state to other units it is connected to. By a gating unit, I mean a unit which serves to block or allow transmission of a signal along a connection between two other units. All units (signaling and gating) compute a weighted sum of their inputs and pass this through an activation function, either identity or a threshold.

The use of simple affine functions (~z → q~z + r) to define the state changes in a DA makes for a simple translation into a network with signaling and gating units. The coefficients q and r determine weights on connections. The connections corresponding to linear terms (e.g., q) are gated connections. The connections corresponding to constant terms (e.g., r) are standard connections. When these affine functions can be interpreted as compositions of pooling functions and their inverses, it is easy to define a PDDA's compartments neurally: a conjunction of linear separators isolates each compartment.
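To illustrate the translation, here is a sketch (not the author's implementation) of DA 1 written as one gated affine update: the linear coefficients appear as weights on connections gated by a threshold unit that computes the compartment test as a conjunction of linear separators, while the constant terms arrive on standard connections from the currently active input unit.

```python
import numpy as np

def heaviside(x):
    """Threshold activation: 1 where the weighted sum is positive, else 0."""
    return (x > 0).astype(float)

# DA 1 (Table 2), one entry per input symbol:
#   Q, r : linear and constant coefficients of the affine state change
#   C, c : the compartment test, a conjunction of linear separators
#          (the gate fires only if every component of C @ z + c is positive).
TRANSITIONS = {
    "a": (0.5 * np.eye(2), np.array([0.5, 0.0]),
          np.zeros((1, 2)), np.array([1.0])),                            # Any
    "b": (np.eye(2), np.array([-0.5, 0.0]),
          np.array([[1.0, 0.0], [0.0, -1.0]]), np.array([-0.5, 0.5])),   # z1 > 1/2, z2 < 1/2
    "c": (np.eye(2), np.array([0.0, 0.5]),
          np.array([[-1.0, 0.0], [0.0, -1.0]]), np.array([0.5, 0.5])),   # z1 < 1/2, z2 < 1/2
    "d": (2.0 * np.eye(2), np.array([0.0, -1.0]),
          np.array([[-1.0, 0.0], [0.0, 1.0]]), np.array([0.5, -0.5])),   # z1 < 1/2, z2 > 1/2
}

def step(z, symbol):
    Q, r, C, c = TRANSITIONS[symbol]
    gate = heaviside(C @ z + c).prod()   # gating unit: 1 only inside the compartment
    assert gate == 1.0, "symbol not legal in the current compartment"
    return gate * (Q @ z) + r            # gated linear terms + constant terms
                                         # (constants ride on standard connections
                                         #  from the active input unit)

z = np.array([0.5, 0.5])
for symbol in "abcaabcdbcdd":            # the center-embedded example string
    z = step(z, symbol)
print(z)                                 # [0.5 0.5]: back in the Final Region
```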

4. Navigation in dynamical automaton space

As I suggested at the beginning, one incentive for studying neural networks in a dynamical automaton setting is that DA analysis provides a more global view of parameter space than the standard gradient descent procedures. A simple case illustrates this idea. Consider the parameterized dynamical automaton M(mL, mR), which operates on the two-symbol alphabet Σ = {l, r} and has the input map shown in Table 3.

The starting point for M is the point (1, 0) and the Final Region is the set {(z1, z2) : z1 ∈ [1, ∞) and z2 = 1}. The scalars mL ("Leftward move") and mR ("Rightward move") are parameters which can be adjusted to change the language the DA recognizes. Figure 2 illustrates the operation of this dynamical automaton. When 0 < mL = mR^(-1) < 1, M recognizes the language l^n r^n. When mL ≠ mR^(-1), a variety of interesting languages result. Under every parameterization, M recognizes strings of the form l^n r^k where k is the smallest integer satisfying mL^n mR^k ≥ 1. This implies that k = [[-n log_mR mL]], where [[x]] denotes the smallest integer greater than or equal to x.

Figure 3: The bands in the space mL × mR where the simplest (two-rule) context free languages reside.

If mL ≥ 1 or mR ≤ 1, then the language of M is a finite-state language. If mL < 1 and mL is a negative integer power of mR, then M generates a context free language which can be described with two rules. For example, if mL = 1/4 and mR = 2, then k = 2n and the language of M is l^n r^(2n). This language is generated by the context free grammar {S → l r r, S → l S r r}. Non-whole-number rational relationships between mL and mR produce more complex context free languages (i.e., languages requiring more rules). Not surprisingly, irrational relationships produce non-context-free languages (Tabor, sub). Figure 3 is a map of part of the parameter space mL × mR. The curves show the points at which the simplest (2-rule) context-free grammars reside.

Although this analysis considers only a very simple case, it is interesting because it suggests a new way of looking at learning. A map of the regions in parameter space where the simplest languages of a given complexity reside may be useful as a navigational tool in the process of identifying a good model for a data stream. For example, the map may provide insight into how to steer a gradient-descent mechanism away from local minima, or how to encourage it to focus on solutions which are not unduly complex. DAs are appealingly compatible with gradient descent learning. Suppose we adopt the following simple method of generating strings with a DA: let the DA generate transitions at random according to its rules, with equal probabilities assigned to transitions out of the same compartment. Define the behavioral distance between DA M (for "Model") and DA T (for "Target") as the expected value of the distance between the output probability distributions of M and T upon reading an arbitrary symbol. Then, if the DAs' transition functions are well-behaved, the parameter-space distance shrinks continuously with the behavioral distance between M and T. This situation permits application of gradient descent learning every time a symbol is read, with only positive evidence considered. As in (Wiles and Elman, 1995) and (Rodriguez et al., ta), this makes training more like human learning of natural language and avoids the problem of selecting negative examples.
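To make the parameter-space map concrete, the following sketch (mine, not the paper's) runs M(mL, mR) with the Table 3 input map and checks the counting claim numerically: the accepted strings are l^n r^k with k the smallest integer such that mL^n mR^k ≥ 1, which for mL = 1/4, mR = 2 gives the two-rule language l^n r^(2n), and for the Figure 2 parameters M(1/2, 17/8) gives l^3 r^3.

```python
def k_smallest(n, mL, mR):
    """Smallest k with mL**n * mR**k >= 1; equals [[-n log_mR mL]] in the text."""
    k, v = 0, mL ** n
    while v < 1:
        v *= mR
        k += 1
    return k

def accepts(string, mL, mR):
    """Run M(mL, mR) with the Table 3 input map, starting from (1, 0)."""
    z1, z2 = 1.0, 0.0
    for symbol in string:
        if symbol == "l" and 0 < z1 <= 1 and z2 == 0:
            z1 = mL * z1
        elif symbol == "r" and 0 < z1 < 1 and z2 == 0:
            z1, z2 = mR * z1, 1.0
        elif symbol == "r" and 0 < z1 < 1 and z2 == 1:
            z1 = mR * z1
        else:
            return False
    return z1 >= 1 and z2 == 1                        # the Final Region

# mL = 1/4, mR = 2 lies on a two-rule band: the language is l^n r^(2n).
for n in range(1, 8):
    assert k_smallest(n, 0.25, 2.0) == 2 * n
    assert accepts("l" * n + "r" * (2 * n), 0.25, 2.0)
    assert not accepts("l" * n + "r" * (2 * n - 1), 0.25, 2.0)

# The Figure 2 parameters, M(1/2, 17/8): each l^n needs exactly k_smallest(n) r's.
for n in range(1, 6):
    k = k_smallest(n, 0.5, 17 / 8)
    assert accepts("l" * n + "r" * k, 0.5, 17 / 8)
    assert not accepts("l" * n + "r" * (k - 1), 0.5, 17 / 8)
```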

5. Conclusions

A general difficulty with applying neural networks to complex problems is that their learned representations are hard to interpret. Dynamical automaton analysis is a way of using notions from complexity theory to identify useful landmarks in the space of neural representations. Such a global perspective may be helpful in surmounting the challenges of non-toy problems.

References

Barnsley, M. ([1988] 1993). Fractals Everywhere, 2nd ed. Academic Press, Boston.

Das, S., Giles, C. L., and Sun, G. Z. (1992). Learning context-free grammars: Capabilities and limitations of neural networks with an external stack memory. In Proceedings of the 14th Annual Conference of the Cognitive Science Society, pages 791-5. Erlbaum, Hillsdale, NJ.

Das, S., Giles, C. L., and Sun, G. Z. (1993). Using prior knowledge in a NNPDA to learn context-free languages. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 65-72. Morgan Kaufmann, San Mateo, CA.

Giles, C., Sun, G., Chen, H., Lee, Y., and Chen, D. (1990). Higher order recurrent networks & grammatical inference. In Touretzky, D., editor, Advances in Neural Information Processing Systems 2, pages 380-7. Morgan Kaufmann Publishers, San Mateo, CA.

Hopcroft, J. E. and Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Menlo Park, California.

Moore, C. (1996). Dynamical recognizers: Real-time language recognition by analog computers. TR No. 96-05-023, Santa Fe Institute.

Mozer, M. C. and Das, S. (1993). A connectionist symbol manipulator that discovers the structure of context-free languages. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 863-70. Morgan Kaufmann, San Mateo, CA.

Pollack, J. B. (1987). On connectionist models of natural language processing. Ph.D. thesis, Department of Computer Science, University of Illinois.

Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7:227-252.

Rodriguez, P., Wiles, J., and Elman, J. (ta). How a recurrent neural network learns to count. Connection Science.

Shieber, S. M. (1985). Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8. Also in Savitch, W. J., et al. (eds.), The Formal Complexity of Natural Language, pp. 320-34.

Siegelmann, H. (1996). The simple dynamics of super Turing theories. Theoretical Computer Science, 168:461-472.

Siegelmann, H. T. and Sontag, E. D. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6):77-80.

Sun, G. Z., Chen, H. H., Giles, C. L., Lee, Y. C., and Chen, D. (1990). Connectionist pushdown automata that learn context-free grammars. In Caudill, M., editor, Proceedings of the International Joint Conference on Neural Networks, pages 577-580. Lawrence Erlbaum, Hillsdale, NJ.

Tabor, W. (sub). Metrical relations among analog computers. Draft version available at http://www.cs.cornell.edu/home/tabor/tabor.html.

Wiles, J. and Elman, J. (1995). Landscapes in recurrent networks. In Moore, J. D. and Lehman, J. F., editors, Proceedings of the 17th Annual Cognitive Science Conference. Lawrence Erlbaum Associates.

Zheng, Z., Goodman, R. M., and Smyth, P. (1994). Discrete recurrent neural networks for grammatical inference. IEEE Transactions on Neural Networks, 5(2):320-30.
