[To appear in Proceedings of the 38th Annual Meeting of the Chicago Linguistic Society (2003).]

Generalization from Sparse Input

Jeffrey L. Elman
University of California, San Diego

Over the past decade, there has been an increasing awareness of the extent to which the speaker/hearer's language knowledge reflects the fine details of personal experience. This awareness can be seen in theoretical linguistics (e.g., Construction Grammar, various lexicalist theories, Cognitive Grammar, among others); in psycholinguistics (e.g., the Competition Model, constraint-based, probabilistic, and connectionist models); and in computational linguistics (e.g., statistically-based approaches that rely on large-scale electronic corpora). In a curious way this reflects a pendulum swing back to older views that predate the generative era. But it is a pendulum swing with a difference. The emphasis over the past 50 years on the abstract nature of linguistic knowledge, and on ways in which linguistic generalizations often crucially refer to that abstract structure, has upped the ante for usage-based approaches. There is now a richer set of data and a more sophisticated awareness of the kinds of phenomena that want explaining.

I am a firm believer in usage-based approaches. But I believe that usage has its place. We do more than simply record our past experiences, say, in some table of frequencies or probabilities. Rather, our experience forms the basis for generalization and abstraction. So induction is the name of the game. But it is also important to recognize that induction is not unbridled or unconstrained. Indeed, decades of work in machine learning make it abundantly clear that there is no such thing as a general-purpose learning algorithm that works equally well across domains. Induction may be the name of the game, but constraints are the rules that we play by. And, enthusiast that I am, I would also be the first to acknowledge that we are only now scratching the surface in developing our understanding of how induction in language learning works.

In this paper, I would like to discuss several problems that arise as a result of the following conundrum. On the one hand, two things are clear. First, children receive massive amounts of language input. This point is made abundantly clear by the research of Huttenlocher and her colleagues (e.g., Huttenlocher, Levine, & Vevea, 1998), Hart and Risley (1995), and others. By some estimates, children may hear as many as 30 million words by age 3 (Hart & Risley, 1995). Second, children are extraordinarily conservative in their productions. They rarely venture very far beyond what they have heard others say. (This can give a misleading impression of linguistic precocity, when in fact children are simply skilled mimics.) This is not a new observation. Bowerman (1976, 1981), Braine (1963, 1976), MacWhinney (1978), and Tomasello (1992), among others, made this point many years ago, but it seems only recently—perhaps in part as a result of the recent development of large databases of child language (e.g., the CHILDES system, MacWhinney & Snow, 1985)—that the point is recognized as something that needs better understanding.

On the other hand, two other things are true. First, although the linguistic input to children is massive in terms of sheer quantity, it takes a very specific and narrow form. For example, Cameron-Faulkner, Lieven, and Tomasello (2001) studied the child-directed

speech in 12 families in Manchester, England. They found that a mere 57 item-specific lexical frames accounted for 50 percent of the input to children (17 item-specific frames accounted for about 45 percent of that input). Moreover, the 12 mothers exhibited a profile of usage that did not seem entirely typical, at least by adult standards. Interestingly, there was remarkable consistency among the 12 mothers in their pattern of usage. The upshot is that although children hear many words and many sentences, the input is extremely sparse and contains enormous gaps. There are many constructions that are simply not present in the input to young children, and many words are heard only once, or in restricted contexts. Second, although children are conservative in their productions, there obviously comes a point when they move beyond mimicry and produce utterances that depart from their literal experience. This behavior is clear evidence that children have achieved some level of abstract knowledge that frees them from the tyranny of mimicry.

Taken together, these observations lead to the following puzzle: Children hear massive input, but it is of limited form; and children are initially conservative in their productions, but ultimately are able to go beyond their input. How can this be? This is the question I propose to consider in this paper. To make the question more concrete, let me begin by giving four examples of instances in which the data might not appear to support the sort of generalizations that one assumes language learners do in fact make. These can be thought of as cases where there is either not enough information available or—as in the first example—the data that are available predict different behavior than language users actually exhibit. Then I will discuss a set of simulations with artificial neural networks that were designed to explore possible answers to these problems.

Puzzle 1. Consider the sentence in (1).

(1) I see the book near the table by the lamp and the pen.

This sentence is ambiguous with regard to the role played by the phrase "and the pen." It could be that the lamp and the pen are a constituent, and the table is by this group. Or it could be that it is the table and the pen which form the group; the book is near the table and the pen, but it is only the table that is by the lamp. Finally, it might be the book and the pen that I see; the book is near the table (which is by the lamp), and we don't know where the pen is. These three alternative readings correspond to what is called, respectively, low, mid, or high attachment, as shown in the tree diagram in Figure 1.

Confronted with such attachment ambiguities, and following a usage-based approach, one might imagine that—absent contextual information that might bias her interpretation—a listener's interpretation might reflect the frequency with which each of the three possibilities occurs in language. Thus, the relative frequency of such attachments, as estimated by consulting large samples of usage such as the Brown Corpus or the Wall Street Journal Corpus, might prove a good predictor of a comprehender's attachment preferences. But in fact Gibson and Schütze (1999) found this not to be the case. They reported that an analysis of the Brown Corpus shows low attachments to be the most frequent, mid

attachments to be next most frequent, and high attachments to be the least frequent. In a reading comprehension experiment, Gibson and Schütze presented subjects with sentences that had three potential attachment sites, and asked the subjects to indicate where they thought the "and" phrase should attach. Subject responses demonstrated attachment preferences that went in a somewhat different order than that found in the corpus analysis: low attachments were most preferred, followed by high attachments, and then mid attachments. Thus, the performance data did not precisely mirror the probabilities that are found in text. Gibson and Schütze present these data as problematic for usage-based accounts.

Figure 1. Simplified tree indicating the attachment ambiguity of the final "and NP" in (1). [Tree diagram not reproduced: the final NP may attach to NP1 (high), NP2 (mid), or NP3 (low).]

Puzzle 2. Consider now the sentence in (2).

(2) The other day I went to a restaurant and had a steak, with a plate of fresh shredded lornet on the side.

You have probably never encountered the word "lornet" before (as far as I know, this is its first appearance in print). Yet if pressed to describe "lornet", you probably have definite notions about it; and interestingly, your descriptions are quite likely to closely match those of others. Most people have a sense of its texture (more like cabbage than oatmeal), as well as its color (probably greenish, almost certainly not blue). Where does this knowledge come from? If your language experience does not include this word, can this knowledge truly come from usage? This would seem to be a clear example of our ability to go beyond our experience, although we also have strong intuitions that our tacit knowledge of lornet certainly reflects our knowledge of things that might be used in similar contexts.


Puzzle 3. One of the things young English-learning children must learn is the correct usage of productive prefixes. One prefix that is relatively common but somewhat problematic is "un-". Children must learn that if they can "do" something, they can also "undo" it. Similarly, things that can be "frozen" and "tied" can usually be "unfrozen" and "untied". What about "clench" and "grasp"? The two words have relatively similar meanings, so how is a child to know that having clenched her fist, she can unclench it, but once something has been grasped, it can only be let go (or perhaps forgotten)? This problem has been studied by a number of language acquisition researchers, and there is a particularly nice connectionist simulation of this process (Li & MacWhinney, 1996). The problem is emblematic of the more general class of problems in which a child has to learn a generalization based on examples, among which there may be gaps. That is, certain patterns may not occur in a child's input. How is she to know whether the gap is simply accidental, such that if she waits a bit longer she may yet hear the relevant examples, or whether the gap is truly systematic? In this latter case, the non-occurrence of some patterns reflects a qualification on when the generalization applies. The first two puzzles involved cases where the learner needs to go beyond the data; the example here illustrates the importance of knowing when to go beyond and when not to.

Puzzle 4. Finally, consider the sentence in (3).

(3) That woman you told me about who went to Paris and rented a house from the people you put her in touch with is quite nice.

The sentence, although long, is perfectly comprehensible. One of the things we know as users of English is that the form of the main verb must be "is" (and not, for example, "are"). George Miller and Noam Chomsky (1963) used a similar example to make the point that language learning—at least, learning the more complex aspects of syntax—could not be based on statistics. Their argument follows from the observation that if a language learner tabulates statistics over possible sequences of words, then in order to know that the number of the main verb depends on the second word in this particular sentence ("woman"), she would have had to have encountered all possible variants of this sentence containing all possible alternative words in any of the slots between "woman" and "is". Otherwise, the dependency might be between, for example, "is" and "people", or even "Paris". This argument is in fact quite sensible if what one means by "statistical learning" is the learning of statistics, i.e., recording the frequencies of patterns encountered. But as we shall see, this is an unnecessarily weak notion of learning. A much more sensible view is that language acquisition involves not the learning of statistics, but rather statistically driven learning. In other words, it's a matter of induction, not memorization.

Taken together, these four puzzles suggest that a more sophisticated view of usage-based learning is required than the simple recording of experience might suggest. The examples illustrate the following requirements:


(i) Sometimes generalizations disagree with the data.

(ii) Generalization is often fast (so-called "one-trial learning").

(iii) The data are often incomplete, and underdetermine the generalization; put in other words, the learner has to know when a gap is accidental, versus when it reflects a systematic fact about the language.

(iv) Many generalizations refer to abstract structure. Can that structure itself be induced from the input, given that the structure appears to have a hierarchical component, whereas the input is a linear string of words?

These requirements must all be resolved taking into account the conundrum posed at the outset: Children receive massive exposure to language, but it contains huge gaps; and although children are extremely conservative during their early years, rarely departing significantly beyond that which they have heard, they eventually do gain a productive command of language which allows them to produce and comprehend sentences that they have never heard before. My goal in what follows is to try to understand how this conundrum might be resolved. I will present three simulations. The first addresses requirement (i); the second addresses requirements (ii) and (iii); and the third addresses (iv).

Simple Recurrent Network

Before proceeding, it will be useful to introduce the neural network architecture that will be used in the simulations described here, the Simple Recurrent Network (SRN; Elman, 1990). This architecture (shown in Figure 2) was developed for the purpose of processing patterns consisting of sequentially ordered elements. Whereas the traditional feedforward network maps static inputs to static outputs, the SRN produces outputs that are a function both of the current input and of the SRN's own prior internal state (saved in the Context layer). This Context layer thus provides the network with a memory. That memory is not a tape recording of prior inputs, however. Instead, the memory records prior internal states. These, in turn, are themselves the results of inputs and prior internal states at previous time steps.

Figure 2. Simple recurrent network (input, hidden, context, and output layers). See text for explanation.
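To make the recurrence concrete, here is a minimal sketch of an SRN forward pass in Python. The layer sizes and random weights are placeholders rather than those of the trained models described below; training by backpropagation is discussed next.

```python
# A minimal sketch of an SRN forward pass. Layer sizes and weights are
# arbitrary placeholders, not those of the trained models described below.
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hid, n_out = 5, 20, 5                # e.g., 5-bit vectors in and out
W_xh = rng.normal(0, 0.1, (n_hid, n_in))     # input   -> hidden weights
W_ch = rng.normal(0, 0.1, (n_hid, n_hid))    # context -> hidden weights
W_hy = rng.normal(0, 0.1, (n_out, n_hid))    # hidden  -> output weights
b_h, b_y = np.zeros(n_hid), np.zeros(n_out)

def srn_forward(inputs):
    """Process a sequence one element at a time; each output depends on the
    current input and on the context layer, a copy of the previous hidden state."""
    context = np.zeros(n_hid)                # empty memory at the start
    outputs = []
    for x in inputs:
        hidden = np.tanh(W_xh @ x + W_ch @ context + b_h)
        output = 1.0 / (1.0 + np.exp(-(W_hy @ hidden + b_y)))   # sigmoid output units
        outputs.append(output)
        context = hidden                     # save the state for the next time step
    return np.array(outputs)

# Three arbitrary 5-bit vectors presented in sequence.
sequence = np.array([[0, 1, 1, 0, 1], [0, 0, 0, 0, 1], [0, 1, 1, 1, 0]], dtype=float)
print(srn_forward(sequence).shape)           # (3, 5): one prediction per time step
```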


Because all the connections between units in the network are initialized with random values, at the outset of learning the internal states contain little useful information. The backpropagation learning algorithm (Rumelhart, Hinton, & Williams, 1986) is used to adjust the weights in small increments so that the network's error (defined as the discrepancy between the network's actual output and the desired output) decreases over time. If the task being learned requires that the current output depend in some way on prior inputs, then the network will need to learn to identify that prior information, and to save it in the internal states. The problem is particularly challenging when the relevant temporal information has some abstract and non-obvious character.

Discovering the notion "word"

In Table 1 we see a sequence of inputs and, for each input, the corresponding output the network should produce. Each input consists of a vector of 0s and 1s. The inputs are presented in sequence, one after the other. An SRN similar to that shown in Figure 2 was used to process this time series. The relationship between each input and the output to be produced may seem at first arbitrary. However, there is a very straightforward relationship between the two:

Input  (gloss)    Output (gloss)
01101   m         00001   a
00001   a         01110   n
01110   n         11001   y
11001   y         11001   y
11001   y         00101   e
00101   e         00001   a
00001   a         10010   r
10010   r         10011   s
10011   s         00001   a
00001   a         00111   g
00111   g         01111   o
01111   o         00001   a
00001   a         00010   b
00010   b         01111   o
01111   o         11001   y
11001   y         00001   a
00001   a         01110   n
01110   n         00100   d
00100   d         00111   g

Table 1. Each vector in column 1 (Input) was presented in sequence (time runs along the vertical). The SRN was trained to produce the target output shown in column 2.

Each output is simply the input that the network will receive at the next time step. In other words, the task of this network is to predict what the next input will be.
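The vectors in Table 1 appear to encode each letter as its position in the alphabet written in five bits (a = 00001, b = 00010, ..., m = 01101); under that assumption, the training pairs can be reconstructed as follows.

```python
# Reconstructing the Table 1 setup, assuming each letter is coded as its
# alphabet position in 5-bit binary, and the target is simply the next letter.
import numpy as np

def letter_to_vector(ch):
    """5-bit vector for a lowercase letter: 'a' -> [0,0,0,0,1], 'm' -> [0,1,1,0,1]."""
    position = ord(ch) - ord('a') + 1
    return np.array([int(bit) for bit in format(position, '05b')], dtype=float)

text = "manyyearsagoaboyand"            # "Many years ago a boy and ...", spaces removed
vectors = np.stack([letter_to_vector(c) for c in text])

inputs  = vectors[:-1]                  # what the network sees at each step
targets = vectors[1:]                   # what it is trained to predict

print(inputs[0], "->", targets[0])      # 'm' -> 'a', the first row of Table 1
```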

If the sequence is completely random, as in the case where inputs represent flips of a coin, the task is very different. In such a case, the best strategy for the network—given its goal of minimizing prediction error—would be to predict the mean of the possible next events. Thus, if Heads are represented as 1.0 and Tails are represented as 0.0, the network's optimum prediction would be 0.5. The greater the structure in the time series (in information-theoretic terms, the lower the entropy), the greater, in principle, the network's opportunity for delivering the correct prediction.

If we look only at the binary vectors in Table 1, that structure is not very apparent. Nonetheless, there is considerable temporal structure underlying this time series. Each five-element vector represents a letter of the alphabet. The first vector, for example, represents the letter "m", the second vector the letter "a", and so on (the gloss for each vector is shown to the vector's right in Table 1). In fact, the training sequence whose beginning is shown in Table 1 came from a children's story (e.g., "Many years ago a boy and girl…"). Of course, the network has no notion that the vectors stand for letters. Indeed, it has no notion of what letters are in the first place, much less words. But there are distributional constraints on the time series. The first vector (representing the letter "m") is very unlikely to be followed by any of the vectors that represent other consonants and much more likely to be followed by vectors representing vowels. As each successive letter is processed, the constraints typically increase, so that the range of possibilities narrows as more of a word is input. One would expect, therefore, that after training the network's prediction error would decrease with successive letters of a word. Once the final letter of the word has been encountered, the range of possible next letters that might begin a new word increases, and the network's prediction error should thus increase as well.

Figure 3. The SRN's prediction error after training. Error is typically high at the onset of words, and decreases toward the end. Some sequences ("a boy") appear to be treated as fixed units.

increases. The network’s prediction error should thus increase. In fact, this is precisely the pattern that we see in Figure 3. Figure 3 shows a graph of the network’s error, with each point representing the error in predicting the letter shown in parenthesis. The error is high predicting the first input, the vector corresponding to the letter “m”. Error decreases with successive letters of the word “many”, but then increases when predicting the vector for the letter “y”, which is the initial letter of the next word “years”. If we focus only on the error maxima, we see that these are highly correlated with word onsets—even though these are not explicitly represented in the input. The network’s own performance thus provides it with a potentially very useful kind of information, viz., where the word boundaries are. In some ways, the task of the network is not unlike that of a young infant, hearing speech for the first time. The acoustic stream that an infant hears is unsegmented, with no obvious markers that indicate word boundaries. But if the infant adopts the strategy used by the network in this example, and merely attempts to anticipate what will come next, noticing when her predictions are correct, she might then learn what the implicit units are in the sound stream. It is not difficult to believe that it was exactly this sort of learning strategy that was employed by infants in the artificial grammar learning studies of Gomez and Gerkin (1999) and Saffron, Aslin, and Newport (1996). The SRN architecture has been used in a wide range of language tasks (as well as many non-language tasks). The prediction task turns out to provide a good entry point into learning the implicit lexical semantics of words in simple sentences, using distributional facts to induce the underlying category structure (Elman, 1990). Now let us see how this architecture might be applied to the problems described above. Simulation 1: Sometimes generalizations (appear to) disagree with the data In Puzzle 1, we encountered the problem illustrated by the ambiguous attachment of an “and” phrase, as shown in Example (1) and reproduced here. (1)

(1) I see the book near the table by the lamp and the pen.

The problem was that reader preferences in resolving this ambiguity seem not to reflect the statistical frequency of occurrence of the various possibilities: high, mid, and low attachment (Gibson and Schütze, 1999). In Figure 4 we see an SRN that was trained on the task of attachment resolution. The task was stripped to its bare essentials: The input to the network was a sequence of four elements: NP1, Prep NP2, Prep NP3, "and" NP4. Each constituent was input one at a time. The task of the network, on receiving the final "and" phrase, was to indicate where it attached. Training data were generated to reflect the statistics of attachment sites as estimated from the Brown Corpus.

Perusal of this corpus quickly reveals an important fact. The overall number of three-site attachment structures is not terribly large. In fact, there are far more two-site attachments (e.g., "I see the book near the table and the pen."). And there are by far many more one-site attachment structures (e.g., "I see the book and the pen."). It thus seems reasonable that the network should be trained not only on three-site attachments, with high, mid, and low attachment percentages as determined from the corpus, but also on two-site attachments (with only high and low


Figure 4. An SRN that is trained to solve the attachment task. Inputs represent a sequence of NPs and PPs, followed by a final NP. The network is trained to indicate where the final NP should be attached.

Figure 5. Percentage of attachment sites, estimated from the Brown Corpus.

attachments possible) as well as on one-site attachments (with no attachment ambiguities at all). The statistics of the training corpus are shown in Figure 5.

The result? We see in Figure 6 the network's attachment preferences for the three-site attachment sentences. Low attachments were most preferred, followed by high attachments, with mid attachments least preferred. This is precisely the pattern observed by Gibson and Schütze (1999). The network's attachment preferences thus do in fact follow in a straightforward manner from the frequency of attachment sites in the training data. The obvious and important qualification is that this arises because the network has to learn about attachment over the full range of possibilities, which includes the simpler two-site and one-site attachment structures as well. Averaged over all of these structures, low attachment, which is really simply local attachment to the most recent NP, is by far and away the most frequent outcome. High attachment, averaged over all the structures, is the next most frequent, since it occurs not only in three-site attachments but as the only other option in two- and one-site attachments. The opportunity for mid attachment arises least frequently, only in the three-site structures, and it is thus the least preferred by the network.

Figure 6. Attachment site preferences learned by the network.

The point here is neither surprising nor difficult to understand, but it is very important. Language learners, be they children or networks, are not learning facts about isolated structures. The statistics that drive learning will typically include a range of possibilities that include many related structures. The generalization to be drawn must in turn capture commonalities and facts that hold true over the larger set as well as the particulars of specific structures. One should therefore expect the outcome not to reflect the statistical patterns of specific sentence types in an overly narrow way.
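The averaging logic can be spelled out with a back-of-the-envelope calculation. The structure mixture and the within-structure proportions below are invented for illustration (they are not the Brown Corpus counts, and the one-site structures used in the actual training set are omitted for simplicity); the point is only that pooling structure types can reorder the aggregate preferences.

```python
# A toy calculation with made-up proportions (not the Brown Corpus counts):
# pooling two- and three-site structures can yield the aggregate order
# low > high > mid even though, within three-site structures alone,
# mid attachments outnumber high attachments.
structure_freq = {"2-site": 0.80, "3-site": 0.20}        # assumed mix of structures

within = {                                                # assumed proportions per type
    "2-site": {"low": 0.60, "high": 0.40},                # only two candidate sites
    "3-site": {"low": 0.50, "mid": 0.30, "high": 0.20},   # low > mid > high, as in the corpus analysis
}

overall = {"low": 0.0, "mid": 0.0, "high": 0.0}
for struct, p_struct in structure_freq.items():
    for site, p_site in within[struct].items():
        overall[site] += p_struct * p_site

print(overall)   # low ~ 0.58, high ~ 0.36, mid ~ 0.06: low > high > mid
```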

Simulation 2: Going beyond the input

Puzzles 2 and 3 appeared to illustrate the following:

(ii) Generalization is often fast (the "lornet" example).

(iii) The data are often incomplete, and underdetermine the generalization (e.g., how can a child learn that the verb "clench" takes the "un-" prefix, but not the verb "grasp", given that neither may appear in a child's input?).

The problem of going beyond the input occurs even for words that we have seen before, given that many words are likely to have been encountered in only one context (e.g., about half the lexemes in any text sample usually occur only once). Consider the specific problem of generalizing the usage of an NP to novel syntactic roles: An adult language user, having learned a novel NP that occurs in object position, will have no trouble using the same NP as a subject. What about networks? Suppose we train a network to process the following three sentences:


(4)   a. Tom loves Mary.
      b. Fred loves Sue.
      c. Ted loves Alice.

Such a network will easily generalize what it has learned to the sentences in (5).

(5)   a. Fred loves Alice.
      b. Ted loves Mary.
      c. Tom loves Sue.

That is, the network understands that words that have occurred in Subject position can fill that role in other sentences, just as Objects can replace other Objects. Unhappily, this same network will not process the sentences in (6).

(6)   a. Alice loves Fred.
      b. Mary loves Ted.
      c. Sue loves Tom.
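The failure can be made concrete with a toy tabulation, offered only as an analogy for the network's early behavior and not as its actual mechanism: recording which words have appeared in each position licenses recombination within a role, but says nothing about the other role.

```python
# A toy positional tabulation: fillers are recorded per role, so new
# Subject-Object combinations are accepted, but reversed roles are not.
training = ["Tom loves Mary", "Fred loves Sue", "Ted loves Alice"]

subjects, objects = set(), set()
for sentence in training:
    subj, verb, obj = sentence.split()
    subjects.add(subj)
    objects.add(obj)

def acceptable(sentence):
    subj, verb, obj = sentence.split()
    return subj in subjects and verb == "loves" and obj in objects

print(acceptable("Fred loves Alice"))   # True: both fillers attested in their roles
print(acceptable("Alice loves Fred"))   # False: the roles have never been seen reversed
```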

The network does not understand that any of the fillers of the Subject role can also fill the Object role (and vice versa). As adult language users, of course, we understand that this is a trivial property of NPs. Of course, comparing a network that knows only three sentences to an adult who has heard many millions of sentences may be unfair. Indeed, as pointed out earlier, an often under-appreciated fact about children is that they are for many years extremely conservative in their productions. An interesting question, thus, is what happens to a network as its language experience increases? Does this provide the basis for increasingly abstract generalizations that may ultimately allow it to "go beyond the input"?

This question was addressed in the next simulation. The network's task was to process a string of sentences, one word at a time, and predict successive words (e.g., Elman, 1990). The sentences were generated by an artificial grammar. Because the main issue addressed was the generalization of a word's usage to novel syntactic roles (in this case, Subject and Object), the grammar was kept simple, and generated only monoclausal sentences. The grammar can be visualized graphically as consisting of a number of constructions, as in Figure 7 (3 of 7 possible constructions are shown). In addition, words differed with regard to their frequency of occurrence in the grammar (probability of occurrence is shown in Figure 7 by font size; smaller fonts correspond to decreasing frequency). The fact that some words occurred only rarely made it possible to produce corpora in which a given word might occur in only one syntactic position.

A set of training corpora was constructed consisting of random sentences from this grammar and ranging in size from very small (20 sentences) to medium (1,000 sentences) to large (5,000 sentences). Although there were only 1,030 different possible sentences in this language, the low frequencies of some of the nouns meant that not all possible sentences occurred, even in reasonably large samplings (e.g., 5,000 sentences). In fact, a random sample of approximately .5 million sentences is needed to ensure (with some certainty) that all sentences would be generated. The 1,000-sentence corpus was therefore used so that the data on which the network was trained would be very gappy, in order to approximate early stages of language learning.

Figure 7. Several constructions used in the artificial grammar. A sentence was generated by choosing a construction at random, and then choosing a subject, verb, and object. Font size indicates probability of a word being selected.
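To make the sampling concrete, here is a sketch of a frequency-weighted generator of monoclausal sentences. The constructions, vocabulary, and probabilities are invented stand-ins for illustration, not the grammar actually used in the simulation; the point is simply that low-frequency words can easily fail to appear in some licensed positions in a modest sample.

```python
# A sketch of sampling a corpus from a small set of constructions, with
# frequency-weighted word choices. Everything here is an invented stand-in.
import random

random.seed(1)

HUMANS  = [("girl", 0.30), ("adult", 0.25), ("Sue", 0.25), ("Bob", 0.15), ("boy", 0.05)]
ANIMALS = [("dog", 0.50), ("cat", 0.50)]

# Each construction: (weight, verbs, subject class, object class).
CONSTRUCTIONS = [
    (0.5, ["talks-to", "gives"],   HUMANS,  HUMANS),   # communication/transfer: human subject and object
    (0.3, ["chases", "terrifies"], ANIMALS, HUMANS),   # these verbs take animal subjects
    (0.2, ["sees", "likes"],       HUMANS,  ANIMALS),
]

def weighted_choice(pairs):
    words, weights = zip(*pairs)
    return random.choices(words, weights=weights, k=1)[0]

def sample_sentence():
    _, verbs, subj_class, obj_class = random.choices(
        CONSTRUCTIONS, weights=[c[0] for c in CONSTRUCTIONS], k=1)[0]
    return f"{weighted_choice(subj_class)} {random.choice(verbs)} {weighted_choice(obj_class)}"

corpus = [sample_sentence() for _ in range(1000)]

# With rare nouns, a licensed combination (e.g., "boy" as object) may be missing:
print(sum(s.endswith("boy") for s in corpus), "sentences with 'boy' as object")
```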

In deliberately constructing a training set for the network which is sparse in this manner, we are able to ask under what conditions the network generalizes across gaps (or doesn't). For example, in the network's artificial language, verbs of communication require human subjects and direct objects. Thus, any of "girl", "boy", "adult", "Sue", or "Bob" must serve as the subject and direct object of the verb "talk-to." However, with small corpora, particularly given that not all of these words occur equally often, it is possible that one or more may never appear in the training set in one of these roles. In fact, it was easy to find a corpus of 1,000 sentences in which "boy" never appears at all in direct object position (for any verb). The question is then: never having seen "boy" as a direct object (although it does occur in the corpus as subject), will the network be unable to predict "boy" as a possible direct object following the verb "talk-to"?

Perhaps surprisingly, the network does predict "boy" in this context. Figure 8 shows the activation of the word nodes for "boy", as well as the mean activation for all Humans, Animals, and all Inanimates. The prediction of "boy" is similar to that of other Humans, although it never appears in this position in the corpus. Why does this occur? The answer is quite straightforward. In this simulation, the network sees only a fraction of the possible sentences. But importantly, although "boy" is never seen in direct object position, it is seen in other contexts in which the other Human nouns occur. For example, Humans (but not Inanimates or Animals) appear as the subject of verbs such as "give", "transfer", etc. Conversely, Humans (including "boy") do not appear as subjects of other verbs (e.g., "terrify", "chase", which in this language require Animal subjects). The word "boy"


shares more in common with other Human words than it does with non-human nouns, or with verbs. In networks, as for humans, similarity is the motive force which drives generalization. Similarity can be a matter of form ("who you look like") or behavior ("who you hang out with"). In this simulation, words were encoded with localist representations, so there was no form-based similarity. But as we have just seen, there were behavior-based similarities between "boy" and other Human nouns. These more abstract similarities are typically captured in the internal representations that networks construct on their hidden layers, and they are what facilitate generalization. The overall behaviors which "boy" shares with "girl", "Sue", "Bob", etc., are sufficient to cause the network to develop an internal representation for "boy" which closely resembles that of the other Human words. This can be seen in Figure 9. In this figure, the internal representation of each word is compared to every other word in the lexicon, and the similarity (measured as Euclidean distance in the hidden unit vector space) is used to construct a hierarchical clustering tree. Lexical items that have similar internal representations cluster close together in the tree.

Figure 8. Generalization performance of the network at various stages in training (mean activation of "boy", Humans, Animals, and Inanimates when predicting the object of "Girl talks-to…", at 1k, 5k, and 10k training trials). Early in training (1,000 and 5,000 trials), the network does not predict "boy" as a potential object in this construction. After 10,000 trials, "boy" is predicted, despite never occurring as an object in any construction the network has seen.


Figure 9. Hierarchical clustering of internal representations of lexical items. Items with similar internal representations appear close in the tree.
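A tree of this kind can be computed directly from the per-word hidden-layer vectors. The sketch below uses random stand-in vectors (the trained network's actual states are not reproduced here) and assumes scipy is available.

```python
# Build a hierarchical clustering tree over per-word hidden-layer vectors,
# using Euclidean distance, as in Figure 9. The vectors here are random
# stand-ins for the representations a trained SRN would produce.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
words = ["boy", "girl", "Sue", "Bob", "dog", "cat", "book", "rock"]
hidden_vectors = rng.normal(size=(len(words), 20))    # one vector per lexical item

tree = linkage(hidden_vectors, method="average", metric="euclidean")
dendrogram(tree, labels=words, no_plot=True)          # set no_plot=False to draw it
```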

The internal representations for those other words (i.e., Humans other than "boy") must reflect the possibility of their appearing in direct object position following communication verbs (since the network does see many of them occurring in that position). Since the representation for "boy" is similar—because "boy" behaves similarly in other positions—"boy" inherits the same behavior. The network's knowledge about what "boy" can do is very much affected by what other similar words can do.

Importantly, if the examples are too scant, such generalizations are not made. With a very small corpus, the pattern of interlocking relationships that motivates the abstract categories is not revealed. We see this at the earlier stages of training, in Figure 8. At 1,000 and 5,000 trials, the network does not predict "boy" as a possible object. The network behaves in a very conservative manner, and only generalizes the use of "boy" to novel syntactic roles after 10,000 training trials. This is very much in line with what Tomasello (among others) has noted with children. Categories such as "noun" and "verb" do not start out as primitives; rather, they are accreted over time, and at intermediate stages different words are more or less assimilated to what will become adult categories (Olguin & Tomasello, 1993; Tomasello & Olguin, 1993). Early behavior is conservative, and the general category only emerges with time.

There is a flip side to this coin, which is that sometimes gaps reflect systematic limitations on a generalization. For instance, the fact that "ungrasp" is not a possible word (although "unclench" is just fine), or that even though both "the ice melted" and "she melted the ice" are acceptable, one can say only "the ice disappeared" and not "she disappeared the ice", are examples of gaps in the input that are not accidental but systematic—even if exactly what is systematic about the gap is not obvious. Will the network always generalize through gaps? The answer is no. Such generalizations depend on the relative amount of data and experience which are available. If the word "boy" appears overall with low probability, but there are sufficient other examples to warrant the inference that "boy" has properties similar to other words, the network will generalize to "boy" what it knows about the other words. However, if "boy" is a frequently occurring item, except in one context, the network is less likely to infer that the gap is accidental. It is as if the network realizes that the gap is not due to incomplete data (because the word is very frequent) and so must be the result of a systematic property of the word.

We see this phenomenon in Figure 10. Here the network's predictions at 1,000, 5,000, and 10,000 training trials (as already shown in Figure 8) are reproduced along with the predictions after 80,000 training trials (bars are rescaled to accommodate the large activation values at 80,000 trials). With sufficient experience (10,000 trials) the network is prepared to generalize its knowledge that "boy" appears to be a member of the Human category (suggested by the internal representations shown in Figure 9). But if the gap in usage—i.e., the non-occurrence of "boy" in object position—persists over a long period of time, the network retreats from this generalization.
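The underlying logic can be illustrated with a toy calculation (illustrative only, and not the network's actual mechanism): if "boy" is otherwise frequent, the probability that its absence from object position is a sampling accident shrinks rapidly as exposure grows.

```python
# Assumed per-sentence chance of seeing "boy" as an object, were the gap accidental.
p_boy_as_object = 0.01

for n_sentences in (100, 1_000, 10_000, 80_000):
    p_accidental_gap = (1 - p_boy_as_object) ** n_sentences
    print(f"{n_sentences:>6} sentences: P(gap by chance) = {p_accidental_gap:.2e}")
```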

Figure 10. Generalization performance of the network at various stages in training (mean activation when predicting the object of "Girl talks-to…", at 1k, 5k, 10k, and 80k training trials). Early conservatism (at 1,000 and 5,000 trials) is followed by generalization (at 10,000 trials). With additional training in which the "boy"-as-object gap persists (80,000 trials), the network retreats from its generalizations and assumes the gap is systematic.


At 80,000 training trials, the network no longer predicts "boy". Instead, "boy" is treated as an exception to the "human" category, and is not expected to appear as an object.

Simulation 3: Learning abstract structure

The examples to this point have all involved relatively simple grammatical constructions. Many grammatical generalizations cannot be couched in terms of simple sequential order, however. The agreement in number between a subject and its verb, for example, must be observed even in the presence of other subjects and verbs that may intervene (as occurs when the subject NP is head of a relative clause in which a different NP is subject). Can this abstract structure be learned, given only the surface strings of a language as evidence?

Some have suggested the answer to this question is No. In a provocatively titled paper, 'Language acquisition in the absence of experience', Crain (1991) presents what he calls the "parade case" of an innate constraint. The example involves the so-called "aux-inversion" associated with Yes/No questions, as shown in (7) and (8).

(7)   a. The boy is crazy.
      b. Is the boy crazy?

(8)   a. The girl can smoke.
      b. Can the girl smoke?

Children presumably learn there is some association between the simple declaratives in (7a) and (8a) and the corresponding questions in (7b) and (8b). The simple examples shown here invite the generalization that the question form of the declarative involves inverting the subject noun with the auxiliary that follows. (Whether or not actual movement is involved is unimportant; what is relevant is the observation about which elements in the declarative correspond to which elements in the question.) Schematically, one might imagine a child forming the generalization shown in (9).

(9)   DECLARATIVE:   NOUN   AUX    X
      QUESTION:      AUX    NOUN   X

The problem is that this generalization fails in the case of sentences in which the subject NP heads a relative clause that also contains an AUX, as shown in (10).

(10)  a. The boy who is smoking is crazy.
      b. *Is the boy who __ smoking is crazy?
      c. Is the boy who is smoking __ crazy?

Application of the generalization in (9) yields the ungrammatical (10b). The true generalization, reflected in (10c), depends on an abstract representation involving constituent structure, and the realization that the AUX that participates in the question formation is in the matrix clause, not the subordinate relative clause. Crain's argument is that children never encounter the sort of evidence provided by examples such as (10c).

Therefore, the fact that they are relatively error-free for these sentences suggests that their performance is based on innate knowledge. As Crain puts it, "every child comes to know facts about the language for which there is no decisive evidence from the environment. In some cases, there appears to be no evidence at all…these observations invite the inference that constraints are innately specified. In a nutshell, the argument is based on the 'poverty of the stimulus.'" (Crain, 1991).

The notion of 'poverty of the stimulus' depends on a notion of what the relevant stimuli are that is itself rather limited. The assumption is that the only relevant evidence—in this case, for the abstract version of AUX inversion—would be provided by sentences such as (10c). Let us call this "direct positive evidence." And indeed, examination of the Manchester Corpus (Theakston, Lieven, Pine, & Rowland, 2000) in the CHILDES database (MacWhinney & Snow, 1985) suggests that there is in fact little, if any, direct positive evidence of the sort required. But might the child encounter other evidence that motivates the notion of constituency and abstract structure? We know that SRNs are capable of learning to represent abstract structure of the sort required for recursion (Elman, 1993; Rodriguez, 1999; Rodriguez & Elman, 1999; Rodriguez, Wiles, & Elman, 1999). Does the input to the child contain sufficient evidence for such structure, and if so, might it be possible to extrapolate the use of such structure to AUX questions?

Lewis and Elman (2001) carried out a set of simulations designed to answer exactly this question. We assumed that the child is exposed to sentences of the form shown in (11).

(11)  a. The women smoke cigarettes.
      b. The boy is funny.
      c. Can you read this?
      d. Is she friendly?
      e. The girl with the cat is nice.
      f. The toy with the wheels is mine.
      g. The man who came late is your uncle.
      h. I like the teacher who speaks Spanish.
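For concreteness, here is a sketch of how a corpus restricted to these sentence types might be assembled. The templates, vocabulary, and proportions are invented stand-ins, not the grammar that Lewis and Elman derived from the Manchester data; the essential property is simply that complex questions never occur.

```python
# A sketch of a training corpus built only from sentence types like those in (11).
# All templates, words, and weights are invented for illustration; the key point
# is that complex questions (e.g., "is the boy who is funny crazy ?") are excluded.
import random

random.seed(0)

NOUNS = ["boy", "girl", "teacher", "man"]
ADJS  = ["funny", "nice", "crazy", "friendly"]
VERBS = ["smokes", "reads", "came late", "speaks Spanish"]

def simple_declarative():
    return f"the {random.choice(NOUNS)} {random.choice(VERBS)} ."

def copular_declarative():
    return f"the {random.choice(NOUNS)} is {random.choice(ADJS)} ."

def simple_question():
    return f"is the {random.choice(NOUNS)} {random.choice(ADJS)} ?"

def complex_declarative():
    # A relative clause inside the subject NP: indirect evidence for constituency.
    return (f"the {random.choice(NOUNS)} who is {random.choice(ADJS)} "
            f"is {random.choice(ADJS)} .")

TEMPLATES = [simple_declarative, copular_declarative, simple_question, complex_declarative]
WEIGHTS   = [0.3, 0.3, 0.2, 0.2]                     # assumed proportions

corpus = [random.choices(TEMPLATES, weights=WEIGHTS, k=1)[0]() for _ in range(2000)]

# Complex questions never occur in the training data:
print(any(s.startswith("is the") and " who " in s for s in corpus))   # False
```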

We assumed further that children do not hear questions of the form shown in (10c); these are precisely the question forms that are at issue. We then generated an artificial corpus containing such sentence types, using frequency of occurrence as measured in the Manchester Corpus, converted it to vector form, and then trained an SRN to process the sentences. Processing in this case involved predicting, at each point in the sentence, what word(s) might come next. Because the sentences are generated by a non-deterministic grammar, precise predictions are not possible. Instead, what the network must learn to do is predict the grammatically correct possibilities. After training, we can then probe the network's knowledge by presenting it, for the first time, with the problematic question in (10c) to see what it expects. There will be two points that are particularly revealing, shown in (12).

(12)  Is the boy who [1] is smoking [2] crazy?


At [1] (i.e., after having processed the word "who"), the assumption that follows from Crain (1991) is that the network—presumably lacking an innate constraint on structural knowledge—will fall into the trap of believing that questions follow the template shown in (9). Thus, the network would incorrectly predict that the initial "Is" fills a gap that occurs at [1], and would thus expect the next word not to be an AUX, but any other element that would normally follow the AUX, e.g., "smoking". The grammatically correct prediction, on the other hand, would be to expect either an AUX or a verb, since the initial "Is" cannot come from the embedded clause. At [2], on the other hand, if Crain's analysis is correct, the network should expect an AUX (as in "Is the boy who smoking is crazy?"). The grammatically correct prediction would be to expect an adjective or participle (i.e., elements that would normally follow the gapped AUX). Note that such sequences of AUX-participle or AUX-adjective never occur in the training sequences, so the grammatically correct prediction in fact involves violating sequential patterns in the training set. The network's predictions at each point in its processing of the probe sentence (12) are shown in Figure 11.

Figure 11. Predictions after each input. The first prediction is for the initial word of the sentence. See text for explanation of the significance of predictions at [1] and [2].

At probe point [1] the network has processed the input "Is the boy who…". The prediction is for either an intransitive or transitive verb in the singular, or for a singular AUX. The network does not, in other words, appear to believe that the initial "Is" fills a gap at this point. Rather, the network seems to understand that the initial "Is" corresponds

to a gap position following "smoking" (probe point [2]). At the second probe point, rather than expecting an AUX—in conformity with the sentences it has encountered in the training data (e.g., "The boy who is smoking is crazy")—the network expects an ADJ or participle, which in fact is what occurs next.

As mentioned above, we know from previous simulations that SRNs are able to infer abstract grammatical structure of the sort needed to deal with agreement relations in complex sentences (Elman, 1993). We have some understanding of the mechanisms involved, and we know that under idealized conditions (given infinite precision weights and activations), SRNs can process infinite-depth recursion (Rodriguez, 1999; Rodriguez & Elman, 1999). (The point to be borne in mind, obviously, is that these results hold for imaginary machines; more interesting is to see the ways in which performance degrades in the face of physical limitations of the sort present in brains.) So we should not be surprised that the network in this case has learned abstract structure. What is interesting is that it has been able to use the evidence for such structure in the service of a sentence type—a Yes/No question with a complex subject—that it has never encountered before.

What is that evidence? Consider the types of sentences used in training, of the sort shown in (11), reproduced here.

(11)  a. The women smoke cigarettes.
      b. The boy is funny.
      c. Can you read this?
      d. Is she friendly?
      e. The girl with the cat is nice.
      f. The toy with the wheels is mine.
      g. The man who came late is your uncle.
      h. I like the teacher who speaks Spanish.

What can be learned from these examples? (11a) and (11b) are used by the network to learn about grammatical categories, simple subject-verb agreement, and subcategorization and selectional restrictions. Sentences (11c) and (11d) provide evidence for question formation in simple sentences. Sentences (11e-11h) teach the network about more complex sentences, and in particular, about NP constituents. In (11f), for example, the network learns that agreement is between the NP head ("toy") and its verb, not the more proximal NP ("wheels"). These latter sentences are crucial for learning about abstract structure, albeit in contexts different from AUX questions.

Schematically, we can think of these sentences as evidence for different generalizations, as shown in Figure 12. Four different generalizations can be inferred from the different sentence types: (1) how to form simple sentences, (2) how to form complex sentences, and how to deal with NP constituents, (3) how to form declaratives, and (4) how to form questions. The generalizations are orthogonal to each other, but have the potential to interact. Thus sentence (11c) is an example of a simple sentence, and also of a question. Not all interactions need be present in the data. There are no examples which illustrate questions involving subject NPs that are relativized and contain AUXes (this is the shaded quadrant in Figure 12). That is, there is no direct positive evidence for such sentences. The other sentence types, however, support generalizations that interact in

ways that logically imply the missing sentences. These other sentences thus constitute another important source of evidence: indirect positive evidence. Those sentences in effect conspire to support the complex generalization that the correct question is “Is the boy who is smoking crazy?” and not “*Is the boy who smoking is crazy?”

Figure 12. The four generalizations that can be inferred from the different types of sentences present in the training data. [Quadrant layout: SIMPLE DECLARATIVES ("The women smoke cigarettes.", "The boy is funny."); SIMPLE QUESTIONS ("Can you read this?", "Is she friendly?"); COMPLEX DECLARATIVES ("The toy with the wheels is mine.", "The man who came late is your uncle."); COMPLEX QUESTIONS: the shaded quadrant, with no direct positive evidence.] The generalizations potentially interact, and so although there is no direct positive evidence for the sentence type in the shaded quadrant, these sentences (e.g., "Is the boy who is smoking crazy?") are logically implied.

Discussion and conclusion

We began with a set of puzzles, loosely derived from the observation that although children have massive experience with language, there are huge gaps in the input. As much as children hear, there is vastly more that they do not hear. Furthermore, much of what they hear occurs in idiosyncratic environments, and a crucial challenge is knowing when and how to generalize beyond the input. Generalization is tricky, because some gaps are accidental and others are systematic. The problem is complicated by the fact that many generalizations reflect abstract structure that may not be obviously apparent from just a few examples. Confronted with such treacherous waters, children appear to proceed cautiously. They are for a long time extremely conservative, and rarely venture too far beyond the safe harbor of mimicry. Of course, ultimately they gain both the experience and the communicative needs that permit them to generalize beyond the input. Put in more traditional terms, they learn the grammar of their language.

The simulations described here involve simplified models. The networks are not children, and the tasks (explicit attachment; prediction) are superficial compared to the deeper problem of language understanding. There is obviously no guarantee that the networks' solutions are those that are adopted by children. On the other hand, the networks' success in providing reasonable solutions to the various puzzles they were confronted with suggests that statistically-based learning may


be more powerful than was once (e.g., Miller & Chomsky, 1963) imagined. The lessons learned include the following.

(1) Usage-based learning does not necessarily involve only memorization of specific patterns of usage (though that may occur too). The distinction between learning statistics and statistically based learning is an important one. The networks in the simulations described here used experience as grist for induction.

(2) The problem of gaps in the input is a serious one, but may not present an insurmountable difficulty for induction. The strategy employed by the networks here was to begin by hewing close to the observed data until there was sufficient evidence for abstractions that might justify generalization through a gap. Such generalizations are sensitive to the expected occurrence of items, however, such that a persistent gap—at a point later in experience where the base frequency of an item suggests it should have been encountered—may ultimately be treated as a systematic exception to the generalization.

(3) Many generalizations do indeed rest on abstract patterns that are not explicitly marked in the surface patterns of language. Taken together, however, those surface patterns may provide evidence of the abstract underlying structures that are required for correct generalizations. We have long known that such inferences are possible using explicit hypothesis-testing (after all, this is precisely what linguists do for a living). What is perhaps unexpected is the finding—exemplified in the final simulation—that abstract structure can be inferred by mechanisms that rely on implicit learning.

Admittedly, the present work barely touches the tip of the iceberg. One important issue not explored here is the question of what constraints are necessary for learning to be successful. Earlier work has suggested that either ordering the input or limiting initial resources may be important to learning complex linguistic structure (Elman, 1993). The findings described earlier, by Cameron-Faulkner et al. (2001), that child-directed speech occupies a narrow and possibly non-arbitrary range of language may reflect an adjustment of the input that facilitates acquisition (but of course, other explanations are also possible; it may be that the limited nature of the input arises because the pragmatics of parent/child interactions are specialized and limited, vis-à-vis the full range of communicative possibilities). Then there is the question of computational constraints. The networks described in the present work all involve a specific architecture, and it is clear that architectural constraints also have crucial computational consequences.

Finally, the child herself brings a great deal to the table that is simply not taken into account by the present models. The child has an agenda: Certain topics are more important, and certain needs are more pressing. The network, on the other hand, is a passive observer with no particular goals or drives. Modeling the ontogeny (or perhaps, phylogeny) of such goals and drives is itself an interesting endeavor (cf. Nolfi, Elman, & Parisi, 1994), but there is little work to date that tries to unite that work with language acquisition. Doing so in a non-trivial way is not an easy endeavor. Nonetheless, I conclude with a positive and optimistic outlook. I doubt that we will ever have models that learn language in exactly the same way that children do. But I also do not believe the "strong version" of artificial intelligence (building a machine that effectively replaces a human) is a reasonable goal. What is more likely, and what will be an exciting and useful result, is developing models that help us understand in detail the mechanisms used by children as they go about the business of learning language. I hope the present work moves us in that direction.


References

Bowerman, M. (1976). Semantic factors in the acquisition of rules for word use and sentence construction. In D. Morehead and A. Morehead (Eds.), Normal and Deficient Child Language. Baltimore: University Park Press.

Bowerman, M. (1981). The child's expression of meaning: Expanding relations among lexicon, syntax, and morphology. In H. Winitz (Ed.), Native Language and Foreign Language Acquisition. New York: New York Academy of Science. Pp. 172-189.

Braine, M.D.S. (1963). The ontogeny of the English phrase structure: The first phrase. Language, 39, 1-14.

Braine, M.D.S. (1976). Children's first word combinations. Monographs of the Society for Research in Child Development, No. 41.

Cameron-Faulkner, T., Lieven, E., and Tomasello, M. (2001). A construction based analysis of child directed speech. Forthcoming.

Crain, S. (1991). Language acquisition in the absence of experience. Behavioral and Brain Sciences, 14, 597-650.

Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.

Elman, J.L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195-225.

Elman, J.L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71-99.

Gibson, E. and Schütze, C.T. (1999). Disambiguation preferences in noun phrase conjunction do not mirror corpus frequency. Journal of Memory and Language, 40, 263-279.

Gomez, R. and Gerken, L. (1999). Artificial grammar learning by one-year-olds leads to specific and abstract knowledge. Cognition, 70, 109-135.

Hart, B. and Risley, T. (1995). Meaningful Differences in the Everyday Experiences of Young Children. Baltimore, MD: Paul H. Brookes.

Huttenlocher, J., Levine, S., and Vevea, J. (1998). Environmental input and cognitive growth: A study using time-period comparisons. Child Development, 69, 1012-1029.

Lewis, J.D. and Elman, J.L. (2001). A connectionist investigation of linguistic arguments from the poverty of the stimulus: Learning the unlearnable. In J.D. Moore and K. Stenning (Eds.), Proceedings of the Twenty-Third Annual Conference of the Cognitive Science Society. Mahwah, NJ: Lawrence Erlbaum.

Li, P. and MacWhinney, B. (1996). Cryptotype, overgeneralization and competition: A connectionist model of the learning of English reversible prefixes. Connection Science, 8, 3-30.

MacWhinney, B. (1978). The acquisition of morphology. Monographs of the Society for Research in Child Development, No. 43.

MacWhinney, B. and Snow, C. (1985). The child language data exchange system. Journal of Child Language, 12, 271-296.


Miller, G.A. and Chomsky, N. (1963). Finitary models of language users. In R.D. Luce, R.R. Bush, and E. Galanter (Eds.), Handbook of Mathematical Psychology, Vol. 2. New York: Wiley. Pp. 419-491.

Nolfi, S., Elman, J.L., and Parisi, D. (1994). Learning and evolution in neural networks. Adaptive Behavior, 3(1), 5-28.

Olguin, R. and Tomasello, M. (1993). Twenty-five-month-old children do not have a grammatical category of verb. Cognitive Development, 8, 245-272.

Rodriguez, P. (1999). Mathematical Foundations of Recurrent Neural Networks in Language Processing. Ph.D. thesis, Department of Cognitive Science, University of California, San Diego.

Rodriguez, P. and Elman, J.L. (1999). Watching the transients: Viewing a simple recurrent network as a limited counter. Behaviormetrika, 26, 51-74.

Rodriguez, P., Wiles, J., and Elman, J.L. (1999). A recurrent neural network that learns to count. Connection Science, 11, 5-40.

Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.

Saffran, J., Aslin, R., and Newport, E. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926-1928.

Theakston, A., Lieven, E., Pine, J., and Rowland, C. (2000). The role of performance limitations in the acquisition of 'mixed' verb-argument structure at stage 1. In M. Perkins and S. Howard (Eds.), New Directions in Language Development and Disorders. Plenum.

Tomasello, M. (1992). First verbs: A case study of early grammatical development. New York: Cambridge University Press.

Tomasello, M. and Olguin, R. (1993). Twenty-three-month-old children have a grammatical category of noun. Cognitive Development, 8, 451-464.

