Language and Cognitive Processes, 1995, 10 (6), 601-630
Default Generalization in Connectionist Networks
Mary Hare
Center for Research in Language, University of California, San Diego
[email protected]

Jeffrey L. Elman
Department of Cognitive Science, University of California, San Diego
[email protected]

Kim G. Daugherty
Hughes Aircraft Company and Department of Computer Science, University of Southern California
[email protected]
Abstract

A potential problem for connectionist accounts of inflectional morphology is the need to learn a ‘default’ inflection (Prasada & Pinker, 1993). The early connectionist work of Rumelhart & McClelland (1986) might be interpreted as suggesting that a network can learn to treat a given inflection as the ‘elsewhere’ case only if it applies to a much larger class of items than any other inflection. This claim is true of the Rumelhart & McClelland (1986) model, which was a two-layer network subject to the computational limitations on networks of that class (Minsky & Papert, 1969). However, it does not generalize to current models, which have available more sophisticated architectures and learning algorithms. In the current paper we explain the basis of the distinction, and demonstrate that given more appropriate architectural assumptions, connectionist models are perfectly capable of learning a default category and generalizing as required, even in the absence of superior type frequency.
1.0 Introduction

Over the past few years there has been an ongoing debate in the literature about the mechanism or mechanisms required to account for the productive use of language, and in a recent article Prasada and Pinker (1993) widened the discussion by considering the basis of ‘default’ inflectional categories. The concern of that article was the potentially crucial distinction between ‘regular’ and ‘irregular’ inflection in English verbal morphology. As is well-known, the great majority of English verbs mark their past tense through the regular and productive process of adding the suffix -d to the verb stem. But the language also contains approximately 160 verbs that form their past tense in some other way. This irregular inflection has interesting properties that suggest it may be qualitatively different from the regular. For one, the survival of irregular verb inflections is highly sensitive to
frequency. As one example of this, Bybee (1985) examines three classes of Old English ablaut verbs, and shows a systematic difference over time such that low frequency class members have regularized, while high frequency members have tended to remain irregular into the modern language. The regular inflection, by contrast, is insensitive to frequency effects. Seidenberg and Bruck (1990) showed that the latency to produce a past tense verb in response to the stem is affected by the frequency of the past tense form if the verb is irregular, but there is no difference if the verb is regular. There are also apparent differences between regulars and irregulars in their degree of sensitivity to phonological information. For example, Bybee and Moder (1983) showed that in tests requiring subjects to inflect nonce verbs, the more the novel item phonologically resembles a known irregular, the more likely it is to receive irregular inflection. Regular inflection, by contrast, does not appear to be influenced by phonological similarity in this way. Instead, it applies to any verb stem, regardless of the verb’s phonological shape, unless it is superseded by a more specific past tense form. For this reason the regular past is considered the default inflection. It applies to borrowings and new words (‘The cop quickly mirandized the suspect’), even words that do not fit the normal structure of English words, like xeroxed. This is not to say that the regular past tense is entirely insensitive to phonological form, for Seidenberg and Bruck also found that the latency to generate a regular past tense is related to the consistency of that verb’s phonological neighborhood: regular verbs with irregular neighbors have reliably longer latencies than regulars whose neighbors are regular as well. But what must be explained is why the regular past tense is treated as the default, and often applies to novel verbs even when they do not closely resemble any existing regular verb.
There are at least two current accounts in the developmental and modeling literature of how to account for the facts of English verb inflection. One, associated largely with Pinker and colleagues (Pinker and Prince, 1988; Kim et al., 1991; Prasada, Pinker, and Snyder, 1990), takes the difference in behavior between regular and irregular inflection to reflect an underlying distinction in the mechanisms by which they are produced. Regular inflection, on this account, is the product of a symbolic system using rules in the traditional sense. This account handles the question of the default by assuming a rule of the sort ‘Concatenate affix -d onto [VERB]’ where [VERB] is an abstract symbol. Phonological information about the specific verb is not available, and hence can play no role in the application of the rule. A contrasting view, first taken by Rumelhart & McClelland (1986), and more recently by Plunkett and Marchman (1991, 1993) and Daugherty and Seidenberg (1992, in press), assumes that both regular and irregular inflection arise from the same mechanism, a single connectionist network, with the differences in behavior attributable to other factors. The Rumelhart and McClelland (1986) model appeared to successfully treat the regular past tense as the default, inflecting 66% of new verbs this way in generalization tests. Prasada and Pinker (1993) argue that this behavior is an artifact of the English data set, since 95% of the verbs of modern English are regular (Daugherty and Seidenberg, in press).
They then ran a series of tests to show that the model was unable to perform adequately under a wider variety of circumstances. In this study Prasada and Pinker first tested human subjects on a version of the Bybee and Moder (1983) experiment, and replicated the earlier results: if a novel verb strongly resembled a known irregular, it was likely to receive irregular inflection. Thus spling could be inflected as splung on analogy to string - strung. But as the similarity between the novel verbs and known irregulars decreased, the tendency to inflect the verbs irregularly decreased as well. Prasada and Pinker then ran a similar task using regular verbs, and found no such gradation of response. Instead, subjects were equally willing to inflect the novel verbs with the regular suffix when those verbs were similar to existing regulars (e.g. gloke or proke), dissimilar to regulars (e.g. smaig), or even if they contained final consonant clusters that disobeyed the phonotactic constraints on English verbs (ploamph). The authors then attempted to replicate these results with the Rumelhart and McClelland model, and found that in the model, regular verbs were treated very much like irregulars in this respect. That is, although the model was able to generalize the regular inflection to novel verbs that resembled members of the training set, it could not consistently do so as the test patterns decreased in similarity to training exemplars. This, of course, is not how the human subjects behaved. Why would the model behave this way? According to Prasada and Pinker, a connectionist network is able to generalize only on the basis of pattern similarity. Now, if this were true, a novel item could be treated as a member of a default class only if it resembled a previously learned member of that class.
As a result, in order to achieve what looks like ‘default’ behavior, the network would need a learned default class that is extremely well-populated, for it would have to cover the entire phonological space of the language in order to ensure that all novel items would fall sufficiently close to a training example. The Rumelhart and McClelland model is based on the statistics of English, where the regular past tense vastly outnumbers the irregular pasts. It was this difference in class size, the argument goes, that allowed the model to achieve the success it had in generalizing new members to the regular class. But while in English the regular past is both the default and the statistically dominant class, it would be a mistake to confuse the two, for there are languages with clear default classes that are no larger than the non-defaults. Clearly, if statistical superiority were necessary for the network, then the network would not offer an adequate account of inflectional behavior in natural language. Prasada and Pinker suggest that this is true, and that since the Rumelhart and McClelland model cannot account for true default inflection, no connectionist network can. But this conclusion is based on misconceptions of both network dynamics and the structure of the data in natural language. In the current paper, we will explain the basis of the misconceptions, and argue that given more appropriate assumptions, connectionist models are perfectly capable of learning a default category and generalizing as required, even in the absence of superior type frequency.
2.0 Network dynamics

2.1 The “tyranny of surface similarity”

The Prasada and Pinker claims are stated in reference to Rumelhart & McClelland (1986), and they are manifestly true of that model. Why should that be? The explanation lies in the structure of the network. The Rumelhart & McClelland model used a simple, two-layer network, in which a layer of input units connected directly to a layer of output units. Such two-layer networks are subject to well-known limitations. In an early analysis of these mechanisms, Rosenblatt (1962) demonstrated that if a set of connection weights exists that satisfies the constraints of a problem the perceptron is set to solve, then a learning algorithm, the perceptron convergence procedure, will lead the network to converge on that state. Unfortunately, the range of problems that such networks can solve is limited in ways that severely restrict their use as cognitive models. This limitation was first pointed out by Minsky and Papert (1969), who showed that if a set of input-output pairs is not linearly separable, the perceptron convergence procedure will not find the correct solution. The classic example of this situation is the exclusive-or (XOR) problem. In this problem, a network is required to respond yes if either one or the other of two input bits is on, but no if neither is on, or if both are. The four possible combinations of two input bits, and the correct response, are given in Table 1.

TABLE 1. Input-output combinations for the XOR problem. When either (but not both) of the input bits is on, the network must respond yes (1). Otherwise, the response must be no (0).

INPUT   OUTPUT
0 0     0
0 1     1
1 0     1
1 1     0
Note that the patterns 0 0 and 1 1 are most dissimilar on the input, yet must receive the same response on the output. In the same way, the two patterns that must receive a 1 response, 0 1 and 1 0, also show no similarity. The problem becomes even clearer if the patterns are presented visually. Figure 1 graphs the four input pairs, with input 1 on the x-axis, input 2 on the y-axis. In order to solve the problem, a network must be able to draw a line that separates the two classes of input, so that the patterns requiring a 1 response fall on one side of the line, those requiring 0 fall on the other. As the figure shows, there is no way this can be done if all that is considered is the relationship between input and output.
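The impossibility of a linear solution can be checked directly. The following sketch (our own illustration in Python, not part of the original article) scans a grid of candidate weights and biases for a single linear threshold unit and confirms that none of them classifies all four XOR patterns correctly:

```python
# Illustrative sketch (ours, not the article's): brute-force confirmation
# that no single linear threshold unit computes XOR.  We scan a grid of
# weights and biases and check whether any choice classifies all four
# input patterns correctly.
import itertools

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def solves_xor(w1, w2, b):
    # two-layer net, no hidden units: respond 1 iff w1*x1 + w2*x2 + b > 0
    return all((w1 * x1 + w2 * x2 + b > 0) == bool(y)
               for (x1, x2), y in XOR.items())

grid = [i / 2 for i in range(-10, 11)]  # candidate values -5.0 .. 5.0
found = any(solves_xor(w1, w2, b)
            for w1, w2, b in itertools.product(grid, repeat=3))
print(found)  # → False: no linear decision boundary exists
```

A finer grid or an unbounded search would fare no better: as Minsky and Papert's analysis shows, no real-valued weights exist for this problem.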
[Figure 1 appears here: three panels (a), (b), and (c), each plotting the four input patterns 0,0 / 0,1 / 1,0 / 1,1 against the two input dimensions.]

FIGURE 1. XOR graph. Geometric representation of the XOR problem. If the four input patterns are represented as vectors in two-dimensional space, the problem is to find a set of weights which implements a linear decision boundary for the output unit. In (a), the boundary implements logical AND. In (b), it implements logical OR. There is no linear function which will simultaneously place 00 and 11 on one side of the boundary, and 01 and 10 on the other, as required for (c).
2.2 Solution to the similarity problem

This is not to say that the XOR problem is unsolvable, for in their critique of perceptrons Minsky and Papert also showed that the addition of an internal layer of processing units eliminated the difficulty (see also McClelland and Rumelhart, 1986). This ‘hidden’ unit layer receives activation from the input, and passes activation on to the output. In doing so, the hidden units are able to re-represent the input patterns, altering their similarity structure in a way that allows the network to reach the correct solution. Figure 2 gives an example of the internal representation of the four XOR patterns learned by a network with two hidden units. As the figure shows, the input patterns are now grouped in a way that correctly partitions the pattern space. Although Minsky and Papert were well aware of this solution, at the time there was no known learning algorithm comparable to the perceptron convergence procedure which could be applied to multi-layer networks of the sort needed to solve non-linearly separable problems. In recent years, however, a number of such learning algorithms have been developed; perhaps the best known is the backpropagation of error algorithm (Rumelhart, Hinton, and Williams, 1986). These developments allow contemporary network models to escape many of the limitations of the two-layer network. As a result Minsky and Papert’s conclusions about the tyranny of surface similarity no longer apply to connectionist models as a class. And, since Prasada and Pinker’s claims are a restatement of the same objection, these also hold true only of earlier, two-layer models and do not generalize to multilayer networks. In conclusion, Prasada and Pinker’s claims are true of the Rumelhart and McClelland model, which is the model they tested.
But while this model was a remarkable contribution to the field, advances in learning theory have made it obsolete in certain respects, and its shortcomings do not carry over to the more sophisticated architectures that have since been developed. In the second half of this paper we will demonstrate the validity of the distinction, using networks that involve an intermediate layer of processing units. Before moving on to the model, however, we will first set out the facts about the data.
[Figure 2 appears here: three panels showing the four patterns in (a) input space, (b) hidden unit space, and (c) the one-dimensional output space.]

FIGURE 2. XOR hidden unit graph. Transformation in representation of the four input patterns for XOR. In (a) the similarity structure (spatial position) of the inputs is determined by the form of the inputs. In (b) the hidden units “fold” this representational space such that two of the originally dissimilar inputs (1,1 and 0,0) are now close in the hidden unit space; this makes it possible to linearly separate the two categories. In (c) we see the output unit’s final categorization (because there is a single output unit, the inputs are represented in a one-dimensional space). Arrows from (a) to (b) and from (b) to (c) represent the transformation effected by input-to-hidden weights, and hidden-to-output weights, respectively.
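The effect of the hidden layer can be demonstrated in a minimal simulation (our own sketch, not the article's simulations): a network with one hidden layer, trained by backpropagation, reduces its error on the XOR patterns, which no two-layer network can do. The hidden-unit count, learning rate, and number of epochs here are arbitrary choices for illustration.

```python
# Minimal sketch (our illustration, not the paper's simulations): a network
# with one hidden layer, trained by backpropagation, reduces its error on
# the XOR patterns.  Hidden-unit count, learning rate, and epochs are
# arbitrary choices for this demonstration.
import math
import random

random.seed(1)
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
H = 3  # hidden units

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

# input->hidden weights (two inputs plus bias), hidden->output plus bias
w_ih = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(H)]
w_ho = [random.uniform(-1, 1) for _ in range(H + 1)]

def forward(x):
    h = [sig(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_ih]
    o = sig(sum(w_ho[j] * h[j] for j in range(H)) + w_ho[H])
    return h, o

def total_error():
    return sum((forward(x)[1] - y) ** 2 for x, y in DATA)

err_before = total_error()
lr = 0.5
for _ in range(5000):
    for x, y in DATA:
        h, o = forward(x)
        d_o = (o - y) * o * (1 - o)                  # delta at the output unit
        for j in range(H):
            d_h = d_o * w_ho[j] * h[j] * (1 - h[j])  # delta at hidden unit j
            w_ho[j] -= lr * d_o * h[j]
            w_ih[j][0] -= lr * d_h * x[0]
            w_ih[j][1] -= lr * d_h * x[1]
            w_ih[j][2] -= lr * d_h               # hidden bias
        w_ho[H] -= lr * d_o                      # output bias
err_after = total_error()
print(err_before > err_after)
```

The key point is architectural: the error gradient propagates through the hidden layer, allowing the hidden units to re-represent the inputs as in panel (b) of Figure 2.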
3.0 The nature of a low-frequency default

The most commonly cited example of a low-frequency default class is that of the Arabic noun plural. In that language there are a number of ‘broken’ plural classes. Together these greatly outnumber the ‘sound’ plural, which applies to only about 9% of the nouns in the language (K. Plunkett, pers. comm.). Yet the sound plural behaves like the default plural marker, applying to borrowings and other new nouns (McCarthy and Prince, 1990; Bybee, 1994). There are strong constraints on the form of the nouns in each broken plural class, and in their discussion of the Arabic plural, Plunkett and Marchman (1991) suggest that this is the significant fact that permits the default to develop as it has. For each of the broken plural classes there is a phonological template that all members must fit, while the sound plural, the default, is made up of nouns whose forms do not fit those templates. No examples have been found of a low frequency default class in a language where each inflectional class is made up of phonologically arbitrary collections of verbs or nouns. Interestingly, the same conditions appear to be required for the learning of a low-frequency default in a connectionist network. Prasada and Pinker considered the results of several network models and showed that these nets, trained on data sets which were not appropriately structured, did not exhibit anything resembling default generalization. Plunkett and Marchman (1991) also reported that their network had difficulties learning the regular inflection when the data set contained a large number of irregular verbs whose form was not readily distinguishable from the form of the regulars. On the other hand, Daugherty and Seidenberg (1992), by using a realistic corpus of English verbs, achieved 94% accuracy on generalization to novel regulars. These examples show that networks, like real language, do not inevitably develop a default category.
On the other hand, when the data are structured as in real language examples, networks can learn a default class and generalize appropriately.
4.0 Background to the model

4.1 Data

The discussion in the literature has largely focussed on the Arabic noun plural, but this is not the only existing example of a low-frequency default. What is required is the ability to account for the existence of a class of possible cases. In fact, earlier stages of English are very likely to have shown exactly this sort of default category in the verb inflectional system (see Hare & Elman, in press, for a fuller discussion of the issues raised here). This is so for the following reasons. ‘Strong’ inflection, which marked tense and other inflectional categories by changing the vowel of the verb stem, was the dominant form in Indo-European. As the Germanic languages developed
they began to use a suffix to mark the past tense on derived verbs. This suffixed past was an innovation, and so necessarily went through a period when it applied to a statistically smaller group than the well-entrenched strong classes. Over time, though, the situation reversed. By Old English the suffixed ‘weak’ past, the ancestor of the modern regular past, accounted for approximately 75% of all verbs (Quirk and Wrenn, 1975), and in the modern language it has grown to the point where only about 160 irregular examples like sing-sang-sung remain strong. The reason for the growth of the ‘weak’ inflection is clear: even before the Old English period, at a time when it had limited type frequency, this was the productive form, applying to derived and borrowed words. We believe that if a network cannot solve the logical problem posed by these facts, this lessens its viability as a valid model of the synchronic English inflectional system. That is, we adopt what might be called the Consistency Hypothesis. We require that the mechanisms which underlie synchronic processing at least be consistent with diachronic changes, and preferably also give insight into those changes. For this reason we will use data based on these facts in showing how the default is handled by a network. The organization of the modeling section is the following. We will begin with a categorization task, since this responds to the general claim that a network cannot learn default categorization without the aid of statistical superiority. The output of this model is localist, and so offers a clear response that is straightforward to analyze. Having established the general behavior of the model with this task, we will then replicate the first set of results with a more complex and linguistically realistic task, in which the model must learn to categorize implicitly by applying the correct inflection to phonologically specified verb stems.
The goal of the second simulation is to show that the behavior of the network is the same across the two models, and so success in the categorization task is not due to the simplicity of the design.
5.0 Categorization task

5.1 Introduction

In a discussion of the Arabic plural system, where the default plural is of relatively low type and token frequency, Plunkett and Marchman (1991) suggest that the crucial fact is that ‘the numerous exceptions to the default mapping... tend to be clustered around sets of relatively well-defined features.’ The inflectional situation in early stages of English was parallel in many respects. As discussed in more detail above, the ‘weak’ or suffixed past tense was treated as the default inflection even at a point when it cannot have exhibited high type or token frequency relative to the ‘strong’ or ablaut classes. Furthermore, the strong inflection classes were made up of verbs that exhibited clear phonological coherence. In OE, each strong class had its own vowel series (by definition) and other phonological features by which the classes could be distinguished. The weak verbs as a whole had no such criteria for membership.
Our hypothesis is that this phonological information is what allows the default/non-default distinction to be learned. That is, in learning to produce the strong classes, a network naturally learns to respond to the relevant phonological characteristics that are cues to each class. If a network is taught a series of such mappings, it will generalize on the basis of the shared regularities. Novel patterns lacking those regularities cannot be adopted into any phonologically defined class. If, however, the inflectional system also includes a class whose membership is not keyed to phonological generalizations, this class will become the productive one since it is capable of accepting members that do not fit elsewhere. To test this hypothesis we taught a network a simple categorization task, described in the next section.
5.2 Model

5.2.1 Architecture

The model used a feed-forward architecture with 50 input, 18 hidden, and 6 output units. Inputs to the model were 50-element vectors, each of which represented a word. A subpart of each word was a particular VC or VCC pattern, defined over distinctive features. On the output layer there were six nodes which were localist representations of six inflectional classes. The task of the model was to learn to respond to each input pattern by activating the appropriate category label on the output.
[Figure 3 appears here: 6 output units (category nodes 1-6) above 18 hidden units above 50 input units; the input layer comprises three 6-unit phonological feature fields and a 32-unit localist ‘lexical’ part.]
FIGURE 3. Architecture of the model used in the categorization task.
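A sketch of the forward pass for this architecture (50 input, 18 hidden, 6 localist output units, the dimensions reported above) follows. The sigmoid activation and the random weights are our own placeholders rather than the model's trained values, so the sketch illustrates only how an input vector is mapped to a most-active category node:

```python
# Sketch of the categorization network's forward pass, using the dimensions
# reported in the paper (50 input, 18 hidden, 6 localist output units).
# The sigmoid activation and random weights are our placeholders, not
# trained values; only the mapping from input vector to most-active
# category node is illustrated.
import math
import random

random.seed(0)
N_IN, N_HID, N_OUT = 50, 18, 6

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

w1 = [[random.gauss(0, 0.1) for _ in range(N_IN)] for _ in range(N_HID)]
b1 = [0.0] * N_HID
w2 = [[random.gauss(0, 0.1) for _ in range(N_HID)] for _ in range(N_OUT)]
b2 = [0.0] * N_OUT

def forward(x):
    h = [sig(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    return [sig(sum(w * hj for w, hj in zip(row, h)) + b)
            for row, b in zip(w2, b2)]

word = [random.choice([0.0, 1.0]) for _ in range(N_IN)]  # one 50-bit input
out = forward(word)
winner = out.index(max(out))  # most active node = predicted class (0-5)
print(len(out), winner)
```

With training, the six output activations come to behave as the localist category labels described in the text, and the most active node is taken as the network's classification.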
5.2.2 Stimuli

Input stimuli were divided into categories on the basis of the VC or VCC string that was taken to be the ‘rime’ of each word. For the first five categories, this was a version of the class characteristic by which the OE strong verb classes were distinguished. In the network, these were the following:

(1) i + any one consonant (cf. dri:fan ‘drive’)
(2) e + one stop or fricative (cf. wefan ‘weave’)
(3) e + a consonant cluster (cf. helpan ‘help’)
(4) i + a nasal+stop cluster (cf. bindan ‘bind’)
(5) a + one consonant (cf. scacan ‘shake’)
The sixth class, which will be treated as the default, contained items with any other VC or VCC string. As a second way of measuring the structure of the 6 classes, we computed the average distance between members internal to each class, where ‘distance’ is defined as the Euclidean distance between the vectors representing each item. The averages, given below, show that classes 1-5 exhibited relatively high internal consistency compared with class 6.
TABLE 2. Average internal distance, Classes 1-6

Class   Euclidean distance
1       1.2
2       1.2
3       1.3
4       1.1
5       1.2
6       1.7
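A consistency measure of this kind can be computed as the mean pairwise Euclidean distance within a class. In the sketch below (ours, not the article's code), the 50-element vectors are random stand-ins for the model's feature encodings; the tight cluster plays the role of classes 1-5, the diffuse one the role of class 6:

```python
# How a consistency measure like Table 2's can be computed: mean pairwise
# Euclidean distance within a class.  The 50-element vectors here are
# random stand-ins for the model's feature encodings; the tight cluster
# plays the role of classes 1-5, the diffuse one the role of class 6.
import math
import random
from itertools import combinations

random.seed(0)

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_internal_distance(cls):
    pairs = list(combinations(cls, 2))
    return sum(euclid(u, v) for u, v in pairs) / len(pairs)

tight = [[random.gauss(0, 0.2) for _ in range(50)] for _ in range(32)]
diffuse = [[random.gauss(0, 0.6) for _ in range(50)] for _ in range(32)]
print(mean_internal_distance(tight) < mean_internal_distance(diffuse))  # → True
```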
Each of the six classes had 32 members. These were randomly generated, with the constraint that the sixth class contain only a minority of the possible forms for that class, and that members of the class be broadly distributed through the input space and not sample the entire phonological space of this language. The reason for this was simple. The goal was to show that the network would learn to treat the sixth class as a default by reason of its phonological diversity. However, if the diversity allowed the class to span the entire phonological space of the language, then generalization to new patterns need not result from anything other than similarity to a learned class member. The sixth class was limited to prevent this from occurring.
5.2.3 Training

The network was trained using the backpropagation of error learning algorithm (Rumelhart, Hinton, & Williams, 1986). Each of the 32 members of each class was presented once per pass through the data set. The network was trained for 20 such passes, after which error on the training set was extremely low. The trained network was then tested to see how it would generalize to novel data.
5.3 Results of generalization tests

As a first generalization test, an additional 32 randomly-generated members of the phonologically predictable classes 1-5 were given to the fully trained net. All were categorized correctly into the class whose members they resembled. The network was then presented with 95 new patterns that did not match any subtype seen in training, and with VC or VCC rimes not found in members of classes 1-5. Twenty of these patterns generalized into one of these 5 classes nonetheless, arguably because of their resemblance to the learned patterns. A more precise analysis of the effects of pattern similarity will be given shortly, but for now note that eight of the twenty ‘irregularizations’ contained the string æ + C, which differed in only one feature from the a + C characteristic of Class 5, and these patterns most strongly activated the category node for that class. A further 12 patterns contained the string i + CC. This shared features of Class 3 (whose characteristic was e + CC) and Class 4 (characteristic i + NC). These activated both of the relevant class nodes to an intermediate degree, as would be expected in a system of this sort: because the network is driven to reduce overall error, the best response of the network in an ambiguous situation is to activate the nodes corresponding to both responses, to a degree that reflects the relative probability of each being correct. The other 75 patterns were placed in Class 6. The results so far are consistent with the claim that the network generalizes membership in classes 1-5 on the basis of similarity to previously learned patterns. However, it does not yet rule out the possibility that membership in Class 6 (our default) is based on similarity to known forms as well. If the network is treating Class 6 as a true default, this should not be true: Class 6 should accept any pattern that is not sufficiently close to exemplars of classes 1-5, regardless of similarity.
To demonstrate this asymmetry, we ran two further tests. In the first, we used the results of the first two test sets to replicate the experimental results of Bybee and Moder (1983) and Prasada and Pinker (1993). This was done in the following way. We first calculated the distance between each test item and all members of each training category. Distance is again defined here as the Euclidean distance between the vectors which represent the test and training items. Based on this calculation, we divided the test items into three groups: those with a distance of ≤1.0 from a member of a training class (the prototypical test group), those with a distance of 1.4-1.7 (the intermediate group) and those with a distance ≥1.7 (the distant test group). We then measured the activation level of the category node activated in response to each test item, to determine the effects of distance on the network’s ability to generalize.
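The grouping procedure can be sketched as follows (our own illustration): each test item is assigned to a band by its minimum Euclidean distance to any training item, using the thresholds just described. The two-element vectors are toy examples, not the simulation's 50-element stimuli:

```python
# Sketch of the similarity grouping: each test item is assigned to a band
# by its minimum Euclidean distance to any training item, using the
# thresholds reported above (<= 1.0 prototypical, 1.4-1.7 intermediate,
# >= 1.7 distant).  The two-element vectors are toy examples, not the
# simulation's 50-element stimuli.
import math

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_distance(item, training_set):
    return min(euclid(item, t) for t in training_set)

def group(d):
    if d <= 1.0:
        return "prototypical"
    if d >= 1.7:
        return "distant"
    return "intermediate"

train = [[0.0, 0.0], [1.0, 0.0]]
tests = [[0.5, 0.0], [0.0, 1.5], [3.0, 3.0]]
labels = [group(nearest_distance(x, train)) for x in tests]
print(labels)  # → ['prototypical', 'intermediate', 'distant']
```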
This analysis showed two interesting results. In the first place, there was a strong effect of distance for classes 1-5, the non-default classes, but no such effect for class 6. A novel item strongly activated the category node for classes 1-5 only if it was a prototypical match to a training example in that class. As the match became more distant, two things happened: generalization to that class became less common, and when it did occur, the activation on the relevant category node was much lower. For class 6, on the other hand, the tendency to generalize did not decrease, and category activation remained high regardless of the distance between training and test examples. Table 3 gives the average activations over the most strongly activated node for prototypical and distant matches to both the non-default and default classes.

TABLE 3. Average activation of the most strongly activated category node (possible range 0-1) for test items depending on their degree of match to learned category members.

              prototypical      intermediate     distant
Classes 1-5   0.989 (n = 160)   0.357 (n = 16)   0.384 (n = 1)
Class 6       0.914 (n = 9)     0.903 (n = 37)   0.973 (n = 33)
Note that as the distance between test item and training exemplars increased to 1.7 or higher, there was almost no generalization to classes 1-5. This was not the case for class 6, which received new members even if those items had a closer match in some other category, so long as that closer match was too distant to exert an analogical pull. As a final illustration of this point we constructed a new generalization set of five patterns containing consonant or vowel sequences that did not exist in the training language. The average distance between these patterns and the closest training item was over 2.0. If the network were generalizing entirely on the basis of surface similarity, it would not be able to respond to these patterns at all. By contrast, if it has learned to treat Class 6 as a default class, it should put such deviant patterns into 6. This was in fact the case: when this test set was presented to the trained network, all patterns were classified as members of class 6. Table 4 gives the response of the network for each example in this set.

TABLE 4. Response of the network to the deviant novel test set. Possible activation on each node ranges from 0 (off) to 1 (on).

Pattern            Class 1   Class 2   Class 3   Class 4   Class 5   Class 6
æCg                0.0       0.0       0.05      0.0       0.0       0.95
no vowel           0.0       0.0       0.0       0.0       0.0       0.99
diphthong          0.0       0.0       0.0       0.0       0.04      0.99
nasal vowel        0.0       0.0       0.0       0.0       0.05      0.99
voiceless vowel    0.0       0.0       0.0       0.0       0.0       0.99
These results are similar to the experimental results of Prasada and Pinker (1993), who found that the tendency to give a novel verb an irregular inflection varied as a function of the distance between the novel item and a known irregular, but found no such differences for pseudo-regular novel items.
5.4 Discussion

The pattern of results suggests that while resemblance to a learned class member was crucial for generalization to Classes 1-5, it was irrelevant for generalization to Class 6. This asymmetry is consistent with the claim that Class 6 exhibits true default behavior. That is, Class 6 serves as an attractor for novel items even when those items do not bear form resemblance to existing members of this category. These results show that under the appropriate circumstances connectionist models are perfectly capable of learning a default category, and generalizing appropriately to novel data. One might argue, though, that the task facing this network was an overly simple one, and the successful results of this model may have been due to that simplicity. In that case the model would be incapable of being extended to the more detailed data sets required of an actual model of verb production. There are several reasons why this may have been so. On the input layer, the part of the phonological string that was intended as the basis for generalization was specified for each of the five predictable classes. The output of the model was also simplified. Since the model was compelled to accept one of the six categories offered, it was not given the opportunity of a null response or an entirely novel response to a test item. This was largely due to the fact that the problem was posed as an explicit categorization task, instead of requiring the network to categorize implicitly by producing correctly inflected forms. In the next section, then, we respond to these objections by replicating the earlier results with a more complex task. This second model was presented with the phonological form of a verb as input, and was taught to produce the correctly inflected past tense as output. In the generalization tests the choice of inflection is taken as an indication of how the network categorizes the novel verbs.
The results of this second model closely resemble those of the first, showing that the simplifying assumptions of the first model were not responsible for that model’s performance.
6.0 Inflection task 6.1 Introduction The underlying assumptions with which we approach the model are the same as in the first simulation. The network is expected to generalize novel items to the phonologically predictable classes on the basis of similarity to learned exemplars. That is, the model should extract the relevant generalizations about the structure of the phonologically defined classes, and inflect novel items in the same way if they fit those generalizations. In addition, test items that are a close but not exact fit to a given non-default class should be placed in that class, and items that match the characteristics of two or more of the predictable classes equally well should be inflected in a way that reflects that ambiguity. It is also expected that the model will not be bound by similarity to learned patterns when generalizing membership to the default class. Test items that differ sufficiently from the training exemplars of the defined classes should be placed in the default class, regardless of whether they match any member of that class.
6.2 The Model 6.2.1 Stimuli As in the first model, training stimuli were roughly based on Old English verb classes. This is historically inaccurate as an example of a minority default, for by early Old English (EOE) the equivalent of the modern regular verbs already vastly outnumbered any irregular classes. Nonetheless it is a plausible data set to use, because the phonological structure of the vowel change classes in EOE was a holdover from the structure of Proto-Germanic, when the suffixed past was apparently treated as the default despite being an innovative form with relatively few exemplars.

There were 150 items in the training set, divided into 6 classes of 25 members each. Five of the six classes were phonologically specified. These took specific stem-vowel / coda consonant combinations, and formed the past tense by changing the vowel of the present tense stem. The structure of each of these classes, and the vowel change involved, is given in Table 5. Note that the past tense vowel is predictable from the stem rime (vowel + coda), but not from the stem vowel alone, since in four of the five classes the stem vowel points to two different responses.
TABLE 5. Training classes for the inflectional model

Class   Stem vowel   Coda                                      Past tense vowel   Example
1       i            stop consonant: {p, t, k, b, d, g}        a                  bid → bad
2       e            fricative: {z, v, s, f}                   i                  rez → riz
3       e            r + fricative: {rz, rs, rf}               æ                  kers → kærs
4       i            nasal + consonant: {nt, nd, mb, Nk, Ng}   u                  sing → sung
5       a            stop: {k, d, g}                           o                  rag → rog
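The class structure in Table 5 lends itself to a simple generative recipe. The sketch below shows one way the vowel-change training items could be constructed; the class templates are taken from Table 5, but the function name and the sampling details are our own illustration, not the authors' original procedure (N stands for engma).

```python
import random

# Class templates from Table 5: (stem vowel, coda options, past-tense vowel).
CLASSES = {
    1: ("i", ["p", "t", "k", "b", "d", "g"], "a"),
    2: ("e", ["z", "v", "s", "f"], "i"),
    3: ("e", ["rz", "rs", "rf"], "æ"),
    4: ("i", ["nt", "nd", "mb", "Nk", "Ng"], "u"),
    5: ("a", ["k", "d", "g"], "o"),
}
ONSETS = list("sbpdtgkfvzlrmn")  # onsets are assigned pseudo-randomly

def make_class_items(class_id, n=25, rng=random):
    """Generate n (stem, past) pairs for one vowel-change class."""
    stem_vowel, codas, past_vowel = CLASSES[class_id]
    items = []
    for _ in range(n):
        onset = rng.choice(ONSETS)
        coda = rng.choice(codas)
        stem = onset + stem_vowel + coda
        past = onset + past_vowel + coda   # past tense = vowel change only
        items.append((stem, past))
    return items
```

Class 6 would instead pair each stem with stem + -ed, with no restriction on vowel or coda.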
In addition to the 125 vowel-change verbs, the data set included 25 verbs that took the regular past tense suffix -ed. This is the ‘default’ class. There are no phonological restrictions on verbs in this class; they can take any stem vowel, including vowels used by verbs in the other classes, and may take coda consonants that fit the characteristics of one of the vowel change classes as well. Note that Class 6 contains only one-sixth of the verbs in the training set, and does not cover the phonological space of the language. Nonetheless, test items that do not sufficiently fit the characteristics of the other five classes are predicted to be treated as members of this sixth class. Each of the 150 training items was pseudo-randomly assigned an onset consonant.

6.2.2 Architecture The problem was modeled using a network with two components, shown in Figure 4. The lower component consisted of a feedforward network which learned the mapping from present tense to past tense. The past tense outputs consisted of phonemes which were represented probabilistically. These outputs were then processed by a clean-up network which converted the probabilistic activations into discrete phonemes. In the lower network, the input units represented the phonological form of the present tense verb. The representation used was that of Plaut and McClelland (1993), which relies on phonotactics to determine positional information. Briefly, an English syllable can be broken down into onset, nucleus, and coda, and the order of phonemes within onset and coda is constrained by their sonority. Sonority (a measure of perceptual salience) in an English syllable rises to a peak at the vowel, then tapers off. Thus if the two consonants t and r occur in the onset of an English syllable, their order must be tr-, with the more sonorous r occurring closer to the syllable peak. If the same two consonants occur in the coda, on the other hand, their order must be reversed (-rt). Given this, if the representation includes individual sets of units for phonemes occurring in onset, vowel, or coda position, the order of occurrence within each position is determined by sonority.1
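A minimal sketch of such a slot-based encoding, assuming the phoneme inventories of Table 6 and ignoring the diaphone units discussed in the footnote; the helper name and the parsing logic are our own simplification, not the Plaut and McClelland implementation.

```python
# Slot-based phonological encoding in the spirit of Plaut & McClelland (1993):
# one unit per phoneme per position (onset, nucleus, coda). Order within a
# position is recoverable from sonority, so no serial-order units are needed.
# 'N' stands for engma; inventories follow Table 6.
ONSET = list("sbpdtgkfvzlrmn")                       # 14 units
NUCLEUS = ["a", "æ", "e", "i", "o", "u"]             # 6 units
CODA = list("rlmnNbdgsfvptkz") + ["ps", "ks", "ts"]  # 15 phonemes + 3 diaphones

def encode(syllable):
    """Map a simple CVC(C) syllable onto a 38-unit binary vector."""
    i = next(k for k, ch in enumerate(syllable) if ch in NUCLEUS)
    onset, nucleus, coda = syllable[:i], syllable[i], syllable[i + 1:]
    vec = [0] * (len(ONSET) + len(NUCLEUS) + len(CODA))
    for ch in onset:
        vec[ONSET.index(ch)] = 1
    vec[len(ONSET) + NUCLEUS.index(nucleus)] = 1
    for ch in coda:
        vec[len(ONSET) + len(NUCLEUS) + CODA.index(ch)] = 1
    return vec
```

Because each position has its own bank of units, bid and sing activate non-overlapping onset units but share the nucleus unit for i.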
[Figure 4 shows the two components: a feedforward network whose onset, nucleus, and coda input units feed a hidden layer of 30 units and an output layer of onset, nucleus, and coda phoneme units plus the -ed suffix unit; the stem and inflected nucleus activations are then passed to the clean-up network.]

FIGURE 4. Architecture of the model used for the inflectional task.
1. There are minor exceptions to the sonority gradient in English, most notably the ability of -s to occur either before or after a voiceless stop in the coda (e.g. crisp vs. clips). These are handled as exceptions in this representation by adding the diaphone units ps, ts, and ks. When these are active the two corresponding phoneme units are interpreted as occurring in that order rather than in the expected order s + stop.
In our data set there were 14 consonants that could occur in onset position, 6 vowels, and 15 consonants plus three diaphones that could occur in the coda, for a total of 38 input units (see Table 6).
TABLE 6. Phonological representation for stem and past tense verbs

onset   s b p d t g k f v z l r m n
vowel   a æ(a) e i o u
coda    r l m n N(b) b d g ps ks ts s f v p t k z -ed

a. the low front vowel of hat.
b. engma, as in the final sound of swing.
These input units connected to a bank of 30 hidden units, and directly to the output layer of the feedforward component. The output consisted of 39 units; these were the same 38 phonemes as in the input, plus an additional unit to represent the suffix -ed. At each training iteration, the phonemic representation of a single present tense verb was activated over the input units and the task of the network was to produce the phonemic form of its inflected past tense over the output units.

If each item is a clear member of a single class, the response is straightforward to interpret. However, much as in the first simulation, when an item can plausibly be inflected in two different ways the interpretation is less clear. For the reasons stated in section 5.3, when the choice of inflection is ambiguous both inflected forms will be activated to some extent, though one of the two will have a higher level of activation. There are two drawbacks to taking this output as the final response of the model. On the one hand, if the activation of both forms is taken into account, the response will have the flavor of a blend. And while English speakers occasionally do produce blends of regular and irregular past tense inflections (Bybee and Moder, 1983), the phenomenon is rare. On the other hand, if only the activation of the strongest pattern is counted, the weaker response will be ignored altogether. This is also intuitively incorrect: when a verb can plausibly be inflected in two different ways speakers may produce only one form, but are generally aware of the alternative. What is needed is a way of taking the deterministic output produced in response to each input and interpreting it as the probability that the input verb will take a given past tense inflection. This was the motivation for the upper component of the model.
This component consisted of an interactive activation network which took the inflectional output of the feedforward network and allowed the phonemes to compete, supporting those that co-occurred in its inflected form and inhibiting those that formed part of an inconsistent inflection. Such competitive networks have the general property that strong patterns tend to get stronger (the “rich get richer” effect noted in the word perception model of McClelland and Rumelhart, 1981). However, if there are two close competitors, the time required to settle into a single stable output will be longer than if there is only one candidate output. This means that the model gives a response which is discrete with regard to the actual output (only actual phonemes and not impossible blends) but is also continuous with regard to
response time. This second response dimension can then be mapped directly to predictions about response latencies from human subjects. Units in the interactive activation network were divided into two pools, one for the stem vowels2 and the other for the inflectional vowels and the -ed suffix. There were inhibitory connections among the members of each pool, inhibitory connections from the inflectional vowels to the vowels of the stem, and an excitatory connection from the -ed suffix unit to the stem vowel units. There were no connections from the stem vowels back to the inflection vowels or -ed suffix. All excitatory connections had values of 1.0, and all inhibitory connections had weights of -1.0. The strength of the excitatory influence was set to 0.2, and the strength of inhibitory influence to 0.55.
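The competitive dynamics can be caricatured in a few lines. The connection strengths (0.2 excitatory, 0.55 inhibitory) and the stopping conditions follow the text, but the additive update rule, the clipping at zero, and the restriction to a single pool of competing inflections are our own simplifying assumptions.

```python
EXC, INH = 0.2, 0.55   # excitatory and inhibitory strengths from the text

def settle(activations, tol=0.009, max_cycles=500):
    """Let candidate inflections compete until the pattern stabilizes or only
    one unit remains above zero. Returns (winning unit, cycles to settle)."""
    acts = dict(activations)
    for cycle in range(1, max_cycles + 1):
        new = {}
        for unit, a in acts.items():
            rivals = sum(v for u, v in acts.items() if u != unit)
            # self-support lets strong patterns get stronger; rivals inhibit
            new[unit] = max(0.0, a + EXC * a - INH * rivals)
        change = max(abs(new[u] - acts[u]) for u in acts)
        acts = new
        alive = [u for u, a in acts.items() if a > 0.0]
        if change < tol or len(alive) <= 1:
            return max(acts, key=acts.get), cycle
    return max(acts, key=acts.get), max_cycles
```

With one dominant candidate the pool settles almost immediately; with two close competitors the same code takes several more cycles, which is the settling-time effect exploited in section 6.4.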
6.3 Training The feedforward network was trained with the backpropagation learning algorithm, with a learning rate of 0.01, momentum of 0.9, and minimizing cross-entropy rather than the more common least squared error3. The network was trained for 10 passes through the data set, and the inflectional output at that point was passed on to the competitive network. The competitive network was run for as many cycles as it took for the network to settle to a clear response. In general, the network was considered to have settled if continuing iterations did not lead to a response change greater than 0.009. In practice, however, the network was also stopped when all potential competitors except one had been driven to activation levels below 0. In the test phase, test items were processed using the learned weights in the feedforward network, and the output sent on to the competition network in the same way. A total of five networks were run, with different initial random weights. Below we report results averaged over the responses of the five networks.
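Footnote 3's point about cross-entropy can be made concrete. For a sigmoid output unit, the cross-entropy cost and its output-layer error term look as follows; this is a standard textbook sketch, not code from the original simulations.

```python
import math

def sigmoid(x):
    """Logistic activation of a unit's net input."""
    return 1.0 / (1.0 + math.exp(-x))

def cross_entropy(targets, outputs, eps=1e-12):
    """Distance between target probabilities and output probabilities."""
    return -sum(t * math.log(o + eps) + (1 - t) * math.log(1 - o + eps)
                for t, o in zip(targets, outputs))

def output_delta(target, output):
    # d(cost)/d(net input) for a sigmoid unit under cross-entropy
    return output - target
```

Under least squared error the delta would carry an extra factor of output * (1 - output), which shrinks toward zero for confident units; cross-entropy drops that factor, which is the better use of error information the footnote refers to.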
6.4 Results 6.4.1 Training results By the end of the 10 epochs of training the expected output was produced correctly for all training patterns in the five networks. In certain cases there were competing inflectional patterns, as shown by partial activation of the relevant output units. For example, the Class 6 patterns nist, zir and vis overlapped to some extent with the i → a pattern of Class 1, so that the a unit received a certain amount of activation. However, this was immediately suppressed by competition from the fully active i unit in the interactive phase of the model, leaving nisted, zired, and so on as the only choice.

2. The competition network was applied only to those units that yielded partial activations. Since the stem consonants were always identity-mapped from input to output, their form was never ambiguous, allowing these output values to be converted directly to their phoneme equivalents.
3. Cross-entropy is a cost function that minimizes the distance between target probabilities and actual probabilities, and is appropriate in cases where the outputs can be interpreted as the probability of occurrence of some item - an interpretation that is valid in the present case. In addition, when error is measured in terms of cross-entropy, the partial derivative of the error with respect to changes in the weights has a form which makes better use of error information, for the present problem, than does least squared error.

6.4.2 Generalization results A set of novel items was then developed to test the generalization ability of the network. We first created a random set of novel items by concatenating onset, vowel, and coda phonemes into combinations not seen in the training set, and compared each novel item against each item in the training data to determine which member of the training set was closest (measured by Euclidean distance) to each novel item. Items were eliminated from the test list if their closest target was a learned member of Class 6, or if they were equally close to members of two or more training classes. This left 32 novel patterns which were unambiguously closest to the pattern of a single inflectional class. The average distance between each test item and the closest member of a vowel change class is 1.72; the average distance between the test items and the closest member of Class 6 is 2.05.

i. Choice of inflection To review, the assumption behind this model is that the network will learn the criteria that govern the vowel change classes, and place a novel item in a vowel change class if it fits those criteria closely enough. By contrast, the model should not be bound by phonological similarity when generalizing membership to the default class. This makes the following predictions for our test set. First, novel items will show a tendency to generalize into the class of their closest target. Second, items that do not generalize into the closest vowel change class should be placed in Class 6 at a much higher rate than into a competing vowel change class.
This is expected if generalization to Class 6 is not governed by phonological similarity, but generalization to the vowel change classes is so governed. This is related to the third prediction, that as the distance between test item and target vowel change class increases, the tendency to generalize novel items into the target class should drop off. The 32 items in the test set, and the network responses to each, are given in the Appendix. Table 7 summarizes the generalization behavior of these items, averaged across the 5 network trials. As the table shows, the predictions are borne out. A sizable proportion of the test items did indeed inflect on the model of the closest vowel change class. Overall, however, the majority of test items took the Class 6 inflection, regardless of their lack of similarity to members of Class 6, while only 10% of the test set inflected according to a less similar vowel change class.4

4. These results are consistent with those found for human subjects in Bybee and Moder (1983). In that study, when the novel verb fit the string prototype on all features, 18% of the subjects' responses used the suffix -ed. As the fit to the prototype became less good, the percentage of -ed inflection rose to 50% or more. Although at all degrees of deviation from the prototype the majority of responses involved either the regular -ed or the expected vowel changes /^/ or /æ/, overall 17% of the vowel change responses involved some other vowel.
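The test-set construction described above amounts to a nearest-neighbor filter. A sketch follows, assuming item vectors produced by a slot-based phonological encoding; the function names are ours.

```python
import math

# Keep a candidate novel item only if its single nearest training neighbor
# (by Euclidean distance over the input vectors) belongs to one class.
def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_class(item_vec, training):
    """training: list of (vector, class_id) pairs. Returns the class of the
    nearest training item, or None if the minimum is tied across classes."""
    dists = sorted((euclidean(item_vec, vec), cls) for vec, cls in training)
    best, best_cls = dists[0]
    ties = {cls for d, cls in dists if abs(d - best) < 1e-9}
    return best_cls if len(ties) == 1 else None
```

Items for which closest_class returns None, or returns the default class, would be discarded, leaving only items unambiguously closest to a single vowel change class.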
TABLE 7. Results for all 32 test items, simulation two. Average of 5 network trials.

Class                        Percentage of responses
Closest vowel change class   39% (n = 12.6)
Class 6 (default)            51% (n = 16.2)
Other vowel change class     10% (n = 3.2)
In order to test the third prediction, that the tendency to generalize into the closest vowel change class would decrease as the distance between test and training exemplars increased, we next divided the 32 test items into two groups based on their degree of overlap with members of the target class. This was measured by subdividing each test item into onset, vowel, and coda, and grouping the items according to how many of these constituents differed from the phonological template of the closest class. For example, the test item dif was closest to Class 1, and differed from the Class 1 template of i + stop only in the coda. Korv, on the other hand, is closest to Class 2, but differs from the Class 2 template (e + one C, a fricative) in both the coda and the vowel. Thus dif is a member of the ‘close’ group, while korv is ‘more distant’. Measuring distance by number of constituents in common correlates, of course, with the Euclidean distances: for the close group the average Euclidean distance between test and target item is 1.42 (average distance to Class 6 is 2.02), while for the more distant group the average is 1.8 (average to Class 6 is 2.1). There were 19 members in the close group and 13 in the more distant group.

Table 8 gives the results for the test set divided along these lines. Again, the results are given as percentages of responses averaged over the five networks. Note that in the close group 50% of items inflect in the manner of the vowel change class they most closely resemble. In the more distant group, however, where the test items are less similar to the vowel change class, the proportion of such inflections drops from 50% to 23%, with a corresponding rise in the proportion of Class 6 inflections. Generalization to more distant vowel change classes remains stable at around 10%.

TABLE 8. Results for test items, subdivided according to distance from target class.

                             Percentage of responses
Class                        Close group (n = 19)   More distant group (n = 13)
Closest vowel change class   50% (n = 9.6)          23% (n = 3.2)
Class 6 (default class)      40% (n = 7.4)          68% (n = 8.6)
Other vowel change class     10% (n = 2)            9% (n = 1.2)
ii. Time to settle to a response At the end of the ten epochs of training, items in the training set took an average of 15 iterations to settle to an unambiguous output in the competitive network. Using this baseline as the expected time to settle, we looked at the number of iterations required by the test items. In 13 of the 32 test cases there was no ambiguity in the output of the feedforward component. These items settled to a single response very rapidly, and will not be considered further. In the other 19 cases, where there was interference between two (or more) competing responses, items took correspondingly longer to settle. All settling times are given in Table 9.

A look at Table 9 shows that competition is always between two relatively close vowel-change competitors, or between a close vowel-change class and Class 6. There are six cases in which the network correctly inflects a novel item on the pattern of the most similar vowel change, but this response is slowed by a competing pull from Class 6 (durf, rug, famb, pur, fud, luk). In a further five the interference comes from the next closest vowel change class (dif, saNk, tiv, zuNk, ræNg). In each of these cases, although the competitor is more distant it is still near enough to exert an influence (distances for these items are given in Table 9). In three additional cases, in fact, the second closest vowel-change class is strong enough to defeat the competition (pirf, girf, famb). In the final six examples the network settles on a Class 6 response, but only after inhibiting the activation of a vowel-change competitor (tæmb, kurs, zæN, domb, zæNk, fomb).
TABLE 9. Settling time and competitor distance, simulation two

                                                                   Distance to:
Input   Output    Output   Expected   Cycles      Interfering      Expected   Class 6   Interfering
                  class    class      to settle   item (class)                          item
bæz     bæzed     6        2          10                           1.4        2.23
dæp     dæped     6        1          10                           1.4        2.0
fus     fused     6        2          10                           1.4        2.0
gup     guped     6        1          10                           1.4        2.0
kob     kobed     6        4          10                           1.7        2.0
korv    korved    6        2          10                           1.7        2.0
læmb    læmbed    6        4          10                           2.0        2.23
pant    panted    6        5          10                           1.7        2.0
ræmb    ræmbed    6        4          10                           2.0        2.23
ramb    romb      5        5          10                           1.7        2.23
ved     veded     6        2          10                           1.4        1.7
vurz    værz      3        3          10                           1.4        1.7
zars    zarsed    6        3          10                           1.4        1.7
durf    dærf      3        3          15          durfed (6)       1.4        2.23
famb    fomb      5        4          30          fambed (6)       1.4        2.23
fud     fod       5        5          50          fuded (6)        1.4        2.0
luk     lok       5        5          50          luked (6)        1.4        2.0
pur     pær       3        3          30          pured (6)        1.7        2.0
rug     rog       5        5          20          ruged (6)        1.4        2.0
dif     daf       1        1          20          dif (2)          1.4        2.0       2.0 (2)
ræNg    ruNg      4        4          45          roNg (5)         1.4        2.23      1.7 (5)
saNk    soNk      5        5          20          suNk (4)         1.7        2.23      2.0 (4)
tiv     tav       1        1          30          tiv (2)          1.4        2.0       2.0 (2)
zuNk    zoNk      5        5          30          zuNk (4)         1.7        2.23      2.0 (4)
fomb    fombed    6        4          200         fumb (4)         1.4        2.23      2.0 (4)
girf    garf      1        3          20          gærf (3)         1.4        1.73      1.7 (1)
pirf    parf      1        3          20          pærf (3),        1.4        1.7       1.7 (1)
                                                  pirfed (6)
domb    dombed    6        4          50          dumb (4)         2.0        2.23
kurs    kursed    6        3          40          kærs (3)         1.4        2.0
tæmb    tæmbed    6        4          20          tumb (4)         2.0        2.23
zæN     zæNed     6        4          40          zoN (5),         1.7        2.0       2.0 (5)
                                                  zuN (4)
zæNk    zæNked    6        5          40          zoNk (5),        1.7        2.23      2.0 (4),
                                                  zuNk (4)                              1.7 (5)
These results show that competing possibilities interfere with each other during the production of inflected forms, slowing that process. If this is correct, it predicts that human subjects will also be slowed in their response times when inflecting a verb that is a relatively good fit to more than one inflectional class.
7.0 Discussion The results show that a multi-layer network is capable of developing a default category even in the absence of superior type frequency. Following the suggestion in Plunkett and Marchman (1991), our test assumes that the crucial factor is not the size of the default class, but the structure of the non-default classes. It is striking that the structure required for a network to learn a relatively small default class is that which is found in the real-language examples of the phenomenon.

Default categorization has been treated as an example of the sorts of phenomena that require rule-based accounts of inflectional morphology, and the fact that some languages exhibit default inflectional categories of relatively small size has been put forth as evidence of the inadequacy of the connectionist approach. The claim has been made that because networks learn by principles of similarity, default categories can only be learned in cases where the categories are broadly populated, so that novel forms (which are to be classified by the default mapping) may have some existing form which can serve as an attractor. The above results, however, demonstrate that connectionist models can successfully capture so-called ‘default’ behavior and that in multilayer network architectures, generalization to novel patterns is not strictly dependent on their similarity (measured as distance in space) to known items.

To understand why this happens, it is useful to imagine the stimuli as lying in an input landscape (see Figure 5). This landscape represents dimensions which are relevant to the classification of the forms; they may include phonological, morphological, syntactic, semantic, and other characteristics. The position of a word in this space thus corresponds to its featural description.
The job of the network (and, we believe, of language users) is to learn which regions of this input space correspond to which categories; in the present task, these different categories then require the network to produce different morphological alternants on the output. Networks that lack a hidden layer may fail at this task in cases where words which are featurally very different must be classified in the same way. Such two-layer networks are able to group forms only on the basis of input similarity. Referring to the hypothetical space in Figure 5, this means that the forms def1 and def2 could not be placed in the same category, because each form is closer to some other form than to the other. Hidden layers, on the other hand, permit the network to form abstract internal representations which do not depend solely on input-based similarity. Instead, the network can represent items which are physically very different as similar because they belong to the same abstract category.
FIGURE 5. Hypothetical input landscape. Forms are defined in terms of input characteristics in two dimensions (x and y axes); their proximity to one of the 5 basins in the landscape determines their categorization. The default category is the flat surface of the “table top.” Two items, def1 and def2, are also shown. Although these forms are each closer to another category than to each other, they fall outside the 5 basins of attraction which define the non-default categories. Since they are on the table top, which defines the “elsewhere condition”, they are classified as belonging to category 6 (the default).
In the current simulations, there are two conditions which together are responsible for the emergence of the default category. First, the phonologically well-defined classes occupy bounded regions in the input space. These items form attractors which serve as strong prototypes for analogizing to novel forms; these attractors are shown in Figure 5 as deep wells or basins. The likelihood of a novel form being treated as a member of such a class depends on how close the novel form is, in input space, to the basin (i.e., how phonologically similar it is). In the figure, we see five such basins. The rest of the space, the “table top” lying outside any of these basins, is what constitutes the default. The first requirement, then, is that the non-default categories be sufficiently well and narrowly defined that their basins of attraction are restricted. Secondly, the default category itself must be represented by items which are spread throughout the remaining space. It is not necessary that this space be well-populated; in the current simulations very few exemplars were required. What is necessary is that these examples serve to isolate the regions of attraction of the non-default categories (more precisely, they establish hyperplanes around those basins).

The effect of both conditions is that the network learns, through relatively few examples, that any item which does not resemble one of the five well-defined classes is to be treated in the same way. Thus this region is defined negatively; it is in fact precisely the “elsewhere condition” which is often identified with the default. We point out that these two conditions need not co-occur in natural language. Furthermore, both conditions may be satisfied to greater or lesser extents. This implies that defaulthood itself falls along a continuum. One might imagine languages in which the inflectional morphology has no default categories; others which contain a single well-defined default; and still others which have multiple defaults of graded strength and differing degrees of applicability. Further empirical work is obviously required; at this point we hypothesize that such a spectrum of defaults is indeed to be observed cross-linguistically, and that the variation will correlate with the strength of the factors studied here.
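The geometry of Figure 5 can be caricatured in a few lines: each non-default class is a basin around a prototype, and anything beyond every basin's radius falls on the “table top” and receives the elsewhere classification. The prototypes, radius, and labels below are hypothetical illustrations; a trained network learns these regions rather than having them specified.

```python
import math

def classify(item, prototypes, radius, default="default"):
    """prototypes: dict mapping class label -> prototype vector.
    An item inside some basin is analogized to that class; an item outside
    every basin receives the elsewhere (default) classification."""
    best_label, best_dist = None, float("inf")
    for label, proto in prototypes.items():
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(item, proto)))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= radius else default
```

Note that two items can both receive the default label while being arbitrarily far from each other, exactly the situation of def1 and def2 in the figure.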
References

Bybee, J. (in press). Regular morphology and the lexicon. Language and Cognitive Processes.
Bybee, J. (1985). Morphology: A study of the relation between meaning and form. Philadelphia: John Benjamins.
Bybee, J.L., and Moder, C.L. (1983). Morphological classes as natural categories. Language, 59, 251-270.
Daugherty, K.G., and Hare, M.L. (1994). What's in a rule? The past tense by some other name might be called a connectionist net. In M.C. Mozer, P. Smolensky, D.S. Touretzky, J.L. Elman, and A.S. Weigend (Eds.), Proceedings of the 1993 Connectionist Models Summer School. Hillsdale, NJ: Erlbaum.
Daugherty, K.G., and Seidenberg, M.S. (1992). The past tense revisited. In Proceedings of the 14th Annual Meeting of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Daugherty, K.G., and Seidenberg, M.S. (in press). Beyond rules and exceptions: A connectionist modeling approach to inflectional morphology. In S. Lima (Ed.), The Reality of Linguistic Rules. John Benjamins.
Hare, M.L., and Elman, J.L. (in press). Learning and morphological change. Cognition.
Hare, M.L., and Elman, J.L. (1992). A connectionist account of English inflectional morphology: Evidence from language change. In Proceedings of the 14th Annual Meeting of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Kim, J., et al. (1991). Why no mere mortal has ever flown out to center field. Cognitive Science, 15, 173-218.
McCarthy, J., and Prince, A. (1990). Foot and word in prosodic morphology: The Arabic broken plural. Natural Language and Linguistic Theory, 8, 209-283.
McClelland, J.L., and Rumelhart, D.E. (1981). An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review, 88, 375-407.
Minsky, M., and Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press.
Pinker, S., and Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73-193.
Plaut, D.C., and McClelland, J.L. (1993). Generalization with componential attractors: Word and nonword reading in an attractor network. In Proceedings of the 15th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Plunkett, K., and Marchman, V. (1991). U-shaped learning and frequency effects in a multi-layered perceptron: Implications for child language acquisition. Cognition, 38, 43-102.
Plunkett, K., and Marchman, V. (1993). From rote learning to system building. Cognition.
Prasada, S., Pinker, S., and Snyder, W. (1990). Some evidence that irregular forms are retrieved from memory but regular forms are rule generated. Paper presented at the Psychonomic Society meeting, November 1990.
Prasada, S., and Pinker, S. (1993). Generalization of regular and irregular morphological patterns. Language and Cognitive Processes, 8, 1-56.
Quirk, R., and Wrenn, C.L. (1975). An Old English Grammar. London: Methuen.
Rosenblatt, F. (1962). Principles of Neurodynamics. New York: Spartan.
Rumelhart, D.E., Hinton, G., and Williams, R. (1986). Learning internal representations by error propagation. In D.E. Rumelhart and J.L. McClelland (Eds.), Parallel Distributed Processing, Vol. I. Cambridge, MA: MIT Press.
Rumelhart, D., and McClelland, J. (1986). On learning the past tense of English verbs. In J.L. McClelland and D.E. Rumelhart (Eds.), Parallel Distributed Processing, Vol. II. Cambridge, MA: MIT Press.
Seidenberg, M., and Bruck, M. (1990). Consistency effects in the generation of past tense morphology. Paper presented at the thirty-first meeting of the Psychonomic Society, November, New Orleans.
Wright, J., and Wright, E. (1925). Old English Grammar. Oxford: Oxford University Press.
Appendix: Network responses to test set

For each test item the table gives the closest vowel change class (and its expected past tense vowel), the closest training example, the Euclidean distances to the closest class and to Class 6, and the responses of the five networks. Items are grouped as in Table 8.

Close group:

Input   Class  Vowel  Example     Dist. to      Dist. to   Net 1    Net 2    Net 3    Net 4    Net 5
                      of class    closest       Class 6
durf    3      æ      derf        1.4           2.23       dærf     dærf     dærf     dærf     dærf
dif     1      a      dip, did    1.4           2.0        daf      daf      daf      daf      daf
fud     5      o      fad         1.4           2.0        fod      fod      fod      fod      fuded
ræNg    4      u      riNg        1.4           2.23       ruNg     ruNg     ruNg     roNg     ruNg
rug     5      o      rag         1.4           2.0        rog      rog      rog      rog      rog
tiv     1      a      tid, tik    1.4           2.0        tav      tav      tav      tav      tav
saNk    5      o      sad         1.7           2.23       soNk     soNk     soNk     soNk     soNk
vurz    3      æ      verz        1.4           1.7        værz     værz     værz     værz     værz
gup     1      a      gip         1.4           2.0        guped    guped    gap      gap      gap
luk     5      o      lak         1.4           2.0        lok      luked    lok      lok      luked
kurs    3      æ      kers        1.4           2.0        kursed   kursed   kærs     kærs     kursed
fomb    4      u      fimb        1.4           2.23       fombed   fombed   fombed   fombed   fombed
bæz     2      i      bez         1.4           2.23       bæzed    bæzed    bæzed    bæzed    bæzed
dæp     1      a      dip         1.4           2.0        dæped    dæped    dæped    dæped    dæped
fus     2      i      fes         1.4           2.0        fused    fused    fused    fused    fused
ved     2      i      ves         1.4           1.7        veded    veded    veded    veded    veded
zars    3      æ      zers        1.4           1.7        zarsed   zarsed   zarsed   zarsed   zarsed
famb    4      u      fimb        1.4           2.23       fomb     fomb     fomb     fomb     fomb
pirf    3      æ      perf        1.4           1.7        parf     parf     parf     parf     parf

More distant group:

Input   Class  Vowel  Example     Dist. to      Dist. to   Net 1    Net 2    Net 3    Net 4    Net 5
                      of class    closest       Class 6
zuNk    5      o      zak         1.7           2.23       zoNk     zoNk     zoNk     zoNk     zoNk
ramb    5      o      rag         1.7           2.23       romb     romb     romb     romb     romb
pur     3      æ      perf        1.7           2.0        pær      pær      pær      pær      pær
korv    2      i      kev         1.7           2.0        korved   korved   korved   korved   korved
kob     4      u      kimb        1.7           2.0        kobed    kobed    kobed    kobed    kobed
domb    4      u      kimb        2.0           2.23       dombed   dombed   dombed   dombed   dombed
læmb    4      u      pind        2.0           2.23       læmbed   læmbed   læmbed   læmbed   læmbed
pant    5      o      pag, pak    1.7           2.0        panted   panted   panted   panted   panted
tæmb    4      u      kimb        2.0           2.23       tæmbed   tæmbed   tæmbed   tæmbed   tæmbed
zæNk    5      o      zak         1.7           2.23       zæNked   zuNk     zæNked   zæNked   zæNked
ræmb    4      u      zimb        2.0           2.23       ræmbed   ræmbed   ræmbed   ræmbed   ræmbed
zæN     4      u      zing        1.7           2.0        zæNed    zæNed    zuN      zæNed    zæNed
girf    3      æ      terf        1.4           1.73       garf     garf     garf     garf     garf