The Iterative Learning of Phonological Constraints

T. Mark Ellison

From the earliest days of generative grammar, developing a general method for grammar acquisition has been a major goal of linguistic theory. Chomsky (1975) proposed that grammars be selected for their simplicity: the simplest grammar which fits the data is the best one. At that time, no computationally useful definition of simplicity was available. Recently, the method variously known as minimum message length or minimum description length has proved very successful in selecting between hypotheses. This paper presents a simplicity measure for violable phonological constraints based on the minimum message length method. This measure captures the intuitive desiderata of conciseness, accuracy and precision. A family of constraints can be specified by parameterising a specific constraint, and so forming a template. The combination of this measure with a search algorithm is a powerful learning method for finding the best constraint matching a template and fitting a corpus. This method may be applied iteratively, using the same template, to learn a number of different constraints. Five applications of an implementation show some of the successes of this learning method, from learning consonant cluster constraints to vowel harmony.

1. Introduction

This paper presents an algorithm for learning phonological constraint systems which are accessible and meaningful in terms of linguistic theory. The primary motivation for the work is the development of tools to assist linguists in making phonological generalisations about languages. The generalisations found by the program described here can form part of an analysis directly, or, alternatively, can suggest directions for investigation by the user. In the latter use, the program forms an important part of the grammar development environment described by Bird & Ellison (1992b). It is important to stress that modelling the process of language acquisition by speakers was not a motivation here.
Consequently, psychological plausibility has not been a factor in developing the algorithm. This does not mean that there is no relationship between it and the psycholinguistics of acquisition or production, merely that such a relationship is fortuitous and not an issue for this paper. The data which is supplied to the learning program can be phonetic or phonemic. Orthographic data is used where it is approximately phonemic or allophonic. The generalisations which the program learns are generalisations familiar to phonologists: vowel harmonies, consonant cluster constraints, assimilations, etc. In generative phonology, rules are transformations on representational structure which are applied in a particular sequence to derive surface forms from underlying forms. In constraint-based phonology, constraints impose restrictions on possible forms, ruling out some candidates. There is no sense of a transformation of structure. These constraints may be prioritised, as in Optimality Theory (Prince & Smolensky, 1993), or absolute, as in declarative theories of phonology (Bird, 1994; Broe, 1993). Phonology, and constraint-based phonology in particular, are described in greater depth in section 3.

* University of Edinburgh, Centre for Cognitive Science, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland; [email protected].

© 1991 Association for Computational Linguistics
Computational Linguistics, Volume 20, Number 3

This paper presents a method for the machine learning of phonological generalisations using the minimum message length (MML) paradigm (Wallace & Boulton, 1968; Wallace & Freeman, 1992), also known as the minimum description length (MDL) paradigm (Rissanen, 1978, 1987).1 In this paradigm, a hypothesis is evaluated according to the conciseness of its own representation, which is a reflection of its simplicity, and the conciseness with which it allows the relevant data to be represented, reflecting the accuracy and precision of the hypothesis. The best hypothesis makes both these representations as concise as possible. Under this method, learning begins with a corpus of data, and a template defining a set of hypotheses. The learning program identifies the hypothesis in the set which has the best evaluation according to the MML criterion. The measure and the resulting learning program are presented in section 4.

The learning algorithm is iterative in the sense that from the same template and data, the algorithm can learn a number of non-identical constraints. It learns the constraints in sequence: the first constraint is learnt on its own; subsequent constraints are learnt by evaluating their independent contribution to the constraint system containing all previously discovered constraints. For example (see section 5.3), from a general model of vowel harmony and data about Turkish vowel sequences, the method can learn front harmony.
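The two-part evaluation just described (the cost of encoding the hypothesis plus the cost of encoding the data with its help) can be illustrated with a toy sketch. The coding scheme below is an assumption for illustration only, not the measure developed in the paper: a "constraint" is simply a set of symbols the corpus is claimed to draw from.

```python
import math

# Toy sketch of a two-part MML evaluation (an assumed coding scheme,
# not the paper's measure).  Part one prices the hypothesis: one bit
# per alphabet symbol, marking it as allowed or disallowed.  Part two
# prices the data given the hypothesis: each corpus symbol is sent
# with a one-bit flag (does it obey the constraint?) followed by an
# index into the allowed set, or into its complement for exceptions.
# The best constraint minimises the total, balancing simplicity of
# the hypothesis against its accuracy and precision on the data.

def mml_score(allowed, corpus, alphabet):
    hypothesis_bits = len(alphabet)          # one in/out bit per symbol
    exceptions = len(alphabet) - len(allowed)
    data_bits = 0.0
    for symbol in corpus:
        data_bits += 1.0                     # flag: obeys the constraint?
        if symbol in allowed:
            data_bits += math.log2(len(allowed))
        else:
            data_bits += math.log2(exceptions)
    return hypothesis_bits + data_bits
```

On a corpus of twenty symbols drawn only from {a, e}, against a five-vowel alphabet, the tight constraint {a, e} scores 45.0 bits, while the vacuous constraint allowing all five vowels scores about 71.4 bits, so the tighter, equally accurate hypothesis wins.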
Then, the best constraint to combine with front harmony to capture the structure of the vowel sequences is rounding harmony. This is one of five applications of the learning method described in section 5.

2. Related Work

The primary motivation for this work is to develop a tool which assists linguists in building linguistic analyses from examples of phonological forms. There are therefore three desiderata which must be fulfilled to achieve this goal.

- The tool must provide an overt statement of generalisation.
- The generalisation must be a linguistic one, and motivated by linguistic concerns.
- The analyses should be specified a priori as little as possible.

We can compare existing work in the machine learning of phonology against these three criteria to see how useful they are for achieving these aims. This work can be cast into three major categories: connectionist, statistical and symbolic.

2.1 Connectionist learning

Connectionist methods have recently been applied to learning phonology. The limitations of space preclude a detailed description of each of these applications, but a brief statement of a representative sample is given in (1).

1 Li & Vitányi (1993) provide an excellent overview of this paradigm in its relation to algorithmic complexity theory, Bayesian inference and information theory.

(1)

Rumelhart & McClelland (1986): a back-propagation network learning to form the past tenses of English verbs
Hare (1990): a sequential network to model Hungarian vowel harmony
Goldsmith & Larson (1990): modelling syllable structure with local interaction networks
Daelemans et al. (1993): learning Dutch stress placement using back-propagation (and comparing it with the results of other machine-learning techniques)
Shillcock et al. (1993): a simple architecture to learn phonological sequencing constraints
Bellgard (1993): an extended Boltzmann machine to learn Turkish vowel harmony

Let us look at one interesting example in more depth. Gasser & Lee (1989) have developed a connectionist network to learn morphological alternations of a particular kind. They present their network with the semantic content to be expressed and the part of the word so far constructed. The network is trained to predict the next phoneme. This approach to generating forms is psychologically plausible, they claim, as it generates words incrementally. The incremental learning system has succeeded in learning to construct morphologically complex forms in Turkish. These forms are particularly interesting because the system must apply the non-local constraints of vowel harmony to select the correct allomorph. The system must make phonological generalisations about the vowel sequences in order to identify the harmony class of the word roots. Thus, the system must have learnt the phonology governing Turkish vowel harmony. The second example considered by Gasser and Lee learns to generate plural forms of English nouns. Plural formation requires the distinction between voiced and unvoiced consonants. Learning to construct these related forms of words requires that the neural net abstract the phonological distinctions of coronal/non-coronal and voiced/unvoiced.

Connectionist systems do not communicate the generalisations which they learn. For this reason, they fail to match the first desideratum. While such systems may learn interesting structure about the phonology of a language, their inability to communicate it means that they can only function as a source of data, and not as a source of linguistic understanding.

2.2 Statistical learning

Some statistical analyses share the same limitations as the connectionist models,2 in that they do not communicate generalisations about what they have learned, but are able to apply internal generalisations to new cases.
An example of statistical learning of this kind is the instance-based learning of Dutch stress patterns described by Daelemans et al. (this volume). Their system constructs a database of Dutch words with the correct stress assignment. When a new word is encountered, the most similar word in the database is identified and used as a pattern to assign stress to the new word. In contrast to learning systems which are mute about their generalisations, other statistical systems seek to find expressible generalisations about the data. A good exemplar of these is the system described by Powers (1991). His system identifies skeletal items using techniques developed for the analysis of syntax. In syntax, the closed class words, articles, prepositions,

2 In fact, connectionist systems can be viewed as one particular form of statistical analysis.

auxiliary verbs and others, do not function so much to carry significant information, but rather to frame the more contentful open-class words, such as nouns, verbs and adjectives, in a structured environment. Hence the term skeletal. Powers argues that closed class words can be identified by their statistical relationships to their immediate environment, and presents an algorithm for doing so. Powers applies the same technique to a large corpus of English text, in order to discover the skeletal components of English orthography. His program yields the five vowels as the primary structuring items of written English words. To the extent that English spelling is a phonological representation, albeit of Middle English, Powers' system can be described as learning a significant aspect of phonology.

While this system successfully learns to distinguish vowels from consonants as structuring rather than content-carrying items, and can even learn digraphs, it does not capture its generalisations in such a way that they have an immediate interpretation within a linguistic theory. The output of the system informs the linguist that the vowels form an interesting class, but does not indicate how to relate this class to the kinds of phonological structures or rules. In this way, Powers' method fails to comply with the second desideratum.

2.3 Symbolic learning

There has been an increasing amount of research in learning symbolic phonological models couched in the framework of a linguistic theory. A sample of these is shown in table (2).

(2)

Johnson (1984): learning morphophonological alternations from paradigms
Touretzky et al. (1990): given features, learning 3-level generative rules from underlying forms and surface forms, using the version spaces technique (Mitchell, 1978)
Stethem (1991): using explanation-based learning to simplify phonological rule sets
Brent (1993): learning English affixes from raw corpora, using MML

Dresher & Kaye (1990) offer a representative example of a symbolic learning system. They propose a learning algorithm for stress systems, making use of stress-system parameters such as those of Halle & Vergnaud (1987). Each of the eleven stress parameters is associated with a set of CUES which identify parameter settings. When provided with a new word in which each syllable is marked with stress levels (2 for strong stress, 1 for weak stress, 0 for no stress) and a syllabic analysis, the algorithm examines the cues associated with the parameters to see whether this word requires a modification of any of the parameters. As more and more example words are given to the system, the parameter values should come nearer to the correct settings. For example, (3) gives their parameters P3 and P4.

(3)

P3: Feet are built from the left or right edge.
P4: Feet are stressed at the left or right edge.
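This cue-driven elimination of joint P3/P4 settings can be sketched as follows, using the four joint-cue statements given in (4) below. The syllable encoding, the function names, and the literal reading of "must have stress 1" under each scanning direction are assumptions of this sketch, not Dresher & Kaye's implementation.

```python
# Hypothetical sketch of cue checking for the joint P3/P4 settings.
# A word is a list of (weight, stress) pairs: weight 'L' (light) or
# 'H' (heavy); stress 2 (strong), 1 (weak) or 0 (none), as in the text.

def surviving_settings(word):
    """Return the joint P3/P4 settings not contradicted by this word."""
    def violated(cue):
        direction, relation = cue
        sylls = word if direction == 'left' else list(reversed(word))
        for i, (weight, stress) in enumerate(sylls):
            if weight != 'L':
                continue
            # 'following': the light syllable comes after some syllable
            # in scan order; 'followed': it comes before some syllable.
            if relation == 'following' and i > 0 and stress != 1:
                return True
            if relation == 'followed' and i < len(sylls) - 1 and stress != 1:
                return True
        return False

    cues = {
        'left-left':   ('left',  'following'),
        'left-right':  ('left',  'followed'),
        'right-left':  ('right', 'following'),
        'right-right': ('right', 'followed'),
    }
    return {name for name, cue in cues.items() if not violated(cue)}
```

As in the text's example, a word whose light second syllable carries stress 2 after another syllable is incompatible with the left-left setting, so that setting is dropped from the surviving set.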

There are cues for joint settings of these parameters. (4) gives the cues used for these two parameters.

(4)


left-left: Scanning from the left, a light syllable following any syllable must have stress 1.
left-right: Scanning from the left, a light syllable followed by any syllable must have stress 1.
right-left: Scanning from the right, a light syllable following any syllable must have stress 1.
right-right: Scanning from the right, a light syllable followed by any syllable must have stress 1.

A word in which, scanning from the left, a light syllable with stress 2 follows another syllable contradicts the first cue, and so the parameter settings for P3 and P4 cannot be left-left. The algorithm that Dresher & Kaye (1990) describe will work with any language whose stress system can be described by the parameters used. The program does not interact with the user after the initial provision of data — in this sense, the program is untutored. The algorithm does, however, depend on the preanalysis of the data into syllables of various weights.

With the exception of Brent's, all the programs or algorithms described in the papers mentioned in (2) require a considerable amount of theory to be prespecified: underlying forms, syllabic parsings, paradigms or feature systems. This conflicts with the third desideratum, that the learning system require as little preanalysis of the data as possible.

In the closest antecedent to the current work, Ellison (1992, 1993) describes three programs based on MML. These learn syllabicity distinctions, sonority hierarchies and vowel harmonies. In each case, the programs are motivated by linguistic models, and return abstract analyses. Furthermore, the data supplied to the programs consists of sequential phonological transcriptions, taken from continuous speech or text, with word boundaries marked. Little preanalysis of the data is required. The algorithm described in the present paper is a generalisation of the vowel harmony program described by Ellison (1992, 1993).

3. Autosegmental Rules, Constraints and Templates

This section introduces the phonological ideas required for understanding the learning program.
The first two sections describe generative autosegmental phonology, so the reader familiar with this subject may wish to commence at section 3.3. Readers who are well-grounded in declarative phonology may also wish to skip sections 3.3 and 3.4.

3.1 Autosegmental phonology

Since the late 1970s, the autosegmental paradigm (Goldsmith, 1976) has dominated phonology. The defining characteristic of this paradigm is the use of a particular kind of non-linear representation. The representations consist of two kinds of objects: tiers, which are linear sequences of symbols called autosegments, and association lines drawn between autosegments on different tiers. Autosegments are names for particular kinds of phonological events extended through time, and the association lines can be interpreted as marking the overlap of these events (Sagey, 1988; Bird & Klein, 1990). In (5) the Turkish word geliyorum 'I am coming' is cast in an autosegmental representation which highlights the vocalic feature [+front].

(5)
        [+front]        [-front]
        |     \         |     \
    g   A   l   I   y   O   r   U   m

Here [+front], the property of being pronounced in the front of the mouth, is a feature of the vowels e, i, ö and ü, while [-front] is a feature of those vowels pronounced in the back of the mouth, namely a, ı,3 o and u. The capital letters indicate archisegments in which the front-back opposition has been factored out: A is either of the low unrounded vowels a or e, I is either of the high unrounded vowels ı and i, O is either of the low rounded vowels o and ö, and U is either of the high rounded vowels u and ü. The association line between A and [+front] shows that the vowel at this point must match both of these specifications — and it can only be the low unrounded front vowel e.

3.2 Rules

The paradigm shift from segmental (Chomsky & Halle, 1968) to autosegmental phonology was a shift in representation, not in process. Phonological analyses remained generative, with rewrite rules replacing sections of representations which matched their structural descriptions with the corresponding structural changes. The rules applied in a strict order or in cycles (6). Because an arbitrary number of transformations might separate underlying from surface forms, there need not be any obvious similarity between these two.

(6)
form (underlying)
   | transformation
form
   | transformation
...
   | transformation
form (surface)
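The ordered application of transformations pictured in (6) amounts to function composition in a fixed order. A minimal sketch (function and rule names are assumed for illustration):

```python
# Minimal sketch of ordered rule application as in (6): each
# transformation maps one form to the next, and the surface form is
# the result of applying the rules in their fixed order.

def derive(underlying, rules):
    form = underlying
    for rule in rules:   # strict ordering: earlier rules feed later ones
        form = rule(form)
    return form
```

For instance, derive('gAlIyOrUm', [lambda s: s.replace('A', 'e'), lambda s: s.replace('I', 'i')]) applies two toy rewrites in sequence, each operating on the output of the previous one.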

The effect of the paradigm shift was that the structural descriptions and changes became autosegmental representations, instead of strings. Let us now look at a typical phonological rewrite rule in autosegmental phonology, shown in (7a). This kind of rule is called a SPREADING rule. It asserts that if one autosegment is associated to another autosegment of a particular kind, then it is associated to the next autosegment of the same kind. The spreading rule shown in (7) spreads the feature [+front] across vowels.

(7)

(a)   [+front]
        |    .
        V    V

(the dotted line marks the association added by the rule)

(b) [+front < ]
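A spreading rule like (7a) can also be given a procedural sketch. Here a representation is a feature tier, a segmental tier, and a set of association lines stored as index pairs; these conventions, the archisegment table, and the function names are assumptions for illustration, not the paper's formalism:

```python
# Sketch: associations are (feature_index, segment_index) pairs linking
# the feature tier to the segmental tier, as in representation (5).

def spread(features, segments, associations):
    """Apply (7a): a feature associated to a segment extends its
    association to the next segment, as long as that segment is free."""
    result = set(associations)
    changed = True
    while changed:
        changed = False
        for f, s in sorted(result):
            nxt = s + 1
            if nxt < len(segments) and all(s2 != nxt for _, s2 in result):
                result.add((f, nxt))
                changed = True
    return result

# Turkish archisegments and their front/back realisations (section 3.1).
ARCHISEGMENTS = {
    'A': {'+front': 'e', '-front': 'a'},   # low unrounded
    'I': {'+front': 'i', '-front': 'ı'},   # high unrounded
    'O': {'+front': 'ö', '-front': 'o'},   # low rounded
    'U': {'+front': 'ü', '-front': 'u'},   # high rounded
}

def resolve(features, segments, associations):
    """Spell out archisegments using the features they are linked to,
    as with A + [+front] -> e in (5); other segments pass through."""
    linked = {s: f for f, s in associations}
    out = []
    for i, seg in enumerate(segments):
        if seg in ARCHISEGMENTS and i in linked:
            out.append(ARCHISEGMENTS[seg][features[linked[i]]])
        else:
            out.append(seg)
    return ''.join(out)
```

Starting from just two associations, [+front] to A and [-front] to O, spreading fills out the word ([+front] halts where the [-front] association begins), and resolution then yields geliyorum, matching (5).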