A Phonological Modeling System Based on Autosegmental and Articulatory Phonology
Jiping Sun and Li Deng

Speech and Information Processing Research Group University of Waterloo, Waterloo, Ontario, Canada N2L 3G1 fjsun,[email protected]

Abstract

This paper describes the design and implementation of a phonological modeling system based on autosegmental and articulatory phonology and its application in speech recognition. Pronunciation modeling is an integral part of speech recognition systems. Together with language modeling, signal processing and learning models (e.g. the Hidden Markov model and the neural network model), it influences the performance of a speech recognition system. The basic units adopted by most speech recognition systems for word pronunciation modeling are linearly ordered diphones or triphones (i.e. context-dependent segments without internal structure). Our approach decomposes segments into articulatory features and allows these features to spread freely into neighbouring segments to model continuous speech. We use a feature-based finite-state transducer as the metalanguage for writing phonological rules. A set of rules has been implemented in this language; the writing of these rules is based on theories of autosegmental and articulatory phonology. The numerical components of the rules are deduced from articulator trajectory data. The output of this system is gestural scores that model phonological alternations in continuous speech. The gestural scores are used to generate Hidden Markov Models for building speech recognition systems.

1 Introduction

Pronunciation modeling (S. Bird, 1995), (S. Bird and D. R. Ladd, 1991), (C. P. Browman and L. Goldstein, 1989) describes how words are pronounced in terms of linearly arranged pronunciation units (e.g. phonemes or allophones). The most common example of pronunciation modeling is a dictionary that gives phonetic alphabets of words. In speech recognition (L. Deng, 1997), (L. Deng and D. Sun, 1994), it is desirable to have models depicting how words are pronounced in a continuous speech environment. This is because the speech signals, usually presented to a recognition system in the form of Fourier-function-filtered spectrograms, are severely distorted for words in continuous-speech context compared with when the words are said carefully and in isolation. To recognize words in continuous speech, the system must have good knowledge of how the pronunciations change. Currently, most systems use a uniform, unstructured approach to this context-dependency phenomenon, most notably the triphone approach. The triphone approach is the most common one in phone-based systems of speech recognition and production. In a triphone approach, every pronunciation unit, such as a phoneme, expands into a set of variants (called triphones) which take into account the phonemes to its left and right. For example, corresponding to a phonemic description of the word bag, which may be /b ae g/, the triphone-based description would be /#-b+ae b-ae+g ae-g+#/, in which each unit is a context-sensitive phonemic variant. The assumption underlying this approach is that such a discriminative treatment should classify pronunciation units well enough to achieve satisfactory recognition. In practice, this approach has not been successful, especially with rapid, informal speech. For example, in our group's experiments with the Switchboard speech corpus, a well-known test suite for continuous speech recognition, the triphone-based word recognition rate is only around 60 percent.
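To make the triphone expansion concrete, the following is a minimal sketch in Python; the function name and the use of "#" as a boundary padding symbol simply follow the example above and are illustrative, not a fragment of any particular recognizer.

def to_triphones(phonemes):
    """Expand a phoneme string into context-dependent triphones.
    '#' marks the word (or utterance) boundary, as in /#-b+ae b-ae+g ae-g+#/."""
    padded = ["#"] + list(phonemes) + ["#"]
    return ["%s-%s+%s" % (padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

print(to_triphones(["b", "ae", "g"]))    # ['#-b+ae', 'b-ae+g', 'ae-g+#']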

The situation just described shows that contextual influence may come from larger distances, beyond the immediate left and right units. Then how about using quintphones? Because of combinatorial effects, this leads to scarcity of training data, which renders the approach impractical. More importantly, factors influencing the pronunciation of a phoneme may come from other dimensions, such as syllabic constituents, word and sense-group boundaries, stress patterns, etc. In order to take all these factors into account, we need a system for describing regularities and making predictions in a more intelligent way. In terms of rule-based modeling, an automaton-based metalanguage without feature structures deals with pronunciation units only as unstructured primitives. This type of modeling fails to capture the common characteristics existing between different units. Such common characteristics are frequently based on shared features. Therefore, a feature-based system can naturally express these common characteristics and establish rules in terms of these features. This makes a rule-based system more efficient and theoretically more elegant. Based on such considerations, we chose to use a feature-based system for the modeling task. This sets the background for our phonological modeling system development. In this paper, we first describe the problem phenomena and the theoretical basis of our system development. This is followed by the description of a feature-based, cascaded finite-state transducer as the metalanguage for writing phonological rules. The implemented phonological rules are then applied by the transducer to input phoneme strings to predict varying pronunciations in continuous speech. The output of the system is in the form of gestural scores. Finally, we describe how the system output can be applied to Hidden Markov Model-based speech recognition systems.

2 The Phoneme Alternation Problem

That phonemes undergo various kinds of changes or alternations in continuous speech has long been observed in the phonetic and phonological literature (A. C. Gimson, 1980). Explanations as to the cause of such changes often constitute a good knowledge source for developing a computing model. The most frequently quoted cause of such alternations is the assimilation of features from phonemes in context. First let us look at some examples of pronunciation alternations, in which changes in certain features are described. (In the explanations, we use PoA to denote place of articulation.)

Allophonic alternations within a word:
1. /t/ in try; PoA = post-alveolar, influenced by /r/.
2. /n/ in tenth; PoA = dental, influenced by /th/.
3. /u:/ in music; PoA = central, influenced by /j/.
4. /r/ in cry; devoiced, influenced by /k/.
5. /m/ in mean or seem; Lip = spread, influenced by /i:/.

Allophonic alternations at word boundaries:
1. /t/ in not that; PoA = dental, influenced by /dh/.
2. /l/ in at last; partly devoiced, influenced by /t/.
3. /w/ in at once; partly devoiced, influenced by /t/.
4. /j/ in thank you; partly devoiced, influenced by /k/.
5. /ax/ in bring another; nasalized, influenced by /ng/.

Phonemic alternations within a word:
1. /n/ in encounter becomes /ng/; PoA = velar, influenced by /k/.
2. /b/ in absolutely becomes /p/; devoiced, influenced by /s/.
3. /s/ in issue becomes /sh/; PoA = palatal, influenced by /y/.

Phonemic alternations at word boundaries:
1. /n/ in ten birds becomes /m/; PoA = labial, influenced by /b/.
2. /s/ in miss you becomes /sh/; PoA = palatal, influenced by /y/.
3. /d/ in send to loses its release; stricture = closed, influenced by /t/ in to.
4. /dh/ in with thanks becomes /th/; devoiced, influenced by /th/.
5. /z/ in these socks becomes /s/; devoiced, influenced by /s/.

The above examples of pronunciation alternation, and numerous others encountered in our research, have shown:





- Many pronunciation variations are attributable to the assimilation of features of phonemes in context.
- While most of these changes are influenced by the immediate context, others involve larger distances. For example, a nasal consonant may cause nasalization of a vowel not adjacent to it.
- Alternations caused by feature assimilation may be constrained by other conditions. For example, the unaspirated /p/ in speech requires /s/ to be in the same onset cluster as /p/. So, the /s/ in newspaper is unlikely to trigger this alternation, because that /s/ is the coda of the first morpheme.
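As a minimal illustration of stating such an assimilation in terms of shared features, the sketch below (in Python) copies the place-of-articulation feature of a following consonant onto a nasal, so that /n/ surfaces as /m/ before /b/ as in ten birds. The feature names and the small phoneme table are illustrative assumptions, not the feature set used later in the paper.

# Canonical feature bundles for a few phonemes (illustrative values only).
PHONEMES = {
    "n": {"cls": "nasal", "poa": "alveolar", "voice": "+"},
    "m": {"cls": "nasal", "poa": "labial",   "voice": "+"},
    "b": {"cls": "stop",  "poa": "labial",   "voice": "+"},
}

def assimilate_place(segment, right_neighbour):
    """Regressive place assimilation: a nasal takes the place of articulation
    of the following consonant, e.g. /n/ -> /m/ before /b/ as in 'ten birds'."""
    feats = dict(PHONEMES[segment])
    if feats["cls"] == "nasal" and right_neighbour in PHONEMES:
        feats["poa"] = PHONEMES[right_neighbour]["poa"]
    for phn, bundle in PHONEMES.items():   # does the altered bundle match a known phoneme?
        if bundle == feats:
            return phn, feats
    return segment, feats

print(assimilate_place("n", "b"))   # ('m', {'cls': 'nasal', 'poa': 'labial', 'voice': '+'})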

What are the basic characteristics of this phenomenon of assimilation? And what kind of feature system is appropriate for describing phoneme alternations (including allophonic and phonemic changes)? Before we describe our computing system design and implementation, it is essential to review some basic theories relevant to our questions and the problem at hand.

3 Theoretical Foundations

The autosegmental phonological theory presented in (J. A. Goldsmith, 1990) and (J. Harris, 1994) provides us with a framework for describing feature assimilation at an abstract level. Autosegmental phonology first requires a linearly arranged segment string as input. This segment string is placed on a tier called the melodic tier. Secondly, it requires the existence of several other autonomous tiers. These other tiers come from two sources: either from the features of the segments on the melodic tier, or from supra-segmental structures such as syllabic constituents, boundaries, stress patterns, tonal nucleus, vowel harmony patterns, etc. At the outset, each element on the other tiers is linked to one and only one element on the melodic tier by association lines. Then there is the basic principle of autosegmental phonology: most phonological processes can be reduced to two primitive operations on association lines, namely the insertion and the deletion of these lines. Based on this principle, assimilation is simply an insertion operation which links a feature to more than one phoneme. Figure 1 gives an autosegmental phonological account of lip feature assimilation (lip rounding) as an association line insertion operation.
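One minimal way to encode the melodic tier, an autonomous feature tier, and the association lines between them, together with the two primitive operations, might look as follows. This is a sketch only; the representation is an assumption, not the paper's implementation.

# The melodic tier is a list of segments; a feature tier is a list of autosegments,
# each recording which melodic-tier positions it is associated with.
melodic = ["k", "u:"]
lip_tier = [{"feature": "Lip=round", "links": {1}}]   # initially linked only to /u:/

def insert_association(tier, autoseg_index, melodic_index):
    """The primitive insertion operation: add one association line."""
    tier[autoseg_index]["links"].add(melodic_index)

def delete_association(tier, autoseg_index, melodic_index):
    """The primitive deletion operation: remove one association line."""
    tier[autoseg_index]["links"].discard(melodic_index)

# Assimilation as in Figure 1: Lip=round spreads from /u:/ to the preceding /k/.
insert_association(lip_tier, 0, 0)
print(lip_tier)    # [{'feature': 'Lip=round', 'links': {0, 1}}]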

Figure 1: Association line insertion (the Lip = round feature, initially linked only to /u:/ (voice+), becomes linked also to the preceding /k/ (voice-)).

In autosegmental phonology, phonemic features are usually arranged in what is called the melodic geometry, which has a hierarchical structure. An example of such an organization is shown in Figure 2. Any insertion or deletion operation on a class node will affect its daughter nodes. This brings efficiency to rule writing and makes it easier to group phonemes into natural classes.

Figure 2: Melodic geometry (a root node on the melodic tier dominating stricture, laryngeal and place features).

The theory of autosegmental phonology provides us with a framework for developing a computing system for phonological modeling in which phoneme alternations in continuous speech can be predicted. In order to express actual rules in a metalanguage, we need to specify concretely the phonemic features and the tiers on which these features are located. In this respect we now turn to the material discussed in (C. P. Browman and L. Goldstein, 1989). In their paper the featural system for phonemes is set against the background of articulatory phonology. In this theory, there are five autonomous featural tiers on which articulatory features can be placed. These tiers are based on articulator geometry and arranged as shown in Figure 3.

Figure 3: Articulatory tiers (the vocal tract divides into a supralaryngeal system and GLO; the supralaryngeal system into oral and nasal (VEL) subsystems; and the oral subsystem into LIP and the tongue, with TT and TB).

Here, the five tiers are Glo (glottis), Vel (velum), TB (tongue body), TT (tongue tip) and Lip (lips). On each tier, there are several dimensions of features with specific values. These features are related to articulator dynamics in pronunciation; they are listed below:

Five articulator tiers:
1. Lips
2. Tongue Tip
3. Tongue Body
4. Velum
5. Glottis

Features on these tiers and their possible values:
1. Constriction Degree (values: closed, critical, narrow, mid, wide)
2. Constriction Location (values: protruded, labial, dental, alveolar, postalveolar, palatal, velar, uvular, pharyngeal)
3. Constriction Shape (values: distributed, narrowed, retroflex, rounded)
4. Stiffness (values: stiff, relaxed)
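The tiers and feature dimensions listed above can be written down directly as a small data structure. The following Python sketch does so and adds canonical gestures for two phonemes; the concrete values in CANONICAL_GESTURES are illustrative assumptions rather than entries from the system's lexicon.

TIERS = ("lips", "tongue_tip", "tongue_body", "velum", "glottis")

FEATURE_VALUES = {
    "constriction_degree":   ("closed", "critical", "narrow", "mid", "wide"),
    "constriction_location": ("protruded", "labial", "dental", "alveolar", "postalveolar",
                              "palatal", "velar", "uvular", "pharyngeal"),
    "constriction_shape":    ("distributed", "narrowed", "retroflex", "rounded"),
    "stiffness":             ("stiff", "relaxed"),
}

# Canonical gestures per phoneme, keyed by tier (the values below are illustrative only).
CANONICAL_GESTURES = {
    "b":  {"lips": {"constriction_degree": "closed",
                    "constriction_location": "labial"}},
    "u:": {"tongue_body": {"constriction_degree": "narrow",
                           "constriction_location": "velar"},
           "lips": {"constriction_shape": "rounded"}},
}

def valid_gesture(gesture):
    """Check that a gesture uses only known feature dimensions and values."""
    return all(feat in FEATURE_VALUES and value in FEATURE_VALUES[feat]
               for feat, value in gesture.items())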

The system described in (C. P. Browman, L. Goldstein, E. Saltzman and C. Smith, 1986) and in (C. P. Browman and L. Goldstein, 1989) is designed for generating gestural scores showing temporal relations between articulatory features when they overlap in time. In (C. P. Browman and L. Goldstein, 1989), a temporal specification of a feature for a segment is called a gesture. It is assumed there that these gestures can overlap in time and, more importantly, that "simple changes in the patterns of overlap between neighbouring gestural units can automatically produce a variety of superficially different types of phonetic and phonological variation" (ibid., p. 211). Our development of the computing model is based on these two branches of phonological description. The purpose of our work is to model temporal overlaps of articulatory features so that phoneme alternations due to contextual influence in continuous speech can be predicted by the application of phonological rules. As the actual pronunciation of an utterance may take a myriad of possible forms, the system must be non-deterministic and probabilistic in nature. This sets the background for the design of our phonological modeling system.

4 Design and Implementation of the Phonological Modeling System

With respect to the above presentation of the problem and the characteristics of phoneme alternation, we set up the following principles for the computational modeling system:

- The rules must be able to state feature identifications of phonemes, and must be able to state agreement of features at different locations.
- The rules must be able to state feature assimilation from some source to some target phonemes.
- The rules must be able to refer to phonemic contexts beyond the immediate adjacency.
- The rules must be able to refer to high-level factors, such as word, sense-group and utterance boundaries, sentence stresses, tone nucleus and syllabic constituents, etc.
- The rules must lead to a non-deterministic or probabilistic system in terms of the output predictions.

Based on the above requirements, a feature-based, cascaded finite-state transducer has been built. This transducer maps an input phoneme symbol string to an output series of feature structures by applying several groups of rules: the output of applying one group of rules is sent as input to the process of applying the next group of rules. The advantage of dividing rules into groups is better organization and higher efficiency. For example, at the start, rules which derive syllable structure and stress patterns are applied. This is followed by rules for gestural overlap. At each stage, only relevant rules are used, and this makes the processing modular and efficient. Basically, there are two generic types of rules: one for deriving supra-segmental structures such as syllabic constituents, boundaries and stresses, and the other for gestural overlap. The rule-writing formalism consists of a regular expression backbone with featural node descriptions on the input side. On the output side, non-determinacy is guaranteed by allowing more than one output expression. The output expressions recognize indexed feature variables and perform unification before instantiating the variables in the output. To put it briefly, a phonological rule takes the form:

LC MP RC → OE

where LC (left context), MP (mapped portion), RC (right context) and OE (output expression) are the four components of a rule. These components can be further specified as follows:

LC := Element*.
RC := Element*.
Element := not(Term) (not this)
         | or(Term+) (one of these)
         | Term* (zero or more of this)
         | Term+ (one or more of this)
         | Term? (zero or one of this).
Term := (F = V)* (feature and value pairs).
F := any atomic symbol.
V := atomic symbol | Term | Indexed Variable.
OE := Term+.

One interesting characteristic of this transducer is that it can change the length of the input string when producing output. This is realized by allowing the MP and OE to have different lengths, and it may be used to model phoneme deletion. If a term is an empty feature structure, then it can match any input feature structure.
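To make the formalism concrete, the following is a minimal Python sketch of how one such rule could be represented and applied to a string of feature structures. The dictionary encoding, the "?"-prefixed indexed variables and the helper names are illustrative assumptions; the sketch ignores the regular-expression operators (*, +, ?, not, or) and matches fixed-length contexts only.

# A feature structure is a flat dict, e.g. {"phn": "n", "cls": "c", "syl": "n"}.
# In a rule term, a value written "?x1" is an indexed variable shared across terms.

def unify(term, fs, bindings):
    """Match one rule term against one feature structure.
    Returns extended bindings, or None if the match fails."""
    new = dict(bindings)
    for feat, val in term.items():
        if isinstance(val, str) and val.startswith("?"):   # indexed variable
            if feat not in fs:
                return None
            if val in new and new[val] != fs[feat]:
                return None
            new[val] = fs[feat]
        elif fs.get(feat) != val:                           # constant must agree exactly
            return None
    return new

def instantiate(term, bindings):
    """Replace indexed variables in an output term by their bound values."""
    return {f: (bindings.get(v, v) if isinstance(v, str) else v)
            for f, v in term.items()}

def apply_rule(rule, string):
    """Apply one LC MP RC -> OE rule left to right over a feature-structure string.
    LC/RC are matched against the original input; MP is assumed non-empty.
    OE may differ in length from MP, which is how phoneme deletion can be modelled."""
    lc, mp, rc, oe = rule["LC"], rule["MP"], rule["RC"], rule["OE"]
    out, i = [], 0
    while i < len(string):
        bindings = None
        if i >= len(lc) and i + len(mp) + len(rc) <= len(string):
            bindings = {}
            window = string[i - len(lc): i + len(mp) + len(rc)]
            for term, fs in zip(lc + mp + rc, window):
                bindings = unify(term, fs, bindings)
                if bindings is None:
                    break
        if bindings is not None:
            out.extend(instantiate(t, bindings) for t in oe)
            i += len(mp)
        else:
            out.append(string[i])
            i += 1
    return out

Under these assumptions, the example rules given below would be encoded as dictionaries whose "LC", "MP", "RC" and "OE" entries are lists of such terms.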

Based on the above rule-writing formalism, rules for deriving high-level structures and for gestural overlap have been implemented. These rules are applied by the cascaded finite-state transducer in several batches to map the input phoneme symbol string into a gestural score representation. Before the rules are applied, the input symbols are changed into initial feature structures by a lexicon lookup process, in which the phonemes are given features such as natural class, canonical articulatory features, etc. The following are a few examples of phonological rules.

1. Rule: "Recognizing a consonant as a syllable coda"
This rule derives boundary and syllabic-constituent features for a consonant when it is at the end of a word:
LC: [syl=n],[syl=c]*
RC: or([bnd=wd],[bnd=ue])
MP: [phn=x1,cls=c]
OE: [phn=x1,bnd=we,sts=0,syl=c]
This rule says that if the left context consists of a phoneme with syllabic-constituent = nucleus, followed by zero or more phonemes with syllabic-constituent = coda, if the right context consists of either a word or an utterance-end boundary, and if the mapped portion consists of a phoneme whose natural class = consonant, then transduce the mapped phoneme by adding boundary = word-end, stress = 0 and syllabic-constituent = coda, keeping the phoneme name unchanged.

2. Rule: "Spreading a lip feature to the left phoneme"
This rule realizes lip feature assimilation:
LC: [syl=o]
RC: [ ]
MP: [phn=x1,syl=n,lip=x2(r),tb=x3]
OE: [phn=x1, lip-dur=10, lip-olp=6, lip=x2, tb-dur=8, tb-olp=5, tb=x3]
This rule says that if the left context consists of a phoneme at the syllable onset position, and if the mapped portion consists of a phoneme at the syllable nucleus position with a lip feature = rounded, then map this phoneme structure into a new feature structure in which gesture overlap parameters are given:
(a) the phoneme name remains unchanged,
(b) the articulatory features (lip and tongue body) remain unchanged,
(c) the lip gesture duration = 10,
(d) the lip gesture overlap (with the left phoneme's gesture) = 6,
(e) the tongue body gesture duration = 8,
(f) the tongue body gesture overlap = 5.

It is to be noted that in the above examples the OE gives only a single output expression. If multiple output possibilities are to be generated, parallel OEs can be expressed there. Figure 4 depicts the gestural scores produced by the transducer (after post-processing to change an internal representation into a geometric representation) for the phrase go to school.

Figure 4: Gestural scores: go to school (gestures on the Lip, T.T., T.B., Vel and Glo tiers plotted against time in seconds).

In Figure 4, the lip = rounded feature is assimilated by /g/, /t/ and /k/. Other phoneme alternation types involving articulator gesture overlap, such as those exemplified in Section 2, can be predicted in a similar manner by the phonological rules described above.

Insofar as these phenomena are due to feature assimilation, the phonological rules can simulate the processes accurately. The abstract machine of the cascaded finite-state transducer functions as a rule applicator on input strings. The basic strategy of rule application is to order the rules so that more specific rules are applied first. "More specific" means that a rule has a larger context scope and a more detailed feature specification. More specific rules have their output expressions given higher preference scores. Generally, we do not envisage a multi-level feature assimilation process, in which a change of feature in some phoneme in turn influences a change of feature in other phonemes. This kind of treatment tends to complicate the system with subtle directional issues and complex conditions. For example, in breathe slowly, if /dh/ changes to /th/ under the influence of /s/, this change may in turn influence the /i:/, so that the latter may change to the short vowel /I/. This is a multi-level alternation process. In our system, we do not let rules behave in this way. Rather, we expand the context so that more than one feature change can be described at once in one rule. This makes the rule application procedure simple and rule writing easier. It is appropriate now to say something about the quantitative aspects of the overlap rules. There are two kinds of parameters which an overlap rule should specify: durations of gestures, and overlaps of gestures with respect to a neighbouring segment. In our system, these numbers come from an annotated articulator trajectory database. The original database is the University of Wisconsin Microbeam X-ray speech database. An annotation system in Java has been created to capture individual gestures on each tier, which provide temporal information to the overlap rules.
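As a rough sketch of how such duration and overlap values could be turned into the geometric gestural-score representation of Figure 4, the following Python fragment lays gestures out on tiers. The interpretation of the overlap value (how far a gesture reaches leftwards into the preceding segment's time span) and the time units are assumptions, since the post-processing is not specified at this level of detail.

def gestural_score(segments):
    """Lay gestures out on articulatory tiers from per-segment duration/overlap values.
    Each segment is {"phn": ..., "gestures": {tier: (duration, overlap)}}, where overlap
    says how far the gesture reaches leftwards into the preceding segment's time span.
    Returns {tier: [(phoneme, start, end), ...]}; time units are arbitrary."""
    score, prev_end = {}, 0
    for seg in segments:
        seg_end = prev_end
        for tier, (dur, olp) in seg["gestures"].items():
            start = max(0, prev_end - olp)
            end = start + dur
            score.setdefault(tier, []).append((seg["phn"], start, end))
            seg_end = max(seg_end, end)
        prev_end = seg_end
    return score

# Hypothetical input: /k/ followed by a rounded /u:/, roughly as produced by the
# "Spreading a lip feature" rule above (lip-dur=10, lip-olp=6, tb-dur=8, tb-olp=5).
print(gestural_score([
    {"phn": "k",  "gestures": {"TB": (8, 0)}},
    {"phn": "uw", "gestures": {"Lip": (10, 6), "TB": (8, 5)}},
]))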

5 Application in Speech Recognition

This system for predicting phoneme alternations in continuous speech has been developed as part of a speech recognition system. The gestural scores derived from an input string of phoneme symbols have direct correspondences with the Hidden Markov Models used in building a speech recognition system. In Hidden Markov Model-based speech recognition, word or phone models are created with linearly ordered states. These states accept speech input signals (transformed into spectrogram coefficients) and output the probabilities that the models can generate such signals. The key to this strategy is that the models have been trained with signals of known linguistic categories, such as a word or a phoneme. Pronunciation modeling in this respect dictates how many and what kind of states each model should have. The important point here is that states in different models will be shared, or tied in HMM terminology. This tying guarantees the efficient use of training data and makes the models more robust. Taking word-based models as an example, if two words have some common phonemes, the models for these two words should have shared states corresponding to the common phonemes. The same is true of phone-based models, because phones can have sub-segments and these can be common to other phones. The gestural scores are used precisely to derive such states. The step from gestural scores to linearly ordered states is carried out by a two-dimensional scan over time and the feature tiers: whenever a change occurs in the five-component feature vector, a new state is obtained. In our experiments with the TIMIT speech database, which has over 6000 word forms in its dictionary and about 7000 sentences in the speech corpus, the overlap rules, which are based on 46 phonemes of American English, can generate around 1500 derived feature vectors. Either word-based or phone-based models can be produced by the phonological modeling system for building a speech recognition system.
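The two-dimensional scan from a gestural score to linearly ordered states can be sketched as follows; this is a minimal Python illustration building on the {tier: [(phoneme, start, end), ...]} layout used in the earlier sketch, and the sampling step and tier names are assumptions.

def states_from_score(score, tiers=("Lip", "TT", "TB", "Vel", "Glo"), step=1):
    """Scan a gestural score over time and start a new state whenever the
    five-component feature vector (one entry per tier) changes."""
    def active(tier, t):
        for label, start, end in score.get(tier, []):
            if start <= t < end:
                return label
        return None                                # no gesture on this tier at time t
    horizon = max((end for gestures in score.values() for _, _, end in gestures), default=0)
    states, prev, t = [], None, 0
    while t < horizon:
        vector = tuple(active(tier, t) for tier in tiers)
        if vector != prev:                         # a change on any tier opens a new state
            states.append(vector)
            prev = vector
        t += step
    return states

# Identical feature vectors arising in different words correspond to shared (tied) states.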

6 Conclusion and Future Work

In this paper we have introduced the development of a phonological modeling system, its objective, theoretical foundation and application in speech recognition. We have developed a computational model of the phoneme alternation phenomenon and aim to use it for speech recognition applications. The theoretical foundations have been laid down solidly. For practical reasons, some simplifications have been made with respect to the feature assimilation process. For example, we dealt with the multi-level alternation problem by using enlarged rule scopes. Perhaps when we have a better understanding of this process, we can implement more sophisticated procedures. At present a basic set of overlap rules has been implemented. Our future work will see the implementation of a more complete rule set. In the meantime, the system will be used for practical experiments in speech recognition system building. In this respect, we hope the results will further improve our understanding of the phoneme alternation process and give us more hints as to the various types of alternations and their conditions. Currently our main attention is directed to assimilation as the major cause of alternation. Future work will investigate more causes of alternation and aim at more complete simulations.

References

S. Bird: "Computational Phonology: A Constraint-Based Approach." Cambridge University Press, 1995.
S. Bird and T. M. Ellison: "One-level phonology: autosegmental representations and rules as finite automata." Computational Linguistics, 20, 55-90.
S. Bird and D. R. Ladd: "Presenting autosegmental phonology." Journal of Linguistics, 27, 193-210.
C. P. Browman and L. Goldstein: "Articulatory gestures as phonological units." Phonology (6), 1989, pp. 201-251.
C. P. Browman, L. Goldstein, E. Saltzman and C. Smith: "A computational model for speech production using dynamically defined articulatory gestures." JASA 80, S97.
K. W. Church: "Phonological Parsing in Speech Recognition." Kluwer Academic Publishers, 1987.
L. Deng and D. Sun: "A statistical approach to ASR using atomic units constructed from overlapping articulatory features." J. Acoust. Soc. Am., Vol. 95, 1994, pp. 2702-2719.
L. Deng: "Autosegmental representation of phonological units of speech and its phonetic interface." Speech Communication, Vol. 23, No. 3, 1997, pp. 211-222.
A. C. Gimson: "An Introduction to the Pronunciation of English." Edward Arnold, 1980.
J. A. Goldsmith: "Autosegmental and Metrical Phonology." Blackwell, 1990.
C. Gussenhoven: "Understanding Phonology." Arnold, 1998.
J. Harris: "English Sound Structure." Blackwell, 1994.
J. Sun and L. Deng: "Phonological rules for overlapping articulatory features in speech recognition." Proceedings of the Conference of the Pacific Association for Computational Linguistics, Waterloo, 1999, pp. 67-81.
J. A. S. Kelso and B. Tuller: "Intrinsic time in speech production: theory, methodology and preliminary observations." In E. Keller and M. Gopnik (eds.), Motor and Sensory Processes of Language. Lawrence Erlbaum, 203-222.