Expression control using synthetic speech

Brian Wyvill ([email protected]) and David R. Hill ([email protected])
Department of Computer Science, University of Calgary, 2500 University Drive N.W., Calgary, Alberta, Canada, T2N 1N4

Abstract

This tutorial paper presents a practical guide to animating facial expressions synchronised to a rule-based speech synthesiser. A description of speech synthesis by rules is given, together with the derivation of the set of parameters that drives both the speech synthesis and the graphics. An example animation is described, along with the outstanding problems.

Key words: Computer Graphics, Animation, Speech Synthesis, Face Animation.

© ACM, 1989. This is the authors' version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published as Course #22 of the Tutorial Section of ACM SIGGRAPH 89, Boston, Massachusetts, 31 July - 4 August 1989. DOI unknown.
Note (drh 2008): Appendix A was added, after publication of these tutorial notes by the ACM, to flesh out some details of the parameter synthesis and to provide a more complete acoustic parameter table (the original garbled table headings have been corrected in the original paper text that follows, but the data was still incomplete—intended for discussion in a tutorial). Fairly soon after the animation work using the formant synthesiser was finished, a completely new articulatory speech synthesis system was developed by one of the authors and his colleagues. This system uses an acoustic tube model of the human vocal tract, with associated new posture databases (cast in terms of tube radii and excitation), new rules, and so on. Originally a technology spin-off company product, the system was the first complete articulatory real-time text-to-speech synthesis system in the world and was described in [Hill 95]. All the software is now available from the GNU project gnuspeech under the General Public Licence. Originally developed on the NeXT computer, much of the system has since been ported to the Macintosh under OS X, and work on a GNU/Linux version running under GNUstep is well under way (http://savannah.gnu.org/projects/gnuspeech).

Reference

[Hill 95] David R. Hill, Leonard Manzara, and Craig-Richard Schock. Real-time articulatory speech-synthesis-by-rules. Proc. AVIOS '95, the 14th Annual International Voice Technologies Applications Conference of the American Voice I/O Society, San Jose, September 11-14 1995, AVIOS: San Jose, 27-44.
1 Motivation

In traditional hand animation, synchronisation between graphics and speech has been achieved through a tedious process of analysing a speech sound track and drawing corresponding mouth positions (and expressions) at key frames. To achieve a more realistic correspondence, a live actor may be filmed to obtain the correct mouth positions. This method produces good results but must be repeated for each new speech; it is time consuming and requires a great deal of specialised skill on the part of the animator. A common approach to computer animation uses a similar analysis to derive key sounds, from which parameters to drive a face model can be found (see [Parke 74]). Such an approach to animation is more flexible than the traditional hand method, since the parameters to drive such a face model correspond to the key measurements available directly from the photographs, rather than requiring the animator to design each expression as needed. However, the process is not automatic, requiring tedious manual procedures for recording and measuring the actor. In our research we were interested in finding a fully automatic way of producing an animated face to match speech. Given a recording of an actor speaking the appropriate script, it might seem possible to design a machine procedure to recognise the individual sounds and to use acoustic-phonetic and articulatory rules to derive sets of parameters to drive the Parke face model. However, this would require a more sophisticated speech recognition program than is currently available. The simplest way for a computer animator to interact with such a system would be to type in a line of text and have the synthesised speech and expressions generated automatically. This was the approach we decided to try. Given the still incomplete state of knowledge concerning speech synthesis by rules, we wanted to allow some audio editing of the initial input to improve the speech quality, with the corresponding changes to the expressions being made automatically. Synthetic speech by rules was the most appropriate choice since it can be generated from keyboard input; it is a very general approach which lends itself to purely automatic generation of speech animation. The major drawback is that speech synthesised in this manner is far from perfect.
2 Background

2.1 The Basis for Synthesis by Rules

Acoustic-phonetic research into the composition of spoken English during the 1950s and 60s led to the determination of the basic acoustic cues associated with the forty or so sound classes. This early research was conducted at the Haskins Laboratories in the US and elsewhere worldwide. The sound classes are by no means homogeneous, and we still do not have complete knowledge of all the variations and their causes. However, broadly speaking, each sound class can be identified with a configuration of the vocal organs in making sounds in the class. We shall refer to this as a speech posture. Thus, if the jaw is rotated a certain amount, and the lips held in a particular position, with the tongue hump moved high or low, and back or forward, a vowel-like noise can be produced that is characterised by the energy distribution in the frequency domain. This distribution contains peaks, corresponding to the resonances of the tube-like vocal tract, called formants. As the speaker articulates different sounds (the speech posture is thus varying dynamically and continuously), the peaks move up and down the frequency scale, and the sound emitted changes. Figure 1 shows the parts of the articulatory system involved with speech production.
Figure 1: The Human Vocal Apparatus
2.2 Vowel and Consonant Sounds

The movements are relatively slow during vowel and vowel-like articulations, but are often much faster in consonant articulations, especially for plosive sounds like /b, d, g, p, t, and k/ (these are more commonly called the stop consonants). The nasal sounds /m, n/ and the sound at the end of "running"—/ŋ/, are articulated very much like the plosive sounds, and not only involve quite rapid shifts in formant frequencies but also a sudden change in general spectral quality, because the nasal passage is very quickly connected and disconnected for nasal articulation by the valve towards the back of the mouth that is formed by the soft palate (the velum)—hence the phrase "nasal sounds". Various hiss-like noises are associated with many consonants because consonants are distinguished from vowels chiefly by a higher degree of constriction in the vocal tract (completely stopped in the case of the stop consonants). This means that either during, or just after, the articulation of a consonant, air from the lungs is rushing through a relatively narrow opening, in turbulent flow, generating random noise (sounds like /s/, or /f/). Whispered speech also involves turbulent airflow noise as the sound medium, but, since the turbulence occurs early in the vocal flow, it is shaped by the resonances and assumes many of the qualities of ordinarily spoken sounds.
2.3 Voiced and Voiceless

When a sound is articulated, the vocal folds situated in the larynx may be wide open and relaxed, or held under tension. In the second case they will vibrate, imposing a periodic flow pattern on the rush of air from the lungs (and making a noise much like a raspberry blown under similar conditions at the lips). However, the energy in the noise from the vocal folds is redistributed by the resonant properties of the vocal and nasal tracts, so that it doesn't sound like a raspberry by the time it gets out. Sounds in which the vocal folds are vibrating are termed voiced. Other sounds are termed voiceless, although some further qualification is needed.
It is reasonable to say that the word cat is made up of the sounds /k æ t/. However, although a sustained /æ/ can be produced, a sustained /k/ or /t/ cannot. Although stop sounds are articulated as speech postures, the cues that allow us to hear them occur as a result of their environment. When the characteristic posture of /t/ is formed, no sound is heard at all: the stop gap, or silence, is only heard as a result of the noises on either side, especially the formant transitions (see 2.4 below). The sounds /t/ and /d/ differ only in that the vocal folds vibrate during the /d/ posture, but not during the /t/ posture. The /t/ is a voiceless alveolar stop, whereas the /d/ is a voiced alveolar stop, the alveolar ridge being the point of maximum constriction within the vocal tract, known as the place of articulation. The /k/ is a voiceless velar stop.
2.4 Aspiration

When a voiceless stop is articulated in normal speech, the vocal folds do not begin vibrating immediately on release. Thus, after the noise burst associated with release, there is a period when air continues to rush out, producing the same effect as whispered speech for a short time (a little longer than the initial noise burst of release). This whisper noise is called aspiration, and is much stronger in some contexts and situations than others. At this time the articulators are moving, and, as a result, so are the formants. These relatively rapid movements are called formant transitions and are, as the Haskins Laboratories researchers demonstrated, a powerful cue to the place of articulation. Again, these powerful cues fall mainly outside the time range conventionally associated with the consonant posture articulation itself (the quasi-steady state, or QSS).
2.5 Synthesis by Rules

The first speech synthesiser that modelled the vocal tract was the so-called Parametric Artificial Talker (PAT), invented by Walter Lawrence of the Signals Research and Development Establishment (SRDE), a government laboratory in Britain, in the 1950s. This device modelled the resonances of the vocal tract (only the lowest three needed to be variable for good modelling), plus the various energy sources (periodic or random) and the spectral characteristics of the noise bursts and aspiration. Other formulations can serve as a basis for synthesising speech (for example, Linear Predictive Coding—LPC), but PAT was not only the first, it is also more readily linked to the real vocal apparatus than most, and the acoustic cue basis is essentially the same for all of them. It has to be, since the illusion of speech will only be produced if the correct perceptually relevant acoustic cues are present in sufficient number. Speech may be produced from such a synthesiser by analysing real speech to obtain appropriate parameter values, and then using them to drive the synthesiser; this is merely a sophisticated form of compressed recording. It is difficult to analyse speech automatically for the parameters needed to drive synthesisers like PAT, but LPC compression and resynthesis is extremely effective, and serves as the basis of many modern voice response systems. It is speech by copying, however: it always requires preknowledge of what will be said, it contains all the variability of real speech and, more importantly, it is hard to link directly to articulation. A full treatment of speech analysis is given in [Witten 82].
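To make the resonance idea concrete, the following minimal sketch (ours, not PAT itself) drives a cascade of second-order resonators with a simple impulse-train voicing source. The formant targets are taken from Table 1 below; the bandwidths, fundamental frequency and sample rate are our own assumptions.

```python
import math

def resonator(signal, fc, bw, fs):
    """Second-order IIR resonator: the standard formant filter section."""
    c = -math.exp(-2.0 * math.pi * bw / fs)
    b = 2.0 * math.exp(-math.pi * bw / fs) * math.cos(2.0 * math.pi * fc / fs)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def vowel(f1, f2, f3, f0=110.0, dur=0.4, fs=16000):
    """Impulse-train 'voicing' passed through cascaded formant resonators."""
    n = int(dur * fs)
    period = int(fs / f0)
    source = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    s = source
    for fc, bw in ((f1, 60.0), (f2, 90.0), (f3, 120.0)):  # bandwidths assumed
        s = resonator(s, fc, bw, fs)
    peak = max(abs(v) for v in s) or 1.0
    return [v / peak for v in s]

# /aa/ as in "bad", using the F1-F3 targets from Table 1 (750, 1750, 2600 Hz)
samples = vowel(750.0, 1750.0, 2600.0)
```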
2.6 Speech Postures and the Face

It is possible, given a specification of the postures (i.e. sound classes) in an intended utterance, to generate the parameters needed to drive a synthesiser entirely algorithmically, i.e. by rules, without reference to any real utterance. This is the basis of our approach. The target values of the parameters for all the postures are stored in a table (see Table 1), and a simple interpolation procedure is written to mimic the course of variation from one target to the next, according to the class of posture involved. Appropriate noise bursts and energy source changes can also be computed. It should be noted that the values in the table are relevant to the Hill speech structure model (see [Hill 78]). Since the sounds and sound changes result directly from movements of the articulators, and some of these are what cause changes in facial expression (e.g. lip opening, jaw rotation, etc.), we felt that our program for speech synthesis by rule could easily be extended by adding a few additional entries for each posture to control the relevant parameters of Parke's face model.

Figure 2: Upper lip control points
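As an illustration only (not the actual rule program), the sketch below steps from one posture's stored targets to the next at the 2 ms parameter interval used by the system. The posture entries, parameter names and values here are invented for the example, and plain linear interpolation stands in for the class-dependent transition shaping described above.

```python
# Hypothetical posture targets in the spirit of Table 1 (values illustrative only).
TARGETS = {
    "aa": {"f1": 750, "f2": 1750, "jaw": 10},
    "uu": {"f1": 310, "f2": 940,  "jaw": 5},
    "m":  {"f1": 190, "f2": 690,  "jaw": 0},
}

def interpolate(p_from, p_to, steps):
    """Step every parameter linearly from one posture target to the next."""
    a, b = TARGETS[p_from], TARGETS[p_to]
    frames = []
    for i in range(steps):
        t = i / float(steps - 1)
        frames.append({k: a[k] + t * (b[k] - a[k]) for k in a})
    return frames

# Parameter frames every 2 ms across a 40 ms movement from /aa/ to /uu/:
track = interpolate("aa", "uu", steps=20)
```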
2.7 Face Parameters

The parameters for the facial movements directly related to speech articulation are currently those specified by Fred Parke. They comprise: jaw rotation; mouth width; mouth expression; lip protrusion; /f/ and /v/ lip tuck; upper lip position; and the x, y and z co-ordinates of one of the two mouth corners (assuming symmetry, which is an approximation). The tongue is not represented, nor are other possible body movements associated with speech. The parameters are mapped onto a group of mesh vertices with appropriate scale factors which weight the effect of the parameter. An example of the polygon mesh representing the mouth is illustrated in Figure 2.
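A hedged sketch of the kind of weighted parameter-to-vertex mapping described above; the data structure, vertex indices and weights are our own illustration, not Parke's actual implementation (which, for jaw rotation, rotates rather than translates the affected vertices).

```python
from dataclasses import dataclass

@dataclass
class VertexGroup:
    indices: list      # vertices affected by this parameter
    weights: list      # per-vertex scale factors
    direction: tuple   # displacement direction (x, y, z)

def apply_parameter(vertices, group, value):
    """Displace each vertex in the group by value * weight along direction."""
    dx, dy, dz = group.direction
    for idx, w in zip(group.indices, group.weights):
        x, y, z = vertices[idx]
        vertices[idx] = (x + value * w * dx,
                         y + value * w * dy,
                         z + value * w * dz)

# e.g. a crude "jaw opening" approximated as a weighted downward displacement
chin = VertexGroup(indices=[12, 13, 14], weights=[1.0, 0.8, 0.5], direction=(0, -1, 0))
mesh = {12: (0.0, -1.0, 0.2), 13: (0.1, -1.0, 0.2), 14: (0.2, -0.9, 0.2)}
apply_parameter(mesh, chin, value=0.3)
```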
3 Interfacing Speech and Animation

The system is shown diagrammatically in Figure 3. Note that in the diagram the final output is to film; modifications for direct video output are discussed below. The user inputs text, which is automatically translated into phonetic symbols defining the utterance(s). The system also reads various other databases. These are: models of spoken rhythm and intonation; tabular data defining the target values of the parameters for various articulatory postures (both for facial expression and for the acoustic signal); and a set of composition rules that provide appropriate modelling of the natural movement from one target to the next. The composition program is also capable of taking account of special events like bursts of noise, or suppression of voicing.

Figure 3: Animated speech to film—system overview
3.1 Parametric Output

The output of the system comprises sets of 18 values defining the 18 parameters at successive 2 millisecond intervals. The speech parameters are sent directly to the speech synthesiser, which produces the synthetic speech output; this is recorded to provide the sound track. The ten face parameters controlling the jaw and lips are described in [Hill 88] and are taken directly from the Parke face model. Table 2 shows the values associated with each posture.
3.2 Face Parameters

The facial parameter data is stored in a file and processed by a converter program. The purpose of this program is to convert the once-per-two-millisecond sampling rate to a once-per-frame-time sampling rate, based on the known number of frames determined from the magnetic film sound track. The conversion is done by linear interpolation of the parameters and resampling. The conversion factor is determined by relating the number of two-millisecond samples to the number of frames recorded in the previous step. This allows for imperfections in the speed control of our equipment. In practice, calculations based on measuring lengths on the original audio tape have proved equivalent and repeatable for the short lengths we have dealt with so far; production equipment would be run to standards that avoided this problem for arbitrary lengths. The resampled parameters are fed to a scripted facial rendering system (part of the Graphicsland animation system [Wyvill 86]). The script controls object rotation and viewpoint parameters, whilst the expression parameters control variations in the polygon mesh in the vicinity of the mouth, producing lip, mouth and jaw movements in the final images, one per frame. A sequence of frames covering the whole utterance is rendered and stored on disc. Real-time rendering of the face would be possible with this scheme, given a better workstation.
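The conversion step amounts to linear interpolation and resampling of each parameter track. A minimal sketch (our own reconstruction, not the converter program itself) is:

```python
def resample_tracks(frames_2ms, n_output_frames):
    """Resample parameter frames taken every 2 ms to one set per picture frame.

    frames_2ms: list of per-2ms parameter vectors (lists of floats).
    n_output_frames: number of picture frames counted from the sound track.
    """
    n_in = len(frames_2ms)
    out = []
    for f in range(n_output_frames):
        # position of this picture frame on the 2 ms timeline
        pos = f * (n_in - 1) / float(max(n_output_frames - 1, 1))
        i = int(pos)
        t = pos - i
        a = frames_2ms[i]
        b = frames_2ms[min(i + 1, n_in - 1)]
        out.append([av + t * (bv - av) for av, bv in zip(a, b)])
    return out

# e.g. a 1-second utterance: 500 samples at 2 ms resampled to 24 film frames
film_frames = resample_tracks([[i * 0.01, 2.0] for i in range(500)], 24)
```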
3.3 Output to film

The sound track, preferably recorded at 15 ips on a full-track tape (e.g. using a NAGRA tape recorder), is formatted by splicing to provide a level-setting noise (the loudest vowel sound, as in gore) for a few feet; a one-frame-time 1000 Hz tone burst for synchronisation; a 23-frame-time silence; and then the actual sound track noises. The sound track is transferred to magnetic film ready for editing. The actual number of frames occupied by the speech is determined for use in dealing with the facial parameters. Getting the 1000 Hz tone burst exactly aligned within a frame is a problem; we made the tone about 1.5 frames in length to allow for displacement in transferring to the magnetic film. The stored images are converted to film, one frame at a time. After processing, the film and magnetic film soundtrack are edited in the normal way to produce material suitable for transfer. The process fixes edit synch between the sound track and picture based on synch marks placed ahead of the utterance material. The edited media are sent for transfer, which composes picture and sound tracks onto standard film.
3.4 Video Output

Many animators have direct workstation-to-video output devices. Experience has shown that the speed control on our equipment is better than anticipated, so our present procedure assumes that speed control is adequate for separate audio and video recordings made according to the procedures outlined above, with straight dubbing onto the video carried out in real time once the video has been completed. With enough computer power to operate in real time, it would be feasible to record sound and image simultaneously.
4 Sampling Problems

Some form of temporal anti-aliasing might seem desirable, since the speech parameters are sampled at 2 ms intervals but the facial parameters only every 41.67 ms (film at 24 frames/sec) or 33.33 ms (video at 30 frames/sec). In practice anti-aliasing does not appear to be needed. Indeed, the wrong kind of anti-aliasing could have a very negative effect, by suppressing facial movements altogether. Possibly it would be better to motion-blur the image directly, rather than anti-aliasing the parameter track definitions. However, this is not simple, as the algorithm would have to keep track of individual pixel movements and processing could become very time consuming. The simplest approach seems quite convincing, as may be seen in the demonstration film.
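The two options can be contrasted in a small sketch of our own: point-sampling the 2 ms track at each frame time, versus a crude box-filter average over the frame interval, which is the kind of "wrong" anti-aliasing that can flatten a fast lip gesture.

```python
def point_sample(track_2ms, frame_ms=41.67):
    """Pick the nearest 2 ms value at each frame time (the approach we use)."""
    step = frame_ms / 2.0
    n = int(len(track_2ms) / step)
    return [track_2ms[int(round(i * step))] for i in range(n)]

def box_filtered(track_2ms, frame_ms=41.67):
    """Average all 2 ms values within each frame interval (a crude low-pass).

    On a rapid open-close lip gesture this averaging can remove the movement
    almost entirely, which is the risk noted above.
    """
    step = frame_ms / 2.0
    n = int(len(track_2ms) / step)
    out = []
    for i in range(n):
        lo, hi = int(i * step), int((i + 1) * step)
        window = track_2ms[lo:hi] or [track_2ms[-1]]
        out.append(sum(window) / len(window))
    return out
```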
5 A practical example

5.1 The animated speech process

The first step in manufacturing animated speech with our system is to enter text to the speech program, which computes the parameters from the posture target-values table.

Text entered at keyboard: speak to me now bad kangaroo
Input as (phonetic representation): s p ee k t u m i n ah uu b aa d k aa ng g uh r uu

Continuous parameters are generated automatically from the discrete target data to drive the face animation and the synthesised speech. The speech may be altered by editing the parameters interactively until the desired speech is obtained; this requires a good degree of skill and experience to achieve human-like speech. At this point all of the parameters are available for editing. Figure 4 shows a typical screen from the speech editor. The vertical lines represent the different parts of the diphones (transitions between two postures). Three of the eight speech parameters (the three lowest formants) are shown in the diagram. Altering parameters with the graphical editor alters the positions of the formant peaks, which in turn changes the sound made by the synthesiser, or, for the face parameters, the appearance of the face. Although the graphical editor facilitates this process, obtaining the desired sound requires some skill on the part of the operator. It should be noted that the editing facility is designed as a research tool. In the diagram seven postures are shown, giving six diphones. In fact the system outputs posture information every 2 ms, and these values are then resampled at each frame time. The posture information is particular to the Parke face model. A full list of parameters corresponding to the phonetic symbols is given in Table 2, as noted above.
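For reference, the posture string above can be treated as a simple sequence from which the diphones follow directly; the leading and trailing silence postures in this small sketch are our own assumption.

```python
# the example utterance as a posture sequence (silence assumed at each end)
postures = ["^"] + "s p ee k t u m i n ah uu b aa d k aa ng g uh r uu".split() + ["^"]

# consecutive posture pairs are the diphones the parameter editor displays
diphones = list(zip(postures, postures[1:]))
print(len(diphones), diphones[:3])
```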
Figure 4: Three of the speech parameters for “Speak to me now, bad kangaroo.”
5.2 Historical Significance of "Speak to me now ..."

The phrase "Speak to me now, bad kangaroo" was chosen for an initial demonstration of our system for historical reasons. It was the first utterance synthesised by rule by David Hill at the Edinburgh University Department of Phonetics and Linguistics in 1964. It was chosen because it was unusual (and therefore hard to hear), and incorporated a good phonetic variety, especially in terms of the stop sounds and nasals, which were of particular interest. At that time the parameters to control the synthesiser were derived from analog representations in metal ink that picked up voltages from a device resembling a clothes wringer (a "mangle"), in which one roller was wound with a resistive coil of wire which impressed track voltages proportional to the displacement of silver-ink tracks on a looped mylar sheet. The tracks were made continuous with big globs of the silver ink that also conveyed the voltages through perforations to straight pick-off tracks for the synthesiser controller that ran along the back of the sheet. When these blobs ran under the roller, a violent perturbation of the voltages on all tracks occurred. However, the synthesiser was incapable of producing a non-vocalic noise so, instead of some electrical cacophony, the result was a most pleasing and natural belch.
6 Conclusion

This paper has presented a practical guide to using the speech-by-rules method to produce input to the Graphicsland animation system. Our approach to automatic lip synch is based on artificial speech synthesised by simple rules, extended to produce not only the varying parameters needed for acoustic synthesis, but also similar parameters to control the visual attributes of articulation as seen in a rendered polygon mesh face (face courtesy of Fred Parke). This joint production process guarantees perfect synchronisation between the lips and other components of facial expression related to speech, and the sound of the speech. The chief limitations are: the less-than-perfect quality of the synthesised speech; the need for more accurate and more detailed facial expression data; and the need for a more natural face, embodying the physical motion constraints of real faces, probably based on muscle-oriented modelling techniques (e.g. [Waters 87]). Future work will tackle these and other topics, including the extension of parameter control to achieve the needed speech/body motion synchrony discussed in [Hill 88].
7 Acknowledgements

The following graduate students and assistants worked hard in the preparation of software and video material shown during the presentation: Craig Schock, Corine Jansonius, Trevor Paquette, Richard Esau and Larry Kamieniecki. We would also like to thank Fred Parke, who gave us his original software and data, and Andrew Pearce for his past contributions. This research is partially supported by grants from the Natural Sciences and Engineering Research Council of Canada.
| Symbol | IPA | SST | TT | SST | TT | Flag | AX | F1 | F2 | F3 | AH2 | FH2 | AH1 |
| ^ | (silence) | 0 | 0 | 32 | 5 | 4 | 0 | 490 | 1480 | 2500 | 0 | 6000 | 0 |
| H | h | 0 | 0 | 16 | 5 | 0 | 0 | 490 | 1480 | 2540 | 0 | 4000 | 4 |
| UH | ə | 0 | 0 | 2 | 7 | 2 | 47 | 490 | 1480 | 2500 | 0 | 2500 | 0 |
| A | ʌ | 0 | 0 | 4 | 7 | 2 | 47 | 720 | 1240 | 2540 | 0 | 2500 | 0 |
| E | ɛ | 0 | 0 | 4 | 7 | 2 | 47 | 560 | 1970 | 2640 | 0 | 2500 | 0 |
| I | ɩ | 0 | 0 | 4 | 7 | 2 | 47 | 360 | 2100 | 2700 | 0 | 2500 | 0 |
| O | ɒ | 0 | 0 | 18 | 7 | 2 | 47 | 600 | 890 | 2600 | 0 | 2500 | 0 |
| U | ɷ | 0 | 0 | 2 | 7 | 2 | 47 | 380 | 950 | 2440 | 0 | 2500 | 0 |
| AA | æ | 0 | 0 | 4 | 7 | 1 | 47 | 750 | 1750 | 2600 | 0 | 2500 | 0 |
| EE | i | 0 | 0 | 4 | 5 | 1 | 47 | 290 | 2270 | 3090 | 0 | 2500 | 0 |
| ER | ɜ | 0 | 0 | 3 | 5 | 1 | 47 | 580 | 1380 | 2440 | 0 | 2500 | 0 |
| AR | ɑ | 0 | 0 | 2 | 5 | 1 | 47 | 680 | 1080 | 2540 | 0 | 2500 | 0 |
| AW | ɔ | 0 | 0 | 18 | 5 | 1 | 47 | 450 | 740 | 2640 | 0 | 2500 | 0 |
| UU | u | 0 | 0 | 18 | 5 | 1 | 47 | 310 | 940 | 2320 | 0 | 2500 | 0 |
| R | r | 0 | 0 | 44 | 5 | 8 | 47 | 240 | 1190 | 1500 | 0 | 2500 | 0 |
| W | w | 0 | 0 | 47 | 5 | 8 | 47 | 240 | 650 | 2500 | 0 | 3000 | 0 |
| L | l | 0 | 0 | 44 | 5 | 8 | 47 | 380 | 1190 | 3000 | 0 | 3500 | 0 |
| Y | j | 0 | 0 | 43 | 5 | 8 | 47 | 240 | 2270 | 3000 | 0 | 3500 | 0 |
| M | m | 0 | 1 | 39 | 4 | 0 | 47 | 190 | 690 | 2000 | 0 | 2000 | 0 |
| B | b | 32 | 5 | 39 | 4 | 16 | 47 | 100 | 690 | 2000 | 0 | 2000 | 0 |
| P | p | 48 | 5 | 39 | 4 | 16 | 0 | 100 | 690 | 2000 | 0 | 2000 | 0 |
| N | n | 0 | 1 | 36 | 4 | 0 | 47 | 190 | 1780 | 3300 | 0 | 6000 | 0 |
| D | d | 32 | 5 | 36 | 4 | 16 | 47 | 100 | 1780 | 3300 | 0 | 6000 | 0 |
| T | t | 48 | 5 | 36 | 4 | 16 | 0 | 100 | 1780 | 3300 | 0 | 6000 | 0 |
| NG | ŋ | 0 | 1 | 34 | 4 | 0 | 47 | 190 | 2300 | 2500 | 0 | 2300 | 0 |
| G | g | 32 | 7 | 34 | 4 | 16 | 47 | 100 | 2300 | 2500 | 0 | 2300 | 0 |
| K | k | 48 | 7 | 34 | 4 | 16 | 0 | 100 | 2300 | 2500 | 0 | 2000 | 0 |
| S | s | 0 | 0 | 36 | 4 | 0 | 0 | 190 | 1780 | 3300 | 12 | 6000 | 0 |
| Z | z | 0 | 0 | 36 | 5 | 0 | 40 | 190 | 1780 | 3300 | 12 | 6000 | 0 |
| SH | ʃ | 0 | 8 | 35 | 4 | 0 | 0 | 190 | 2120 | 2700 | 16 | 2300 | 12 |
| ZH | ʒ | 0 | 12 | 35 | 5 | 0 | 40 | 190 | 2120 | 2700 | 16 | 2300 | 12 |
| F | f | 0 | 8 | 38 | 4 | 0 | 0 | 190 | 690 | 3300 | 5 | 4000 | 12 |
| V | v | 0 | 12 | 38 | 5 | 0 | 40 | 190 | 690 | 3300 | 5 | 4000 | 12 |
| TH | ɵ | 0 | 0 | 37 | 4 | 0 | 0 | 190 | 1780 | 3300 | 3 | 5000 | 0 |
| DH | ð | 0 | 0 | 37 | 5 | 0 | 40 | 190 | 1780 | 3300 | 3 | 5000 | 0 |
| CH | ʧ | 24 | 5 | 35 | 4 | 16 | 0 | 190 | 2120 | 2700 | 16 | 2300 | 0 |
| J | ʤ | 8 | 5 | 35 | 5 | 16 | 20 | 190 | 2120 | 2700 | 16 | 2300 | 0 |

Table 1: Rule Table for Speech Postures needed for synthesis (male voice). Column headings are reproduced as printed in the original.
[Author’s post-publication note: This table is incomplete and hard to follow, even with the headings ungarbled from the original. Please see Appendix A for more complete information]
| Posture | IPA | 4 | 12 | 14 | 15 | 16 | 17 | 18 | 21 | 22 | 47 | 48 |
| ^ | (silence) | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| H | h | 5 | 1.3 | 0 | 10 | -15 | 10 | 0 | -3 | 4 | 0 | 0 |
| UH | ə | 6 | 1 | 7 | 10 | -8 | 5 | 5 | -5 | 5 | 0 | 0 |
| A | ʌ | 7 | 1.1 | 0 | 25 | -7 | 8 | -3 | -6 | 4 | 0 | 0 |
| E | ɛ | 8 | 1.15 | 5 | 3 | -10 | 15 | -2 | -1 | 7 | 0 | 0 |
| I | ɩ | 4 | 1.15 | 5 | 3 | -15 | 15 | 5 | -1 | 6 | 0 | 0 |
| O | ɒ | 10 | 1 | 2 | 20 | 0 | 0 | -5 | 0 | 2 | 0 | 0 |
| U | ɷ | 3 | 0.85 | 15 | 0 | 15 | -5 | 3 | 0 | 7 | 0 | 0 |
| AA | æ | 10 | 1.15 | 0 | 2 | -3 | 20 | -5 | 0 | 0 | 0 | 0 |
| AH | a | 5 | 1.1 | 2 | 25 | -10 | 1 | 0 | 0 | 10 | 0 | 0 |
| EE | i | 3 | 1.2 | 0 | 0 | -10 | 25 | 2 | -4 | 2 | 0 | 0 |
| ER | ɜ | 3 | 0.9 | 9 | 0 | 5 | -5 | 1 | 0 | 7 | 7 | 7 |
| AR | ɑ | 11 | 1 | 2 | 25 | -8 | -2 | 0 | 0 | 3 | 0 | 0 |
| AW | ɔ | 11 | 1 | 2 | 25 | -8 | -2 | 0 | 0 | 3 | 0 | 0 |
| UU | u | 5 | 0.8 | 30 | 25 | 15 | -20 | 0 | 0 | 5 | 0 | 0 |
| R | r | 3 | 0.9 | 9 | 0 | 5 | -5 | 1 | 0 | 7 | 0 | 0 |
| W | w | 5 | 0.85 | 15 | 25 | 15 | -18 | 0 | -5 | 5 | 0 | 0 |
| L | l | 5 | 1.15 | 0 | 3 | -15 | 18 | 3 | -5 | 5 | 0 | 0 |
| LL | ɫ | 5 | 1.15 | 0 | 3 | -15 | 18 | 3 | -5 | 5 | 0 | 0 |
| Y | j | 1 | 1.15 | 0 | 10 | -15 | 18 | 3 | -5 | 10 | 0 | 0 |
| M | m | 0 | 1.15 | -1 | 0 | -15 | 18 | 3 | -8 | 2 | 0 | 0 |
| B | b | 0 | 1 | -1 | 0 | -15 | 2 | 0 | -8 | 2 | 0 | 0 |
| P | p | 0 | 1 | -1 | 0 | -15 | 2 | 0 | -8 | 2 | 0 | 0 |
| N | n | 0 | 1.2 | 0 | 8 | -15 | 12 | 0 | -5 | 10 | 0 | 0 |
| D | d | 0 | 1.2 | 0 | 5 | -15 | 12 | 3 | -8 | 13 | 0 | 0 |
| T | t | 2 | 1.2 | 0 | 5 | -15 | 12 | 3 | -8 | 13 | 0 | 0 |
| NG | ŋ | 4 | 1.15 | 0 | 10 | -15 | 18 | 3 | -5 | 10 | 0 | 0 |
| G | g | 6 | 1.2 | 0 | 5 | -15 | 12 | 3 | -8 | 13 | 0 | 0 |
| K | k | 5 | 1.2 | 0 | 5 | -15 | 12 | 3 | -8 | 13 | 0 | 0 |
| S | s | 1 | 1.2 | 0 | 10 | -15 | 18 | 5 | -8 | 16 | 0 | 0 |
| Z | z | 1 | 1.2 | 0 | 10 | -15 | 18 | 5 | -8 | 20 | 0 | 0 |
| SH | ʃ | 1 | 0.85 | 25 | 0 | -5 | -5 | 0 | 1 | 20 | 0 | 0 |
| ZH | ʒ | 1 | 0.8 | 27 | 0 | -7 | -7 | 0 | 1 | 20 | 0 | 0 |
| F | f | 2 | 1 | 0 | 2 | -5 | 3 | 0 | -20 | 10 | 0 | 0 |
| V | v | 2 | 1 | 0 | 2 | -5 | 3 | 0 | -20 | 10 | 0 | 0 |
| TH | ɵ | 6 | 1.15 | 0 | 0 | -15 | 19 | 0 | 0 | 4 | 0 | 0 |
| DH | ð | 6 | 1.15 | 0 | 0 | -15 | 19 | 0 | 0 | 4 | 0 | 0 |
| CH | ʧ | 3 | 0.85 | 30 | 0 | -7 | -7 | 0 | 1 | 20 | 0 | 0 |
| J | ʤ | 3 | 0.85 | 30 | 0 | -7 | -7 | 0 | 1 | 20 | 0 | 0 |

Table 2: Face Parameter Table for Phonetic Values (the numeric column headings are face parameter numbers)
References

[Hill 78] David Hill. A program structure for event-based speech synthesis by rules within a flexible segmental framework. Int. J. Man-Machine Studies, 3(10):285-294, May 1978.

[Hill 88] David Hill, Brian Wyvill, and Andrew Pearce. Animating speech: an automatic approach using speech synthesis by rules. The Visual Computer, 3(5):277-289, 1988.

[Parke 74] Fred I. Parke. A Parametric Model for Human Faces. PhD dissertation, University of Utah, Dept. of Computer Science, Dec 1974.

[Waters 87] Keith Waters. A muscle model for animating three-dimensional facial expression. In Computer Graphics, volume 21. ACM SIGGRAPH, July 1987.

[Witten 82] I. H. Witten. Principles of Computer Speech. Academic Press, London, England, 1982.

[Wyvill 86] Brian Wyvill, Craig McPheeters, and Rick Garbutt. The University of Calgary 3D computer animation system. Journal of the Society of Motion Picture and Television Engineers, 95(6):629-636, 1986.
APPENDIX A

| Posture | IPA | Ì | f1 | f2 | f3 | f4 | fnnf | nb | ax | ah1 | ah2 | fh2 | bwh2 | Dur (u/m) | Trans (u/m) | Categories |
| ^ | (silence) | 0 | 581 | 1381 | 2436 | 3500 | 227 | 1 | 0 | 0 | 0 | 1500 | 1000 | 90/140 | 0/0 | si |
| h | h | 0 | 500 | 1500 | 2500 | 3500 | 227 | 1 | 33 | 42 | 0 | 1500 | 1000 | 50/66 | 26/26 | as |
| uh | ə | 0 | 581 | 1381 | 2436 | 3500 | 227 | 1 | 60 | 0 | 0 | 1500 | 1000 | 46/74 | 66/66 | vd,sv,vo |
| a | ʌ | 0 | 722 | 1236 | 2537 | 3500 | 227 | 1 | 60 | 0 | 0 | 1500 | 1000 | 114/198 | 66/66 | vd,sv,vo |
| e | ɛ | 0 | 569 | 1965 | 2636 | 3500 | 227 | 1 | 60 | 0 | 0 | 1500 | 1000 | 90/174 | 66/66 | vd,sv,vo |
| i | ɩ | 0 | 569 | 1965 | 2636 | 3500 | 227 | 1 | 60 | 0 | 0 | 1500 | 1000 | 90/174 | 66/66 | vd,sv,vo |
| o | ɒ | -0.5 | 599 | 891 | 2605 | 3220 | 227 | 1 | 60 | 0 | 0 | 1500 | 1000 | 104/156 | 66/66 | vd,sv,vo,ba |
| u | ɷ | 0.25 | 376 | 950 | 2440 | 3320 | 227 | 1 | 60 | 0 | 0 | 1500 | 1000 | 74/162 | 66/66 | vd,sv,vo,ba |
| aa | æ | -0.5 | 748 | 1746 | 2460 | 3450 | 227 | 1 | 57 | 0 | 0 | 1500 | 1000 | 126/182 | 66/66 | vd,vo |
| ee | i | 0.5 | 285 | 2373 | 3088 | 3700 | 227 | 1 | 60 | 0 | 0 | 1500 | 1000 | 124/212 | 66/66 | vd,lv,vo |
| er | ɜ | -0.25 | 581 | 1381 | 2436 | 3500 | 227 | 1 | 60 | 0 | 0 | 1500 | 1000 | 198/252 | 66/66 | vd,lv,vo |
| ar | ɑ | -0.5 | 677 | 1083 | 2540 | 3410 | 227 | 1 | 60 | 0 | 0 | 1500 | 1000 | 158/272 | 66/66 | vd,lv,vo,ba |
| aw | ɔ | -0.5 | 449 | 737 | 2635 | 3700 | 227 | 1 | 60 | 0 | 0 | 1500 | 1000 | 170/290 | 66/66 | vd,lv,vo,ba |
| uu | u | 0.50 | 309 | 939 | 2320 | 3320 | 227 | 1 | 60 | 0 | 0 | 1500 | 1000 | 120/234 | 66/66 | vd,lv,vo,ba |
| r | r | 0 | 240 | 1100 | 1300 | 2200 | 227 | 1 | 54 | 0 | 0 | 1500 | 1000 | 56/86 | 40/40 | co,gl,vo |
| w | w | 0.25 | 340 | 650 | 2500 | 3320 | 227 | 1 | 60 | 0 | 0 | 1500 | 1000 | 48/68 | 34/34 | co,gl,vo |
| l | l | 0 | 320 | 1700 | 2300 | 3500 | 227 | 1 | 54 | 0 | 0 | 1500 | 1000 | 72/84 | 22/22 | co,gl,vo |
| y | j | 0.5 | 240 | 2400 | 2700 | 3700 | 227 | 1 | 60 | 0 | 0 | 1500 | 1000 | 54/84 | 38/38 | co,gl,vo |
| m | m | 0 | 190 | 950 | 2000 | 3320 | 250 | 0.5 | 54 | 0 | 0 | 1500 | 1000 | 62/114 | 16/16 | na,co,vo |
| b | b | -2.0 | 100 | 300 | 2000 | 3320 | 227 | 1 | 47 | 0 | 0 | 1500 | 1000 | 72/82 | 16/16 | co,st,ch,vo |
| p | p | -2.0 | 190 | 460 | 2000 | 3320 | 227 | 1 | 0 | 0 | 0 | 1500 | 1000 | 86/92 | 18/18 | co,st,ch |
| n | n | 0 | 190 | 1850 | 3300 | 4200 | 250 | 0.25 | 57 | 0 | 0 | 1500 | 1000 | 60/88 | 26/26 | na,co,vo |
| d | d | -2.0 | 100 | 1850 | 3300 | 4200 | 227 | 1 | 52 | 0 | 0 | 1500 | 1000 | 58/86 | 18/18 | co,st,ch,vo |
| t | t | -2.0 | 190 | 1780 | 3300 | 4200 | 227 | 1 | 0 | 0 | 0 | 1500 | 1500 | 60/80 | 24/24 | co,st,ch |
| ng | ŋ | 0.5 | 190 | 2450 | 3100 | 4000 | 250 | 0.4 | 57 | 0 | 0 | 1500 | 1000 | 50/68 | 28/28 | na,co,vo |
| g | g | -2.0 | 100 | 2300 | 2500 | 3500 | 227 | 1 | 41 | 0 | 0 | 1500 | 1000 | 60/80 | 30/30 | co,st,vo |
| k | k | -2.0 | 100 | 2300 | 2500 | 3500 | 227 | 1 | 0 | 0 | 0 | 1500 | 1000 | 82/104 | 30/30 | co,st |
| s | s | -2.0 | 190 | 1300 | 3300 | 4450 | 227 | 1 | 0 | 0 | 18 | 7000 | 200 | 78/112 | 30/30 | co |
| z | z | -1.0 | 190 | 1300 | 3300 | 4450 | 227 | 1 | 54 | 0 | 18 | 7000 | 200 | 56/84 | 30/30 | co,vo |
| sh | ʃ | -2.0 | 190 | 2000 | 2700 | 6000 | 227 | 1 | 0 | 0 | 56 | 1500 | 500 | 64/124 | 24/24 | co |
| zh | ʒ | 0 | 190 | 2000 | 2700 | 5000 | 227 | 1 | 51 | 0 | 56 | 1500 | 500 | 58/82 | 30/30 | co,vo |
| f | f | -1.0 | 100 | 690 | 2600 | 4000 | 227 | 1 | 0 | 0 | 30 | 1600 | 1000 | 90/118 | 40/40 | co |
| v | v | -2.0 | 100 | 690 | 2600 | 4000 | 227 | 1 | 45 | 0 | 30 | 1600 | 1000 | 68/88 | 44/44 | co,vo |
| th | ɵ | -1.0 | 100 | 2000 | 2850 | 3950 | 227 | 1 | 0 | 0 | 33 | 1350 | 1000 | 74/136 | 40/40 | co |
| dh | ð | -2.0 | 100 | 1500 | 2850 | 3950 | 227 | 1 | 44 | 0 | 27 | 1350 | 1000 | 80/108 | 44/44 | co,vo |
| ch | ʧ | -1.0 | 190 | 2000 | 2700 | 6000 | 227 | 1 | 0 | 0 | 0 | 1500 | 1000 | 118/118 | 24/24 | co,af |
| j | ʤ | -1.0 | 190 | 2000 | 2700 | 5000 | 227 | 1 | 40 | 0 | 0 | 1500 | 1000 | 94/100 | 40/40 | co,vo,af |

Table A1: Rule table for speech postures needed for synthesis (male voice), clarified. (Data as used in DEGAS formant-based synthesis [Manzara 92]—numerical entries not shown in bold type in the original are default values and usually not significant for the posture.) Categories: all postures have their own identity and all are phones (ph).

Column headings: Ì microintonation effect; f1 formant 1 target frequency; f2 formant 2 target frequency; f3 formant 3 target frequency; f4 formant 4 target frequency; fnnf nasal formant frequency; nb nasal formant bandwidth; ax larynx amplitude; ah1 aspiration amplitude; ah2 frication amplitude; fh2 frication frequency peak; bwh2 frication filter bandwidth; Dur (u/m) posture duration, unmarked/marked (msec); Trans (u/m) transition duration, unmarked/marked (msec).

Categories key: ph phone; ma marked; vd vocoid; gl glide; co contoid; di diphthong; st stopped; ch checked; lv long-vowel; sv short-vowel; as aspirate; fr fricative; vo voiced; af affricate; na nasal; si silence; ba back vowel.

Notes

The posture durations are given for both unmarked and marked variants. Provision is made for marked/unmarked transition durations, but in this dataset the transition durations are the same whether marked or not. Special features (e.g. noise bursts) are added by rules according to the context.
Assume we are constructing a diphone out of postures p and p+1. Then:

    Diphone duration = QSSp/2 + TT(p to p+1) + QSSp+1/2

(a) The transition time for a vowel to other sound, or other sound to vowel, is specified by the other sound. Thus TT(p to p+1) is given by:

    TT(p to p+1) = TTp+1   for vowel to other sound
    TT(p to p+1) = TTp     for other sound to vowel

The QSS time for a vowel to other sound, or other sound to vowel, is given by taking the transition time already allocated out of the vowel total time. Then:

    QSSvowel = Totalvowel - TT(p to p+1)
    QSSother = Totalother  (from the table)

(b) For diphthongs, triphthongs, etc. (i.e. vowel-to-vowel), we are dealing with a special case, identified by encountering a special symbol for each part of the compound sound, probably supplied by a preparser. Let dp represent a posture component of a diphthong, triphthong, and so on. Diphthongs are not directly represented in the data tables as they involve a succession of vowel postures—the component vowel postures are close to similar isolated vowels but not necessarily identical. There are three cases in constructing diphones involving these postures: (i) dp to p, (ii) p to dp, and (iii) dp to dp. Cases (i) and (ii) are handled as described in (a), treating the dp as the vowel. For case (iii), QSSp and QSSp+1 are taken from the table and then:

    TT(p to p+1) = Totaldp - QSSp/2 - QSSp+1/2

(c) The transition time from other sound to other sound is a fixed 10 msec, taken out of the QSS of the following sound (this is pretty arbitrary, but works and is probably not critical), as sketched below:

    TT(p to p+1) = 10
    QSSp+1 = QSSp+1 - 10

(d) The target values for /k, g, ŋ/ must be modified before back vowels /ɒ, ɷ, ɑ, ɔ, u/. The second formant target should be put 300 Hz above the value for the vowel. This is an example of a rule that uses the posture categories.

(e) The target values for /h/ or silence to posture, or posture to /h/ or silence, are taken from the posture (i.e. the transitions are flat).

(f) The shape of a given transition is determined by which of the postures involved is checked or free. There is very little deviation from the steady-state targets of a checked posture during the QSS, and the transition begins or ends fairly abruptly. For a free posture, there is considerable deviation from the target values during the QSS. To obtain an appropriate shape, the slopes are computed so that if the slope during a checked posture is m, then the slope during a free posture is 3m and the slope during the transition is 6m. Then, if the total movement from the target in posture 1 to the target in posture 2 is D:

    D = s1.QSSp + s2.TT(p to p+1) + s3.QSSp+1

where s1, s2 and s3 are the slopes as described, according to the type of p and p+1. Since the relationships between s1, s2 and s3 are known in terms of m, as is the value of D, the value of m may be calculated and substituted for the individual segment slopes.

[Figure: movement during the QSS of p1 (slope s1); transition movement p to p+1 (slope s2); movement during the QSS of p2 (slope s3)]
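A minimal sketch (our own, not the original DEGAS code) of how rules (a) and (c) above combine the tabulated durations and transition times into diphone timings. The posture records and the vowel test are simplified assumptions; the numbers are illustrative values in the spirit of Table A1.

```python
# Simplified posture records: total (unmarked) duration, transition time (TT),
# and category tags, after Table A1. Values here are illustrative only.
POSTURES = {
    "aa": {"total": 126, "tt": 66, "cats": {"vd", "vo"}},
    "k":  {"total": 82,  "tt": 30, "cats": {"co", "st"}},
    "s":  {"total": 78,  "tt": 30, "cats": {"co"}},
}

def is_vowel(p):
    return "vd" in POSTURES[p]["cats"]          # vocoids treated as vowels here

def diphone_timing(p, q):
    """Return (QSS_p, TT_p_to_q, QSS_q) in msec following rules (a) and (c)."""
    P, Q = POSTURES[p], POSTURES[q]
    if is_vowel(p) and not is_vowel(q):         # rule (a): vowel -> other
        tt = Q["tt"]
        return P["total"] - tt, tt, Q["total"]
    if not is_vowel(p) and is_vowel(q):         # rule (a): other -> vowel
        tt = P["tt"]
        return P["total"], tt, Q["total"] - tt
    if not is_vowel(p) and not is_vowel(q):     # rule (c): other -> other
        return P["total"], 10, Q["total"] - 10
    raise ValueError("vowel-to-vowel is handled by the diphthong rule (b)")

def diphone_duration(p, q):
    qss_p, tt, qss_q = diphone_timing(p, q)
    return qss_p / 2.0 + tt + qss_q / 2.0

# e.g. /aa/ to /k/ as in "bad kangaroo"
print(diphone_duration("aa", "k"))
```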
Reference

[Manzara 92] Leonard Manzara and David R. Hill. DEGAS: a system for rule-based diphone synthesis. Proc. 2nd Int. Conf. on Spoken Language Processing, Banff, Alberta, Canada, October 12-16 1992, 117-120.