Logopedics Phoniatrics Vocology. 2006; 31: 36-42. ISSN 1401-5439 print/ISSN 1651-2022 online © 2006 Taylor & Francis. DOI: 10.1080/14015430500320257

ORIGINAL ARTICLE

Nonlinear modelling of double and triple period pitch breaks in vocal fold vibration

FRITZ MENZER1,2, JONAS BUCHLI1, DAVID M. HOWARD3 & AUKE JAN IJSPEERT1

1School of Computer and Communication Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland, 2Media & Interaction Design Unit, Ecole Cantonale d’Art de Lausanne, Switzerland, and 3Media Engineering Group, Department of Electronics, University of York, UK

Abstract

This paper reports a study on short-time subharmonic pitch breaks in vocal fold vibration, which are found to be a common feature of the human voice in spoken language. The observed pitch breaks correspond to a change in periodicity of the electrolaryngograph (Lx) signal. This paper presents a nonlinear dynamical system capable of producing time-series with subharmonic pitch breaks. The resulting time-series closely resemble Lx recordings of natural speech. The system is developed on the basis of a second order linear system, which is extended with a third dimension and nonlinear coupling terms. It is suggested that improved knowledge about pitch breaks could be used in future speech synthesis systems in order to improve the naturalness of the perceived output.

Key words: Nonlinear modelling, period doubling, pitch breaks, vocal folds

Introduction

Considerable advances have been made in the electronic formant synthesis of speech, and the results can be highly intelligible (e.g., (1,2)). Speech synthesis systems are in common use in a number of applications, but the output speech is rarely mistaken for that of a human speaker. A number of aspects potentially contribute to this, including: limitations of the underlying source-filter model (3) in the case of formant synthesis; limitations of the rules used to drive the model; or an inappropriate glottal input signal. An improvement in naturalness has been demonstrated for a simple three-formant, voiced-only synthesizer when human gesture is used for the control of the underlying formant, fundamental frequency and amplitude parameters (4). The importance of the glottal source in formant speech synthesis has been discussed by Holmes (5), and understanding the particular issue of double and triple period pitch breaks in vocal fold vibration is the subject of this paper. It is suggested that improved knowledge in this area could be used in future speech synthesis systems with the potential to improve the naturalness of the perceived output.

Thus this paper turns its attention to as yet unmodelled phenomena. One example of such phenomena is double-triple period pitch breaks, which occur in vocal fold vibration and have not, to the best of our knowledge, been modelled before. We believe, for the reasons detailed above, that modelling such phenomena may be important for achieving natural-sounding voice synthesis. Since nonlinear models of low complexity can successfully model the sound perception system (6,7), it seems natural to turn to nonlinear dynamical systems to model systems producing sound. A key advantage of nonlinear dynamical systems is that simple systems can be used to explain complex behaviour. In particular, the continuous, unspecific change of simple (i.e., scalar) control inputs (so-called ‘bifurcation’ or ‘control’ parameters) can induce series of fundamental, qualitative changes in the behaviour of such a system. The sequence of such changes, when continuously varying the bifurcation parameter, is

Correspondence: Jonas Buchli, Biologically Inspired Robotics Group, School of Computer and Communication Sciences, INN, Station 14, EPFL, Swiss Federal Institute of Technology, Lausanne, CH-1015 Lausanne, Switzerland. E-mail: [email protected]


called the ‘bifurcation cascade’. Bifurcation cascades allow the classification of dynamical systems. Systems with the same bifurcation cascade can qualitatively be modelled by canonical systems showing this bifurcation cascade. Thus, when modelling with dynamical systems, bifurcation behaviour is an important aspect to consider. Nonlinear dynamical systems have been successfully applied to a wide range of modelling problems in physics, biology, psychology and, more recently, the social sciences.

Curiously, the double-triple period pitch breaks do not match the most commonly discussed bifurcation cascades. Whilst an important class of nonlinear systems shows successive period doubling (a so-called period doubling cascade), changes of periodicity with non-integer ratios are also observed in vocal fold vibration, particularly the change from a double to a triple period and vice versa. Here we propose a simple three-dimensional system that allows the production of time series that exhibit double-triple period pitch breaks. The model is essentially a nonlinear oscillator which can produce time series with single, double, and triple periods. What is new in this particular oscillator is that, with a simple control input, ‘direct’ pitch breaks between these periods can be induced.

We qualitatively compare the time series of our system with time series data recorded from human subjects using an electrolaryngograph (8), whose output waveform (Lx) monitors the variation of vocal fold contact area. A simultaneously recorded microphone signal was used to identify the speech sounds in which pitch breaks occurred. In the Lx signal, changes to higher order periods (of which period doubling is a special case) can be observed as a sequence of peaks that all have the same height turning into a sequence of small and large peaks. The occurrence of higher order periods implies the generation of subharmonics.
The higher order periods observed in Lx signals often translate to a perceived pitch break in the speech signal, which is often also perceived as ‘roughness’ or as a ‘register break’. Over the past 15 years, nonlinear phenomena have received some attention in the field of voice research. Mende et al. (9) observed bifurcations in the cries of newborn infants, and Baken (10) showed that increased nonlinearity in the stress-strain behaviour of vocal fold tissue can explain the fluctuations that are typical of the voices of elderly people. More recently, Laje et al. (11) showed that a nonlinear dissipation term in a one-mass model allows vocal fold movement to be described as relaxation oscillations, and that coupling this model to a vocal tract model can lead to period doubling.
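The peak-height signature of higher order periods (peaks of equal height turning into a repeating pattern of small and large peaks) suggests a simple way to read off the period multiple from a sampled waveform. The following Python sketch is illustrative only and is not the procedure used for the recordings in this study; the function names, the tolerance tol and the synthetic test signals are all assumptions.

```python
import math


def peak_heights(samples):
    """Return the heights of the local maxima in a sampled signal."""
    return [
        samples[i]
        for i in range(1, len(samples) - 1)
        if samples[i - 1] < samples[i] >= samples[i + 1]
    ]


def period_multiple(heights, tol=0.05, max_multiple=5):
    """Smallest n such that the peak heights repeat every n peaks (within tol).

    Returns None if no n up to max_multiple fits or too few peaks are given.
    """
    for n in range(1, max_multiple + 1):
        if len(heights) < 2 * n:
            return None  # not enough peaks to confirm an n-peak pattern
        if all(abs(heights[i] - heights[i + n]) <= tol
               for i in range(len(heights) - n)):
            return n
    return None


# Synthetic stand-ins for an Lx trace (not measured data): a plain sinusoid,
# and the same sinusoid with an added subharmonic at half the frequency.
dt = 0.01
t = [k * dt for k in range(6000)]
single = [math.sin(tk) for tk in t]
double = [math.sin(tk) + 0.5 * math.sin(tk / 2) for tk in t]

print(period_multiple(peak_heights(single)))
print(period_multiple(peak_heights(double)))
```

On these synthetic signals the sketch reports a period multiple of 1 for the plain sinusoid and 2 for the signal with the added subharmonic. Real Lx data would need noise-robust peak picking and a tolerance matched to the recording.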


This article is structured as follows. First we give an introduction to nonlinear modelling, followed by our observations on pitch breaks in spoken voice. Our system is then presented, along with numerical results, followed by a more in-depth analysis of the system. We then discuss the observed phenomena and give an outlook on future research on the proposed system.

Nonlinear modelling

In order to understand dynamical systems and how they can serve for modelling physical systems, the concepts of attractors, bifurcations and dependence on initial conditions are important. We can illustrate these concepts with a simple model. Imagine a ball on a landscape, the whole landscape being immersed in a very viscous liquid (e.g., oil) and the ball being very light. Thus, the movement of the ball will be damped and its velocity is proportional to the slope it is currently on. In such a system the ball rolls down into the nearest valley, i.e., to the lowest point near its starting point, and stops there. Imagine now a simple landscape with only one valley (Figure 1a). This means that wherever the ball is placed in the beginning, in the end it will always be found in the same place, namely at the lowest point of the big valley. This point is the attractor of the system. Thus, this is a system with one attractor (the valley). The attractor is reached independently of the initial conditions (the starting point of the ball).

Now take a slightly more complex landscape, with two (or more) valleys separated by hills (Figure 1b). This system has several possible stable end states, and which of the valleys is reached depends on the initial conditions (i.e., on which side of which hill the ball starts). Thus, a system can have more than one attractor, and this introduces dependence on initial conditions.

Let us further modify the landscape. Instead of being fixed, the landscape can now be transformed by some external influence. In other words, we can imagine continuously transforming a single-valley landscape into one with several valleys. There will be a point at which a new valley is just about to appear or disappear; this is called a bifurcation point. The behaviour of the system changes fundamentally and dramatically at that point. In particular, stable states can become unstable. Such a transformation is illustrated in Figure 1c. The attractors are marked by fine lines; there is a point where the single attractor splits up into two stable states. This is the bifurcation point. In dynamical systems such changes can often be controlled by just a single parameter; this parameter is then called the control or bifurcation parameter.


Figure 1. a) System showing one attractor. The ball rolls down the slope as indicated by the arrow, and the bottom of the valley is the attractor of the system. b) System with two attractors. Depending on where the ball is placed initially, it comes to rest in one or the other valley. The hill in between separates the two basins of attraction. c) Bifurcation from a system with one attractor to a system with two attractors. Note that the single attractor becomes unstable at the bifurcation point and two new attractors emerge.
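The ball-and-landscape picture can be tried out numerically. The Python sketch below is a minimal illustration under stated assumptions: the landscape V(x) = x^4/4 - mu*x^2/2, the bifurcation parameter name mu, and the step size are all chosen here for illustration and do not come from the article. For mu < 0 the landscape has one valley, for mu > 0 it has two, and the ball's velocity is taken to be minus the local slope.

```python
def slope(x, mu):
    """Slope dV/dx of the assumed landscape V(x) = x**4 / 4 - mu * x**2 / 2."""
    return x ** 3 - mu * x


def settle(x0, mu, dt=0.01, steps=5000):
    """Overdamped ball: velocity is minus the local slope (forward Euler)."""
    x = x0
    for _ in range(steps):
        x -= dt * slope(x, mu)
    return x


# One valley (mu < 0): both starting points end in the same place.
# Two valleys (mu > 0): the end state depends on the starting side.
for mu in (-1.0, 1.0):
    for x0 in (-0.5, 0.5):
        print(f"mu={mu:+.1f}  start={x0:+.1f}  end={settle(x0, mu):+.3f}")
```

With one valley the end state does not depend on the starting point; with two valleys the ball ends in the valley on whichever side of the hill it started, which is the dependence on initial conditions described in the text.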

As a matter of fact, bifurcations can occur repeatedly when the bifurcation parameter is varied further. When we plot the attractors against the bifurcation parameter, a tree-like structure can emerge. This structure is called the bifurcation cascade. It tells us how the system behaves in steady state for different choices of the bifurcation parameter. It has been found that physically very different systems (as different as hydrodynamic systems, chemical reactions and mechanical systems) show qualitatively the same bifurcation cascade. Their dynamic behaviour can be modelled to a large extent with the simplest system showing the same cascade.

Dynamical systems can have more sophisticated attractors, but these cannot be depicted with the above landscape-ball model (i.e., either the valley would need to be infinite or wrap over, or the ball would need the ability to jump). Probably the simplest of these attractors is the limit cycle. Systems exhibiting a limit cycle have an oscillatory steady state, i.e., the attractor is a ‘dynamic state’. Mathematically, such systems can be described in very simple terms, and the concepts of attractors, bifurcations and independence of initial conditions generalize naturally. In such systems, more interesting bifurcations can appear: e.g., stable fixed point behaviour can be transformed into limit cycle behaviour, or limit cycles can be transformed into other limit cycles. It is worth noting that systems showing more than one attractor, or a limit cycle, are necessarily nonlinear (this can be shown by plotting the slope: it will no longer be a straight line).

Dynamical systems and their attractor properties are interesting for physical modelling because, in natural systems, there are always fluctuations present, and only processes that are robust are observed. The attractor behaviour of dynamical systems is naturally robust, i.e., the ball returns to the attractor state when it is moved slightly away from it, or may go into a new attractor if it is kicked further away. It has been realized that many real systems can be modelled to a large extent with low dimensional nonlinear dynamical systems. A big advantage is that the modelling includes not only static behaviour but also dynamic behaviour. There are two ways of influencing the system: 1) perturbations that can make the system switch attractor, and 2) changing the attractor structure by varying bifurcation parameters. In this article we investigate the second possibility.

Observations of pitch breaks

In this work, Lx and speech signals from six different trained phoneticians (three female and

three male) were used. They were recorded via an electrolaryngograph and a Bruel and Kjaer half-inch omni-directional measurement microphone in an anechoic room, reading a speech excerpt containing five sentences (see Appendix). The speakers took between 40 and 52 seconds to read the sentences. All of the speakers from time to time produced periodic signals with multiple Lx cycles per period (one cycle being one closure and opening sequence), but the overall number of occurrences varied greatly between speakers. The speaker who was least prone to multiple periods produced them only twice during the whole excerpt, whilst the speaker with the most multiple periods produced them on 31 different occasions. The higher order periods we observed lasted for durations ranging from 20 ms to 240 ms.

The multiple periods are particularly common with the vowels in ‘her’ and ‘score’ as well as with words containing ‘r’. In these cases, the pitch break in the voice can often be perceived. A typical example can be seen in Figure 2, where the Lx signal starts with a double period, then shows a triple period for a short time before undergoing another pitch break to a single period. Another case where multiple periods can occur is when the mouth closes, as in the voiced bilabial plosive /b/. In this case the pitch break cannot be heard. In some cases the pitch break can also be somewhat obscured when adjacent to a fricative such as /s/. Double periods are by far the most common higher order periods and are sustained for the longest duration (see Table I). As we will see in the discussion, the nonlinear dynamical systems viewpoint gives a hint for possible explanations of this fact.

Figure 2. The vocal fold vibration can rapidly switch between single, double and triple periods. The example here corresponds to the word ‘score’, where the vibration starts with a double period, switches to a triple period and finally to a single period. Our model reproduces this behaviour from a very simple control input. Top: control input and model output; Bottom: Lx.

Nonlinear model for double and triple period pitch breaks

A nonlinear dynamical system may be used as a model for the real vocal folds in the following way: the model’s state variables correspond to the dynamically changing variables of the vocal folds (e.g., positions, velocities, etc.), while the model parameters correspond to vocal fold properties that do not change by themselves (e.g., length, weight, air pressure difference between lungs and mouth, muscle tension, etc.). The distinction between state variables and parameters depends strongly on the definition and the scope of the model. If the model also included the brain of the speaker, only a few parameters would be left, such as the text he or she is reading.

Our model is based on the assumption that something interacts with the airflow above the vocal folds. This interaction could be a constriction of the airflow, but as our model is not directly physically interpretable, the type of interaction is not specified in detail. The model has three state variables and three parameters. Two state variables are related to the position and the velocity of the vocal folds, while the third state variable is related to the transglottal air pressure difference. The parameters can be related to the air pressure difference between the lungs and the mouth, the ‘power’ of the lungs, and the strength of the interaction with the airflow. However, none of these relations has a direct physical meaning. We propose the following model (the derivation and details will be explained below):

$$
\begin{aligned}
\frac{dx}{dt} &= zx - y \\
\frac{dy}{dt} &= x + zy \\
\frac{dz}{dt} &= b(a - z) + c(\arctan(x) - \pi)(x^2 + y^2)
\end{aligned}
$$

Table I. Summary of our observations of higher order periods in spoken language. The occurrences are summed over all six speakers and indicate in how many distinct places the higher order periods were found. The maximum number of cycles indicates how many consecutive n-peak cycles were found, where n is the period multiple (e.g., 10 consecutive double periods with a total of 20 peaks were found). Clearly the double periods are the most common, followed by the triple periods.

Period multiple             2    3    4    5
Occurrences                61    9    4    1
Maximum number of cycles   10    2    2    1

We concentrate on variations of the one parameter (c) that models the strength of the interaction with the airflow. This parameter is used as a control parameter for the pitch breaks. The parameter c is

also a bifurcation parameter, which is why it can induce such fundamental changes in the behaviour of the system. As expected (because we use a nonlinear dynamical system), the system shows complex behaviour for very simple input via the control parameter. Figure 3 shows how any transition between single, double and triple periods can be achieved using simple step control inputs for c. The analogy with the real vocal folds is that suddenly changing a parameter (e.g., a muscle tension) would allow a change from a single to a triple period (or any other combination of periodicities). The transitions between single and double periods correspond to the well-known period doubling bifurcation and are very easily achieved. The observed transitions involving double and triple periods are possible because, in a certain parameter region (see Note 1) where the two attractors coexist, the state-space contains two basins of attraction. When the bifurcation parameter is changed suddenly, the system may go to one or the other attractor, depending on where the system was in its state-space at the instant when the bifurcation parameter changed. In other words, whether the system goes to a double or a triple period depends on the precise timing of the parameter change within one period of oscillation. Transposed to the real vocal

folds this would mean that very precise timing of a parameter change (such as muscle flexion at a precise instant of time) would be necessary to be able to control the generation of double or triple periods. As it is probably impossible to achieve such precise timing, double or triple periods would occur randomly, which is supported by our observations. The basin of attraction of the double period attractor is larger than the basin of attraction of the triple period attractor (data not shown). This makes double periods more likely to occur than triple periods. According to our observations of real vocal fold signals, double periods do seem to occur much more often than triple periods.

Construction of our model

Based on the assumption that the sub-harmonic pitch breaks are due to a constriction of the airflow, we constructed a dynamical system that, conceptually, models the interaction between the vibration of the vocal folds and the transglottal air pressure difference. The proposed system is based on a linear, harmonic oscillator that is coupled to a first order system, which is the same approach as that used for the construction of the Rössler system (12).

Figure 3. All possible pitch breaks between single, double and triple periods. The graph on top shows the control parameter c(t) (a step function) and the model output x(t). The graph below shows an excerpt of a measured Lx signal with the same pitch break.

Equations 1 and 2 below describe a second order damped/activated oscillatory system which forms the basis of the proposed system. The stability of this second order system is controlled by the parameter s. It is stable for s < 0 and unstable for s > 0.

$$\frac{dx}{dt} = sx - y \tag{1}$$

$$\frac{dy}{dt} = x + sy \tag{2}$$
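The stability statement can be checked by rewriting the oscillator in polar coordinates. The worked step below assumes the sign convention dx/dt = sx - y, dy/dt = x + sy for Equations 1 and 2 (the direction of rotation is an assumption; it does not affect the radial equation) and uses x = r cos θ, y = r sin θ:

$$
\begin{aligned}
r\,\frac{dr}{dt} &= x\frac{dx}{dt} + y\frac{dy}{dt} = x(sx - y) + y(x + sy) = s\,r^2
  \;\Rightarrow\; \frac{dr}{dt} = s\,r, \quad r(t) = r(0)\,e^{st}, \\
r^2\,\frac{d\theta}{dt} &= x\frac{dy}{dt} - y\frac{dx}{dt} = x(x + sy) - y(sx - y) = r^2
  \;\Rightarrow\; \frac{d\theta}{dt} = 1 .
\end{aligned}
$$

Under this reading the amplitude decays exponentially for s < 0 and grows for s > 0, while the angular frequency stays at 1 whatever the value of s.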

The oscillation of the system is meant to model the vibration of the vocal folds, while the parameter s is conceptually related to the transglottal air pressure difference: if s is greater than zero, the oscillator’s amplitude increases, and if s is smaller than zero, the amplitude decreases. A positive value of s therefore corresponds to a pressure difference that is large enough to sustain vocal fold vibration, while a negative value of s corresponds to a pressure difference that does not sustain vocal fold vibration. In order to model the constriction in the airflow, the same first order system as in the Rössler system is introduced:

$$\frac{dz}{dt} = b(a - z) \tag{3}$$

The form of Equation 3 above is based on the hypothesis that when there is no influence due to air flowing through the vocal folds, the transglottal air pressure difference tends towards an equilibrium at a value a. The parameter b determines how fast the system tends towards this equilibrium. The main interest lies in the coupling between the two systems. The coupling terms from the first order system to the linear second order system are obtained by replacing s by z in Equations 1 and 2. In other words, if z > 0 the system (x, y) will take up energy, while it will lose energy for z < 0. The coupling term in the first order system is based on the following assumptions: 1) due to the air flowing through the vocal folds, the transglottal pressure difference should decrease when the amplitude of the oscillation increases; and 2) this decrease should become important only for negative values of x. The second condition was introduced to model the difference between the open and closed phases of the vocal fold vibration. Air flows through the vocal folds, and reduces the transglottal air pressure difference, only during the open phase. It was defined arbitrarily that x < 0 corresponds to the open phase. The exact form of the coupling term was determined by trying different candidate functions satisfying the above conditions. The term that made it possible to produce the desired triple periods was $c(\arctan(x) - \pi)(x^2 + y^2)$, where c is the coupling intensity and the factor $\arctan(x) - \pi$ ‘throttles’ the rest of the term, depending on x.
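To make the ‘throttling’ concrete, the range of the factor can be written out. This is a worked check only, assuming the factor is exactly arctan(x) - π and that the coupling term is added to dz/dt with c > 0:

$$
-\frac{3\pi}{2} \;<\; \arctan(x) - \pi \;<\; -\frac{\pi}{2},
\qquad
\lim_{x \to -\infty} \bigl(\arctan(x) - \pi\bigr) = -\frac{3\pi}{2},
\quad
\arctan(0) - \pi = -\pi,
\quad
\lim_{x \to +\infty} \bigl(\arctan(x) - \pi\bigr) = -\frac{\pi}{2}.
$$

Under these assumptions the factor is negative for every x, so the term lowers z at any nonzero amplitude, and its magnitude is up to three times larger for strongly negative x (the open phase) than for strongly positive x (the closed phase).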


Thus, the final system reads as follows:

$$\frac{dx}{dt} = zx - y \tag{4}$$

$$\frac{dy}{dt} = x + zy \tag{5}$$

$$\frac{dz}{dt} = b(a - z) + c(\arctan(x) - \pi)(x^2 + y^2) \tag{6}$$
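The final system can be integrated numerically with a few lines of code. The Python sketch below is a minimal illustration, not the authors' implementation. It assumes the sign convention dx/dt = zx - y, dy/dt = x + zy, dz/dt = b(a - z) + c(arctan(x) - π)(x² + y²). The values of a and b, the initial state, the step size, the integrator (classical Runge-Kutta) and the step input are placeholders chosen here, since the article does not state them, so the output is not claimed to reproduce the figures.

```python
import math


def rhs(state, a, b, c):
    """Right-hand side of the three-dimensional model (Equations 4-6)."""
    x, y, z = state
    dx = z * x - y
    dy = x + z * y
    dz = b * (a - z) + c * (math.atan(x) - math.pi) * (x * x + y * y)
    return (dx, dy, dz)


def rk4_step(state, dt, a, b, c):
    """One classical fourth-order Runge-Kutta step with c held constant."""
    def shifted(k, h):
        return tuple(s + h * ki for s, ki in zip(state, k))

    k1 = rhs(state, a, b, c)
    k2 = rhs(shifted(k1, dt / 2), a, b, c)
    k3 = rhs(shifted(k2, dt / 2), a, b, c)
    k4 = rhs(shifted(k3, dt), a, b, c)
    return tuple(
        s + dt / 6 * (p + 2 * q + 2 * r + w)
        for s, p, q, r, w in zip(state, k1, k2, k3, k4)
    )


def simulate(c_of_t, a=1.0, b=1.0, state=(0.1, 0.0, 0.0), dt=0.01, steps=4000):
    """Integrate the model under a time-varying control input c(t).

    a, b, the initial state and dt are placeholder values, not the authors'.
    Returns the list of x samples (the model output).
    """
    xs = []
    for k in range(steps):
        state = rk4_step(state, dt, a, b, c_of_t(k * dt))
        xs.append(state[0])
    return xs


def step_input(t):
    """Hypothetical step control input: c jumps from 0.4 to 0.6 at t = 20."""
    return 0.4 if t < 20.0 else 0.6


xs = simulate(step_input)
print(len(xs), min(xs), max(xs))
```

The control input is passed in as a function of time so that different step sequences for c can be tried without touching the integrator; the values 0.4 and 0.6 are taken from the parameter region mentioned in Note 1.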

Experiments have shown that the system reacts strongly to short-time parameter variations. Figure 3 shows that a simple step input of the parameter c can produce sequences of single, double and triple periods that are very similar to measured Lx data. Two differences between the behaviour of the proposed system and real Lx signals are that our system outputs signals that are close to sinusoids (as opposed to the characteristic wave shape of Lx signals) and that, with the control inputs used here, the system changes the fundamental frequency only in steps (pitch breaks), while real Lx signals show stepwise as well as continuous changes of fundamental frequency. The continuous change of fundamental frequency could be obtained by another parameter, corresponding to the intrinsic frequency. Although this parameter is already implicitly present in the model, it is not written out explicitly (the intrinsic frequency is equal to 1).

Discussion and conclusions

In this work, the existence of pitch breaks in vocal fold vibration between single, double, triple, quadruple and quintuple periods has been demonstrated. Besides the predominant single period, double and triple periods are the most common. A nonlinear dynamical system has been presented which is capable of producing pitch breaks between single, double and triple periods driven by a simple step function as the control parameter. Within a single interval of the control parameter, single, double and triple periods all exist, as opposed to the classic period doubling cascade, where triple periods are separated from double periods by the whole period doubling cascade and a region of chaos. Therefore, with respect to the occurrence of triple periods, our system corresponds better to the behaviour of the real vocal folds. This model also raises the question of whether there may be a single mechanism responsible for pitch breaks to frequencies that are higher or lower than the normal vibration frequency.

Several observations suggest that the double and triple periods in the human voice are controlled in a similar way as in our model. The constriction modelled in the airflow could be a result, for example, of the action of the false vocal folds and/or articulatory narrowing in the oral tract. The system is particularly sensitive to the timing of parameter changes, which could explain the fact that triple periods appear seemingly randomly within certain sounds, as a


speaker is not controlling his/her muscles with millisecond precision. S/he may be able to enable triple periods voluntarily, but this relies on very precise timing over which it is unlikely that the speaker has control. The basin of attraction for the triple period attractor is smaller than that for the double period attractor, making double periods the more likely outcome; this corresponds to our observations.

An interesting observation relating to the proposed system is that the time series for the state variable z (which is related to the transglottal air pressure difference) has a dominant component at twice the fundamental frequency. This is due to the nonlinearity $(x^2 + y^2)$, which is capable of doubling the frequency of sinusoids. As this term is part of the coupling responsible for creating the double and triple periods, this may provide a hint that there could be a single mechanism that can produce both sub- and super-harmonic pitch breaks, starting from the frequency of normal vocal fold vibration.
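The frequency doubling can be made explicit with a short worked identity. Assume, purely for illustration, that the nonlinearity is x² + y² and that the orbit is an ellipse x = A cos ωt, y = B sin ωt (the amplitudes A, B and the frequency ω are illustrative symbols, not quantities from the model description):

$$
x^2 + y^2 = A^2\cos^2\omega t + B^2\sin^2\omega t
          = \frac{A^2 + B^2}{2} + \frac{A^2 - B^2}{2}\,\cos 2\omega t .
$$

Each squared term on its own contains a component at 2ω. In the sum, that component survives whenever A ≠ B, i.e., whenever the orbit in the (x, y) plane is not a perfect circle.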

Future work

In order to improve our knowledge of nonlinear phenomena in the vocal folds, further work is necessary. Perturbation studies would reveal how the real vocal folds react to perturbation. Such perturbation studies would also allow better comparisons between the model and real phenomena, thus helping to refine the model. Such a study would involve the ability to perturb the vocal folds in a controlled manner while a subject is speaking, making such an experiment a significant challenge in its own right.

This model is particularly interesting from the mathematical point of view. The complete bifurcation behaviour of the model is so far unknown (such experiments involve enormous amounts of computation time). Thus, it would be interesting to investigate the model further with a view to characterizing it completely. For example, the triple period attractor seems to undergo some bifurcations independently of the bifurcation cascade responsible for the single and double periods. Our experiments lead us to suppose that the triple period attractor also has a period-doubling cascade, repeatedly doubling the triple period and ending up at chaos. Yet it remains to be seen what the exact nature of this bifurcation cascade is and for which region of the parameter c the triple period attractor exists. This will be the topic of a future paper.

Acknowledgements

F.M. gratefully acknowledges support from a grant for exchange studies from the EPFL. This work is partially funded by a Young Professorship Award to Auke Ijspeert from the Swiss National Science Foundation (A.I. & J.B.).

Note

1. The parameter region is roughly between c = 0.4 and c = 0.6, and in this region the triple period attractor seems to undergo some bifurcations on its own, which will be discussed in a further paper (see Future work).

References

1. Holmes JN. Speech synthesis. London: Mills and Boon; 1972.
2. Keller E. Fundamentals of speech synthesis and speech recognition. Chichester: John Wiley and Sons; 1994.
3. Fant G. Acoustic theory of speech production. The Hague: Mouton; 1960.
4. Hunt AD, Howard DM, Morrison G, Worsdall J. A real-time interface for a formant speech synthesiser. Logoped Phoniatr Vocol. 2000;25(4):169-75.
5. Holmes JN. Influence of the glottal waveform on the naturalness of speech from a parallel formant synthesizer. IEEE Trans Audio Electroacoust. 1973;AU-21:298-305.
6. Eguiluz VM, Ospeck M, Choe Y, Hudspeth AJ, Magnasco MO. Essential Nonlinearities in Hearing. Phys Rev Lett. 2000;84(22):5232-5.
7. Kern A, Stoop R. Essential Role of Couplings between Hearing Nonlinearities. Phys Rev Lett. 2003;91(12):128101.
8. Abberton ERM, Howard DM, Fourcin AJ. Laryngographic assessment of normal voice: A tutorial. Clin Linguist Phon. 1989;3(3):281-96.
9. Mende W, Herzel H, Wermke K. Bifurcations and Chaos in Newborn Infant Cries. Phys Lett A. 1990;145:418-24.
10. Baken RJ. The aged voice: a new hypothesis. Voice. 1994;3(2):57-73.
11. Laje R, Gardner T, Mindlin GB. Continuous model for vocal fold oscillations to study the effect of feedback. Phys Rev E Stat Nonlin Soft Matter Phys. 2001;64(5 Pt 2):056201.
12. Rössler OE. An equation for continuous chaos. Phys Lett. 1976;57A(5):397-8.


Appendix A

The five sentences that were read by the phoneticians:

‘In language, infinitely many words can be written with a small set of letters. In arithmetic, infinitely many numbers can be composed from just a few digits, with the help of the symbol zero, the principle of position and the concept of a base. Pure systems with base five and six are said to be very rare. But base twenty occurs in English when we use “score” as in four score and seven. Eventually no system could keep pace with the decimal or arabic number system, which has ten numerals: the digits zero, one, two, three, four, five, six, seven, eight, nine and a decimal point.’