The Influence of Vowel Quality on Stream Segregation of Synthetic Vowels
Using diphone synthesis, the Danish vowels [i:], [e:], [y:], [u:] and [ɑ:] were synthesized with different f0 values and used in an auditory stream segregation experiment investigating the perception of native speakers of Danish. The vowels were presented in a galloping tones arrangement, and the f0 streaming threshold for the vowels relative to [i:] was measured.
Vowel Quality
We distinguish between the produced vowel quality and the perceived vowel quality. The produced vowel quality is a continuum in the possible vowel space and can be measured acoustically. The perceived vowel quality is categorical, influenced by the phonology of each language, and therefore differs between listeners with different mother tongues. Figure 2 shows the five Danish vowels used in the experiment in a 3D version of the traditional auditory/articulatory vowel diagram, as well as the approximate relation to the three lowest formants in the vowels’ frequency spectrum. Though the vowels represent five different (meaning-changing) phonemes in Danish, the distances between them are not uniform. The high front vowels are more tightly spaced, and non-native speakers of Danish often have difficulties both producing (pronouncing) and perceiving (hearing) them correctly [5], [4].
Results
For the data processing, the answer “one stream” was assigned the value 0, the answer “two streams” the value 1, and the answer “in doubt or can hear both” the value 0.5. All responses were analysed using a 4-way mixed-model ANOVA with listener as a random effect. Significant main effects of vowel, f0 and listener were found (p < 10^-5). Tukey HSD post-hoc tests showed all vowels to be significantly different, and f0 values were in general significantly different when they were more than one step apart. The significant interaction f0*vowel (p < 10^-5) shows that the streaming threshold differs across the five vowels.
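The response coding described in the Results text can be sketched as follows. This is a minimal illustration only: the function name and the example answers are hypothetical, and the poster's actual analysis was an ANOVA, not a simple mean.

```python
from statistics import mean

# Numeric coding of the three response options, as stated in the text.
SCORES = {
    "one stream": 0.0,
    "two streams": 1.0,
    "in doubt or can hear both": 0.5,
}


def mean_response(answers):
    """Return the mean coded response for a list of answer strings."""
    return mean(SCORES[a] for a in answers)


# Hypothetical answers from four listeners for one ABA sequence.
example = ["one stream", "two streams", "in doubt or can hear both", "two streams"]
print(mean_response(example))  # 0.625
```

A mean near 0 then indicates that the sequence was mostly heard as one stream, and a mean near 1 that it was mostly heard as two.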
University of Copenhagen, Department of Nordic Studies and Linguistics
Galloping Tones and Vowels
When two tones A and B are presented in a galloping arrangement as shown in figure 1, they are perceived as one stream when the difference in fundamental frequency is small (upper panel) and as two streams when the frequency difference is large (lower panel). The streaming threshold is the difference in fundamental frequency at which the perception splits up into two streams. When tones are substituted with vowels, streaming is influenced not only by the difference in fundamental frequency, but also by difference in spectral distribution – the primary acoustic cue to the perceived vowel quality [1], [6].
Contact Lars Bramsløw
[email protected]
Presented at the International Hearing Aid Conference (IHCON) Lake Tahoe, CA, USA, August 10-14, 2016
Figure 1. Two tones A and B in a galloping rhythm are perceived as one stream when the frequency separation is small and as two streams when the frequency separation is large. v is the tone or vowel duration and t1 and t2 are the triplet time constants.
Diphone Synthesis
A diphone is an interval of speech from the middle of a phone (or silence) to the middle of the successive phone (or silence). Diphone synthesis is made from diphones extracted from real speech, often from a speaker’s reading of nonsense syllables. An advantage of diphone synthesis is that it preserves the transitions between the connected phones [7].
[Figure 3, panel [y: - i: - y:]: response (0 to 1) against f0 difference at 0.0, 2.6, 5.3, 7.9, 10.5, 13.2 and 15.8 semitones re 87 Hz; annotated value 7.0.]
Figure 4 shows the streaming thresholds for [i:], [e:] and [y:], the three vowels for which it was possible to measure a streaming threshold.
[Figure 3, panels [u: - i: - u:] and [ɑ: - i: - ɑ:].]
Figure 2. The five Danish vowels used in the experiment in a 3D version of the traditional auditory/articulatory vowel diagram. The green axes show the approximate relation to the formant values.
Research Question and Hypotheses
Question: What is the effect of (produced) vowel quality on f0-based streaming thresholds between the Danish vowels [i:], [e:], [y:], [u:] and [ɑ:] for native Danish-speaking listeners?
Hypothesis 1: All streaming thresholds are the same because the five vowels represent five different phonemes in Danish. This implies language-specific linguistic processing in the brain.
Hypothesis 2: Streaming thresholds correlate with the acoustic distance of the vowels, expressed as the Euclidean distance based on F1, F2 and F3 (independent of f0). This implies psychoacoustic processing of the vowels similar to that of other tone complexes.
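The acoustic distance named in Hypothesis 2 can be sketched as a Euclidean distance over the three lowest formants. The function name and the formant values below are hypothetical and for illustration only; the poster does not state whether the distance was computed in Hz or on another frequency scale, so Hz is an assumption here.

```python
import math


def formant_distance(a, b):
    """Euclidean distance between two (F1, F2, F3) tuples in the same unit."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical formant values in Hz (NOT the measured values from the study).
vowel_b = (250.0, 2200.0, 3000.0)
vowel_a = (250.0, 1900.0, 2600.0)
print(formant_distance(vowel_a, vowel_b))  # 500.0
```

Because f0 does not enter the calculation, the distance depends only on the vowel pair, in line with the "independent of f0" remark in the hypothesis.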
Synthesized Vowel Stimuli
Using Festival [2], the stimuli were made with diphone synthesis based on real speech from a native speaker of Danish. The vowel formants of the synthetic speech were measured and confirmed to approximately match the distribution shown in figure 2.
Copenhagen Business School, Danish Center for Applied Speech Technology
[Figure 3, panel [e: - i: - e:]; annotated value 12.7.]
Experiment
Eriksholm Research Centre
A psychometric function was fitted to the data as seen in figure 3, and the streaming threshold was defined as the midpoint of the psychometric function. As the curve never crosses the midpoint for [u:] and [ɑ:], we have to conclude that it was not possible to measure a streaming threshold for these two vowels.
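The threshold definition above can be sketched as follows. The poster does not specify the form of the psychometric function or the fitting procedure, so the logistic shape, the grid search, and all names and parameter ranges here are assumptions chosen for illustration.

```python
import math


def logistic(x, midpoint, slope):
    """Logistic curve rising from 0 to 1; equals 0.5 at `midpoint`."""
    return 1.0 / (1.0 + math.exp(-(x - midpoint) / slope))


def fit_threshold(xs, ys):
    """Least-squares grid search over midpoint and slope.

    Returns the fitted midpoint, or None when it falls outside the
    tested f0 range (the curve never crosses 0.5 within the data).
    """
    best = None
    for mi in range(-160, 321):      # candidate midpoints: -16.0 to 32.0 semitones
        m = mi / 10
        for si in range(2, 51):      # candidate slopes: 0.2 to 5.0 semitones
            s = si / 10
            err = sum((logistic(x, m, s) - y) ** 2 for x, y in zip(xs, ys))
            if best is None or err < best[0]:
                best = (err, m)
    midpoint = best[1]
    return midpoint if min(xs) <= midpoint <= max(xs) else None


# f0 steps as read from the axis ticks of figure 3; responses are synthetic.
steps = [0.0, 2.6, 5.3, 7.9, 10.5, 13.2, 15.8]
responses = [logistic(x, 7.9, 1.0) for x in steps]
print(fit_threshold(steps, responses))  # 7.9
```

Returning None when the fitted midpoint lies outside the tested range mirrors the situation described for [u:] and [ɑ:], where no threshold could be measured.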
No influence from the phonological status of the vowels in Danish was found, but results support the hypothesis that streaming thresholds correlate with the vowels’ acoustic distance.
Authors
Morten Eigil Holm, Lars Bramsløw, Peter Juel Henrichsen
Psychoacoustic and Phonetic Background
Purpose and Scope
Isolated phones, such as vowels, are relevant for obtaining detailed information on the effects of hearing aid signal processing on speech perception. Synthetic speech is attractive because specific parameters, such as fundamental frequency, can be manipulated directly. In this study employing theory and method from psychoacoustics, phonetics and speech technology, we investigated how the vowel quality and fundamental frequency of synthesized vowels affected stream segregation for normal-hearing listeners.
Figure 3. Streaming response as a function of f0 difference (semitones re 87 Hz), fitted with a psychometric function. [Annotated value in the figure: 13.8.]
Discussion
Figure 4. Streaming thresholds for [i:], [e:] and [y:].
Figure 5 shows the streaming threshold plotted against the Euclidean distance of the vowels in the ABA sequence. The regression line shows the correlation between streaming threshold and acoustic distance for the three vowels that gave measurable streaming thresholds. Although the fit is good, it should not be seen as more than an indication, as it is based on just three points. A possible explanation of why it was not possible to measure a streaming threshold for [u:] and [ɑ:] can also be seen from figure 5: if the regression line for the three vowels with measurable streaming thresholds were extrapolated to the Euclidean distance for [u:] or [ɑ:] on the x-axis, it would give a negative streaming threshold, which is meaningless. This could be because it is not possible to measure streaming thresholds in the full vowel space using ABA sequences with just one vowel duration and one set of triplet time constants (see figure 1).
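The extrapolation argument can be illustrated with a small least-squares sketch. The distance and threshold values below are hypothetical placeholders chosen to lie on a falling line; they are not the measured values from the study, and the unit of the distance is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical (distance, threshold) pairs for three vowels.
distances = [0.0, 200.0, 600.0]    # Euclidean formant distance, assumed unit
thresholds = [14.0, 12.0, 8.0]     # streaming threshold in semitones

intercept, slope = fit_line(distances, thresholds)

# Extrapolating to a much larger (hypothetical) distance gives a value below zero.
predicted = intercept + slope * 2000.0
print(predicted)  # roughly -6, i.e. a negative and therefore meaningless threshold
```

With only three points the line is weakly constrained, which is why the text treats the fit as an indication only.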
Figure 5. Streaming threshold plotted against Euclidean distance from [i:]. The Euclidean distances for [u:] and [ɑ:] are indicated with arrows.
Conclusions
Hypothesis 1, streaming threshold is influenced by phonological status, must be rejected due to the different streaming thresholds across vowels. This could be because vowels in isolation are not perceived as language. To investigate this further, syntheses of larger linguistic units, e.g. syllables or whole words, could be used as stimuli.
Hypothesis 2, streaming threshold correlates with acoustic distance, is supported by the results, but it is based on just three data points. It was not possible to design stimuli that gave measurable streaming thresholds in the full vowel space, and an experiment focusing on just the front vowels, e.g. [i y ɛ ø a ɶ], could investigate this further.
The large f0 span needed in this experiment made many of the stimuli sound unnatural, and a better synthetic voice without this limitation should be designed for future research. This could be done by using real speech tokens at more than one f0 value for the building of the speech synthesis.
Method
Listeners
The listeners were 12 normal-hearing native speakers of Danish. Before the experiment, recordings of each listener reading Danish sentences containing the five vowels were made, and from measurements of the vowel formants it was confirmed that all listeners (as expected) themselves produced vowel qualities corresponding to those in figure 2.
The vowels [i:], [e:], [y:], [u:] and [ɑ:] were synthesized with a duration of 150 ms and arranged in galloping ABA sequences with the triplet time constants t1 = 275 ms and t2 = 1100 ms (see figure 1). B was always [i:] at a fundamental frequency of 87 Hz. A was one of the five vowels in 2.6 semitone steps relative to 87 Hz as seen in table 1; this spacing ensured inharmonic intervals. The resulting 35 sequences were presented diotically on headphones. For each sequence the listeners reported if they perceived one stream, two streams, or, as a third possibility, if they were in doubt or could hear both.
Table 1. f0 values in ABA sequences.
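The f0 values of the A vowel described in the method text can be sketched from the 87 Hz reference and the semitone steps. The step list is taken from the rounded axis ticks of figure 3, which is an assumption about the exact values used; the names below are hypothetical and the printed Hz values are computed here, not quoted from the poster's table 1.

```python
BASE_F0_HZ = 87.0
# Semitone steps re 87 Hz, as read from the figure 3 axis (rounded values).
SEMITONE_STEPS = [0.0, 2.6, 5.3, 7.9, 10.5, 13.2, 15.8]
VOWELS = ["i:", "e:", "y:", "u:", "ɑ:"]


def semitones_to_hz(semitones, base_hz=BASE_F0_HZ):
    """Convert a semitone offset relative to base_hz into a frequency in Hz."""
    return base_hz * 2.0 ** (semitones / 12.0)


# One ABA sequence per (vowel A, f0 step); B is always [i:] at 87 Hz.
sequences = [(v, s, semitones_to_hz(s)) for v in VOWELS for s in SEMITONE_STEPS]
print(len(sequences))  # 35

for s in SEMITONE_STEPS:
    print(f"{s:5.1f} semitones -> {semitones_to_hz(s):6.1f} Hz")
```

Five vowels at seven f0 steps give the 35 sequences mentioned in the text.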
REFERENCES
[1] Bregman, A.S., 1990. Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge: MIT Press, pp. 86ff and 529ff.
[2] Festival, 2015. The Festival Speech Synthesis System. Available at: http://www.cstr.ed.ac.uk/projects/festival/.
[3] Gaudrain, E. et al., 2008. Streaming of vowel sequences based on fundamental frequency in a cochlear-implant simulation. The Journal of the Acoustical Society of America, 124(5), pp. 3076-3087.
[4] Grønnum, N., 1998. Illustrations of the IPA: Danish. Journal of the International Phonetic Association, 28(1-2), pp. 99-105.
[5] Grønnum, N., 2005. Fonetik og Fonologi, 3. udgave. København: Akademisk Forlag, pp. 56, 107 and 207ff.
[6] Moore, B.C.J. & Gockel, H., 2002. Factors Influencing Sequential Stream Segregation. Acta Acustica united with Acustica, 88(3), pp. 320-333.
[7] Schroeter, J., 2008. Basic Principles of Speech Synthesis. In J. Benesty, M. M. Sondhi, & Y. Huang, eds. Springer Handbook of Speech Processing. Berlin, Heidelberg: Springer, pp. 413-428.