Speaker identification in a second language by university-level language students
Kirk P.H. Sullivan and Frank Schlichting, Department of Phonetics, Umeå University.
Abstract
In forensic phonetics a witness may be required to identify a speaker from voice samples in a language which is not their first language. Previous experimental work has shown that knowledge of a language affects an individual’s ability to identify speakers. However, it has recently been demonstrated that this positive effect of familiarity with the language disappears when most of the linguistic information of the target speaker’s language is removed from the stimulus materials. This paper examines whether the ability to identify a speaker in a second language increases over the course of a British four-year language degree.
Introduction
In forensic phonetics the voice line-up, a speaker identification technique, can be an important tool in assisting the conviction or the acquittal of an accused person when ‘earwitness’ evidence is available. The voice line-up can be viewed as the aural analogue of the police identification parade. In extreme cases, as Clifford (1983) pointed out, the only tangible piece of evidence available to the court might be that of an earwitness. In these cases accurate aural-perceptual voice identification (cf. misidentification) by the earwitness is crucial not only to the success of the prosecution case, but also to the defence of the innocent accused. The earwitness must not positively identify an incorrect voice from the set of different speakers making up the voice line-up.

Increasingly, as reported by Köster and Schiller (forthcoming) and Schiller, Köster and Duckworth (forthcoming), both expert witnesses and lay earwitnesses are confronted with speech material which is not in their first language. Lay earwitnesses will often be confronted with a voice line-up consisting of voice samples in a language they do not understand or in which they have only minimal second language competence. The degree of accuracy which can be attributed to their judgements is open to question.

The number of studies examining the impact of knowledge of a second language upon speaker recognition in that language is small. Goldstein et al. (1981) concluded that “voice recognition is just as good (or as poor) for foreign voices as it is for native voices”. This finding, however, contrasts with that of Thompson (1987), who found that monolingual English speakers identified English speakers significantly more successfully than they identified Spanish speakers speaking Spanish. Goggin et al. (1991) further contradicted Goldstein et al. (1981). Based on the findings of their study, Goggin et al. wrote: “voice identification is increased approximately twofold when the listener understands the language relative to when the message is in a foreign language”. A detailed critique of these studies is to be found in Schiller, Köster and Duckworth (forthcoming).

More recently Köster et al. (1995) investigated the level of competence in a language needed to have an effect on recognition abilities. They conducted a test which required listeners to recognize a German voice with which they had been familiarised prior to the test. The test groups were native German speakers, native English speakers with no German, and native English speakers with a knowledge of German. On the basis of these results the hypothesis that listeners can recognize
a speaker more reliably if they have a command of the speaker’s language was formed. However, replication of the experiment by Köster and Schiller (forthcoming) with Spanish and Chinese listeners, with and without a knowledge of German, produced less clear results. It is therefore unclear what the correlation is between the degree of competence in a language and performance in a voice line-up. The situation is further complicated by the finding of Schiller, Köster and Duckworth (forthcoming) that removing linguistic information, by using nonsense speech in a voice line-up of native German speakers, results in no significant difference in identification success between native German speakers, German-speaking native English speakers and non-German-speaking native English speakers.

This paper contributes to this field of research by examining whether there is any change in recognition ability over the eight semesters of a university language course in Swedish at a British university. Groups of students at three stages of the course are contrasted with a group with no knowledge of Swedish.
Procedure
The experiment used in this study is identical to that used by Schlichting and Sullivan (forthcoming), who carried out a range of experiments investigating whether the imitated voice poses a problem for the voice line-up.

Experimental participants
All the experimental participants had English as their first language. There were four groups of subjects. Group One (16 subjects) were all students at North Chadderton High School and had no knowledge of Swedish. Groups Two (14 subjects), Three (9 subjects) and Four (18 subjects) were composite groups of students reading Swedish at the universities of London, Surrey and Wales. Group Two were just beginning their second semester of studies, Group Three their fourth semester and Group Four their eighth semester.

Speech material
The voice line-ups in this study were constructed from a set of ten voices. These were the voice of Carl Bildt, the former Swedish statsminister (Prime Minister); a professional imitation of the voice of Carl Bildt by Göran Gabrielsson; the natural voice of Göran Gabrielsson; three amateur imitations of Carl Bildt; the natural voices of the three imitators; and one extra voice. The stimuli presented were och därför tycker jag att det är så underligt, ‘and therefore I find it so strange’, and att där vill miljöpartiet bromsa, ‘that, there the Green Party wants to drag its heels’. The basis for the selection of these stimuli is presented in Schlichting and Sullivan (forthcoming). The speech material used to familiarise the listeners was a 30 second extract from a speech Carl Bildt gave as statsminister to the Riksdag, the Swedish Parliament. A transcript is given in the Appendix to Schlichting and Sullivan (forthcoming).

The construction of the line-ups
The line-ups comprised six voices. Each voice was separated by a less-than-one-second pause (Roebuck and Wilding, 1993): the pause was around 80 msec in length. Line-ups were constructed with and without Carl Bildt’s voice, with and without the professional imitation of Carl Bildt’s voice, and with and without the natural voice of the professional imitator. Stimuli randomly selected from the set clipped from the amateur imitations and the imitators’ natural voice recordings were used to bring every line-up to six voices in length. Eight different line-up compositions were thus created for each of the two stimuli, making a total of sixteen different line-ups. The order of presentation within each line-up was random.
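The counting behind the sixteen line-ups can be made explicit with a short sketch. It is an illustration only, not the authors' procedure: the voice labels are hypothetical, the filler pool is assumed to be the six amateur recordings (three imitations and three natural voices), and fillers are assumed to be drawn without repetition within a line-up. The extra voice is left out because the text does not say how it was used.

```python
import itertools
import random

# Hypothetical labels for the three voices whose presence was varied.
KEY_VOICES = ("bildt", "professional_imitation", "imitator_natural")

# Assumed filler pool: the amateur imitations and the imitators' natural voices.
FILLERS = (
    "amateur_imitation_1", "amateur_imitation_2", "amateur_imitation_3",
    "amateur_natural_1", "amateur_natural_2", "amateur_natural_3",
)

STIMULI = (
    "och därför tycker jag att det är så underligt",
    "att där vill miljöpartiet bromsa",
)

LINEUP_SIZE = 6


def build_lineups(rng):
    """Return one line-up per (stimulus, presence pattern) combination."""
    lineups = []
    for stimulus in STIMULI:
        # Each key voice is either present or absent: 2 ** 3 = 8 compositions.
        for pattern in itertools.product((True, False), repeat=len(KEY_VOICES)):
            voices = [v for v, present in zip(KEY_VOICES, pattern) if present]
            # Pad to six voices with randomly chosen filler recordings.
            voices += rng.sample(FILLERS, LINEUP_SIZE - len(voices))
            rng.shuffle(voices)  # random order of presentation
            lineups.append({"stimulus": stimulus, "voices": voices})
    return lineups


lineups = build_lineups(random.Random(0))
print(len(lineups))  # 8 compositions x 2 stimuli = 16
```

Under these assumptions, half of the sixteen line-ups contain Carl Bildt’s voice and half do not, which is what allows hits and false alarms to be scored separately.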
The recognition task
The recognition task presented to the participants was identical to Experiment One in Schlichting and Sullivan (forthcoming). The participants heard a one minute recording of the voice they were to identify twice: once before a training block of four line-ups and once before the experimental block of sixteen different line-ups. The participants were told that they would hear a voice which they were to remember and identify in the voice line-ups which followed. After hearing all six voices, they registered their choice by circling the number corresponding to the position of the voice in the line-up sequence or, if they judged the voice to be absent, by marking the not-present option.
Results
The results are presented in Tables 1 and 2. In Table 1 the voices of Carl Bildt and the professional imitation are treated as two separate voices. In Table 2 they are treated as a single voice, i.e. the selection of either voice as the one to be identified was considered to be correct. (Schlichting and Sullivan (forthcoming) demonstrated that native speakers of Swedish were in the worst case misled by the imitation, producing a 100% speaker misidentification rate.) The tables show the percentage of correct and incorrect speaker identifications, along with the percentage of ‘false alarms’ (the identification of a voice when the voice to be identified is absent from the line-up) and ‘hits’ (the correct identification of the voice when present in the line-up).

Table 1. Percentage correct, incorrect, false alarms and hits for the four listener groups. The voices of Carl Bildt and the professional imitation of Carl Bildt are treated as two different voices.

               Group 1   Group 2   Group 3   Group 4
Correct             21        36        38        33
Incorrect           79        64        62        67
False alarms        76        73        75        70
Hits                19        45        50        38

Table 2. Percentage correct, incorrect, false alarms and hits for the four listener groups. The voices of Carl Bildt and the professional imitation of Carl Bildt are treated as one voice.

               Group 1   Group 2   Group 3   Group 4
Correct             22        39        49        43
Incorrect           78        61        51        57
False alarms        73        78        69        60
Hits                20        38        49        43
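
The four measures reported in the tables can be sketched in code as follows. This is one reading of the definitions given in the Results text, not the authors' scoring script: it assumes that hits are expressed as a percentage of the line-ups containing the voice to be identified, that false alarms are a percentage of the line-ups without it, and that under the single-voice treatment a line-up counts as containing the voice if either Carl Bildt or the professional imitation is present. The function name, the voice labels and the four toy responses are invented for illustration and are not data from the study.

```python
def score(trials, targets):
    """Score line-up responses.

    trials  -- list of (voices_in_lineup, choice); choice is a voice label,
               or None when the listener marked the not-present option
    targets -- labels that count as "the voice to be identified"
    """
    targets = set(targets)
    present = [(v, c) for v, c in trials if targets & set(v)]
    absent = [(v, c) for v, c in trials if not targets & set(v)]

    hits = sum(c in targets for _, c in present)         # target chosen when present
    false_alarms = sum(c is not None for _, c in absent)  # any voice chosen when absent
    correct = hits + (len(absent) - false_alarms)         # hits plus correct rejections

    def pct(n, total):
        return 100.0 * n / total if total else float("nan")

    return {
        "correct": pct(correct, len(trials)),
        "incorrect": pct(len(trials) - correct, len(trials)),
        "false_alarms": pct(false_alarms, len(absent)),
        "hits": pct(hits, len(present)),
    }


# Invented toy responses (shortened line-ups, not data from the study).
toy_trials = [
    (["bildt", "a", "b"], "bildt"),
    (["professional_imitation", "a", "b"], "professional_imitation"),
    (["a", "b", "c"], None),
    (["a", "b", "c"], "a"),
]

separate = score(toy_trials, {"bildt"})                            # Table 1 treatment
combined = score(toy_trials, {"bildt", "professional_imitation"})  # Table 2 treatment
print(separate)
print(combined)
```

In the toy data, choosing the professional imitation counts as a false alarm under the separate treatment but as a hit under the combined treatment, which is why the two tables can differ for the same set of responses.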
Discussion
The most striking feature of the data, when the voice of Carl Bildt and the professional imitation of Carl Bildt’s voice are considered separately, is the lack of change in the false alarm rate across the four groups. None of the four groups was able to detect accurately the absence of the voice to be recognized from the line-ups. This is problematic if the voice line-up is to be used as a forensic tool in a second language situation. This lack of change contrasts with the improvement in the hit rate when the group with no knowledge of Swedish (Group 1) is compared with the three groups with a knowledge of Swedish (Groups 2 to 4). However, there is no increase after the initial improvement: the hit rate does not improve over the four years of university Swedish study. This initial increase and the later stability in the hit rate is
reflected in the overall correct and incorrect values. That is, there is only a difference between those with no knowledge of Swedish and those who are studying Swedish at university. There is an apparent immediate gain when beginning Swedish studies, but no gain thereafter over the period of the degree course.

The picture, however, is slightly different when the voices of Carl Bildt and the professional imitation of Carl Bildt are treated as a single voice. There is a continuous reduction in the false alarm rate from Group 2 to Group 4. Groups 1 and 2 report similar values, which differ little from their false alarm rates when the two voices were treated separately. This could be interpreted as the learners improving their ability to identify and discriminate between the voices of the line-up. However, the hit rate does not replicate this trend; Group 4’s hit rate differs little from that of Group 2. Yet the improvement from no knowledge of Swedish to some knowledge is repeated. This is also reflected in the overall correct and incorrect scores.

It can be concluded that although the stimuli in the line-up were only circa 2 seconds long, there was enough linguistic information available for a distinction to be drawn between those with no knowledge of Swedish and those with some knowledge of Swedish. No unambiguous improvement in speaker identification ability over the duration of the four-year degree programmes was detected.

Acknowledgements
We thank the staff and students of North Chadderton High School, Oldham, and those lecturing and studying Swedish at University College London, the University of Surrey, Guildford and the University of Wales, Lampeter for the cheerful and co-operative manner in which they participated in this experiment. Particular thanks are due to those who helped produce the speech materials, without which this experiment could not have been undertaken.

References
Clifford B.R. 1983. Memory for voices: The feasibility and quality of earwitness evidence. In Lloyd-Bostock S.M.A. and Clifford B.R. (eds) Evaluating Witness Evidence, 189–218. Chichester: Wiley.
Goggin J.P., Thompson C.P., Strube G. and Simental L.R. 1991. The role of language familiarity in voice identification. Memory and Cognition, 19, 448–458.
Goldstein A.G., Knight P., Bailis K. and Conover J. 1981. Recognition memory for accented and unaccented voices. Bulletin of the Psychonomic Society, 17, 217–220.
Köster O. and Schiller N.O. Forthcoming. Different influences of the native language of a listener on speaker recognition. Forensic Linguistics: The International Journal of Speech, Language and the Law.
Köster O., Schiller N.O. and Künzel H.J. 1995. The influence of native-language background on speaker recognition. In Elenius K. and Branderud P. (eds) Proceedings of the Thirteenth International Congress of Phonetic Sciences, Stockholm, Sweden, 4, 306–309.
Roebuck R. and Wilding J. 1993. Effects of vowel variety and sample length on identification of a speaker in a line-up. Applied Cognitive Psychology, 7, 475–481.
Schiller N.O., Köster O. and Duckworth M. Forthcoming. The effect of removing linguistic information upon identifying speakers of a foreign language. Forensic Linguistics: The International Journal of Speech, Language and the Law.
Schlichting F. and Sullivan K.P.H. Forthcoming. The imitated voice – a problem for voice line-ups? Forensic Linguistics: The International Journal of Speech, Language and the Law.
Thompson C.P. 1987. A language effect in voice identification. Applied Cognitive Psychology, 1, 121–131.