User Authentication Scheme Using Individual Auditory Pop-Out

Kotaro Sonoda and Osamu Takizawa

Information Security Research Center, National Institute of Information and Communications Technology, 4-2-1 Nukui-Kita, Koganei, Tokyo, 184-8795, Japan
Abstract. This paper presents a user authentication scheme that takes advantage of the uniqueness of the prover's auditory characteristics. In this scheme, several audio stimuli are presented simultaneously over headphones. The stimuli include a special audio stimulus that only the genuine prover can easily distinguish from the others, owing to the uniqueness of his auditory characteristics. The prover is asked to answer the content of the special stimulus, and the verifier accepts a correct answer as proof of the genuine prover. As the special audio stimulus by which genuineness is confirmed, auto-phonic production is examined in this paper. The advantage of this scheme is that the prover need not memorize a complex sequence, as in password authentication. Moreover, the Personal Authentication Information (PAI) cannot be stolen, because the PAI in this scheme is the prover's personal auditory memory and receptors.
1 Introduction

With the growing number of services for individual customers, people must manage many passwords, and keep them complex and fresh, under conventional password authentication. Biometric authentication removes this password-management burden: its Personal Authentication Information (PAI) is the prover's unique biometrics, which an imposter cannot obtain easily. However, in biometric authentication the prover exposes his biometrics at all times, so they may be scanned even when the owner has no intention of being authenticated. Although current systems require the body part to be placed close to the scanner because of scanning resolution, the threat of replaying PAI from stolen or scanned biometrics is likely to grow in the future. How, then, do we recognize another person in interpersonal contact? Three strategies can be considered:

1. ask the other for a shared secret word;
2. observe the other's face, voice, or movements;
3. bring up some shared story and test the reaction.
The first and second strategies correspond to password and biometric authentication, respectively. Human-reflex-response authentication and mnemonic authentication correspond to the third strategy: the verifier presents stimuli that are shared with the genuine prover and can be recognized only by him, and then requires the prover's responses; correct responses prove genuineness. CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) [1] Gimpy [2] can be placed in this third category, although it distinguishes humans from machines: Gimpy presents the prover a sequence of distorted characters obscured by noise, which a real human recognizes correctly while a computer posing as a human cannot. Nishigaki et al. proposed a reflex-response-based user authentication scheme using the blind spot and saccade response time [3]. The verifier presents a visual target on the screen several times; if the target falls inside the genuine prover's blind spot, no saccade is observed in his response, whereas a saccade is observed in a spoof prover's response. The verifier thus distinguishes the genuineness of the prover by the presence or absence of saccades.

We propose a response-based user authentication scheme using individual differences in auditory pop-out. In this scheme, several audio stimuli are presented simultaneously over headphones. The stimuli include a special stimulus that only the genuine prover can easily distinguish: it pops out from the others for the genuine prover, while it blends into the others for spoof provers. The prover is asked to answer the content of the special stimulus, and the verifier accepts a correct answer as proof of the genuine prover. As such a special audio stimulus by which the verifier confirms genuineness, auto-phonic production is examined in this paper.
2 Authentication Protocol

The proposed authentication protocol proceeds as follows (Fig. 1):

Registration: The prover (P) who wants to be authenticated submits to the verifier (V) a set of stimuli (X) that can be perceived correctly only by himself.

Authentication: V presents stimuli s to an unknown prover (P′). Each s is made by mixing a stimulus x_i (∈ X) with several dummy stimuli d_i.
1. V requires P′ to answer the content of the stimulus x_i.
2. P′ answers the content. (If P′ is P, he can answer correctly.)
3. Steps 1 and 2 are repeated.

Verification: V accepts P′ as P if the responses meet the necessary requirements.

During the authentication phase, the audio stimuli are presented over headphones and located virtually in several directions by convolution with generic head-related transfer functions (HRTFs) for each direction.
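For concreteness, the following Python sketch (our illustration, not the authors' implementation) shows how a verifier could assemble one authentication round: each stimulus is convolved with the head-related impulse response (HRIR) pair of its assigned virtual direction, and the results are summed into one binaural signal. The HRIR arrays and the function names are placeholders standing in for a generic HRTF set.

    import numpy as np
    from scipy.signal import fftconvolve

    def spatialize(mono, hrir_left, hrir_right):
        # Place a mono stimulus at one virtual direction by convolving it
        # with that direction's left/right head-related impulse responses.
        return np.stack([fftconvolve(mono, hrir_left),
                         fftconvolve(mono, hrir_right)], axis=1)

    def make_challenge(stimuli, hrirs):
        # stimuli: target x_i plus dummies d_i, one 1-D array per direction,
        #          pre-normalized to equal standard deviation
        # hrirs:   one (hrir_left, hrir_right) pair per direction
        length = max(len(s) + len(hl) - 1
                     for s, (hl, hr) in zip(stimuli, hrirs))
        mix = np.zeros((length, 2))
        for s, (hl, hr) in zip(stimuli, hrirs):
            binaural = spatialize(s, hl, hr)
            mix[:len(binaural)] += binaural
        return mix / np.abs(mix).max()   # rescale to avoid clipping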
Fig. 1. Reflex-response-based user authentication scheme based on auditory characteristics: (a) registration; (b) authentication
3 Auto-phonic Production

Auto-phonic production is the speech sound that a talker himself perceives while speaking. As shown in Fig. 2, when a person speaks, the voice reaches strangers by transmission through the air, whereas the voice principal hears his own voice through both the air and his bones and body. The voice that a stranger recognizes as the principal's should therefore differ from the auto-phonic production, and the auto-phonic production should be the principal's unique self-voice. It is thus expected that the auto-phonic production yields different responses from the principal and from strangers.
Fig. 2. Auto-phonic production: the principal hears his own voice through both body conduction and air conduction, whereas a stranger hears only the air-conducted voice
Nakayama studied the characteristics of auto-phonic production by having talkers adjust the amplitudes of several frequency components of an air-conducted voice, recorded near the talker's mouth, until the heard voice approximated the auto-phonic production [4]. He found that the vocal sound is perceived relatively louder in the low-frequency region (about +5 dB at 100 Hz) and softer in the high-frequency region (about −5 dB at 4 kHz) than the air-conducted voice.

3.1 Experiment I: Familiarity with Auto-phonic Production
First, to confirm that the voice imagined as the talker's differs between the voice principal and strangers, we carried out an auditory experiment. Subjects were asked to rate, on a five-grade scale, the distance between the voice they imagined as the talker's and each of four stimuli: the air-conducted voice (O), the auto-phonic production (A), a low-boosted voice (L), and a high-boosted voice (H). Stimuli A, L, and H were generated by applying the frequency-weighting functions shown in Fig. 3 to the recorded air-conducted voice. If the listener is the talker (principal), the voice he usually hears should be the auto-phonic production; otherwise (strangers), it should be the air-conducted voice.
Fig. 3. Frequency-weighting functions applied to the recorded air-conducted voice O: stimulus A (auto-phonic production) boosts low frequencies and attenuates high ones, stimulus L is low-boosted, and stimulus H is high-boosted (amplifier levels within ±10 dB, 100 Hz–4 kHz on a logarithmic frequency axis)
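As an illustration of how such weighted stimuli could be generated (a sketch under our own assumptions; the paper gives only the anchor gains quoted above, not the full curves), the recorded voice can be weighted in the frequency domain, interpolating the gain on a log-frequency axis:

    import numpy as np

    def apply_weighting(x, fs, f_anchor, gain_db):
        # Apply a frequency-dependent gain curve to signal x (sampled at fs).
        # Gains are given in dB at anchor frequencies and interpolated
        # linearly on a log-frequency axis, as in Fig. 3.
        X = np.fft.rfft(x)
        f = np.fft.rfftfreq(len(x), d=1.0 / fs)
        log_f = np.log10(np.maximum(f, 1.0))          # avoid log(0) at DC
        g_db = np.interp(log_f, np.log10(f_anchor), gain_db)
        return np.fft.irfft(X * 10.0 ** (g_db / 20.0), n=len(x))

    # Stimulus A (auto-phonic approximation): +5 dB at 100 Hz, -5 dB at 4 kHz
    # stim_a = apply_weighting(voice, fs, [100.0, 4000.0], [+5.0, -5.0])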
Fig. 4. Familiarity, as difference grades from the air-conducted voice: (a) vs. voice of talker 1; (b) vs. voice of talker 2; (c) vs. voice of talker 3
The stimulus words are Japanese four-mora words taken from a list with balanced phonemes and high familiarity [5]. The results are shown in Fig. 4 as difference grades from the air-conducted voice (O). A high score indicates that the target stimulus was more familiar than the
air-conducted voice, and a score difference of one or more between target stimuli indicates that those stimuli can be discriminated. Figure 4(a) shows that subject listener 1, the voice principal (talker), can distinguish the auto-phonic production from the other stimuli, whereas the other listeners, the voice strangers, give similar scores to the auto-phonic production and the high-boosted voice. The auto-phonic production is therefore expected to be usable for authenticating a subject.
4 Auditory Search on Simultaneous Multiple Audio Stimuli

In our authentication scheme, the presented auto-phonic production stimulus has to be sufficiently distracted by the other stimuli presented at the same time. Human recognition of simultaneously presented speech has been studied by several researchers. Kashino et al. carried out a multi-talker recognition test using simultaneously presented Japanese four-mora words, measuring how many talkers, and how many word contents, subjects could recognize [6]. They reported the following:

Recognition of the number of talkers: With stimuli constructed from up to two talkers, subjects could judge the number of talkers almost perfectly. With three or more talkers, however, they tended to underestimate the number.

Recognition of the number of words: Subjects could report at most about two words, regardless of the number of talkers.

Bronkhorst et al. examined word intelligibility and talker recognition of a target voice under various presentation methods: monaural, binaural, and three-dimensional auditory presentation [7]. They concluded:

1. There is no difference in performance between a 3D auditory display based on individualized HRTFs and one based on generic HRTFs. This holds for all scores assessed: speech intelligibility, talker recognition (including the time required for recognition), and talker localization. Hence no individual adaptation of a band-limited (4-kHz) communication system is needed in a practical application of an auditory display with many users.

2. Compared to conventional monaural and binaural presentation, 3D presentation yields better speech intelligibility with two or more competing talkers, in particular for sentence intelligibility. Performance with 3D presentation equals that with binaural presentation when one talker is added, and that with monaural presentation when two or three talkers are added. However, in specific conditions (all competing talkers on the side opposite the target talker), binaural presentation may be superior to 3D. Within the 3D configurations examined, intelligibility is highest when the target talker is at −45° or +45° azimuth.
3. Talker-recognition scores are higher for 3D than for monaural and binaural presentation, but the differences are small. Recognition scores depend less on the number of competing talkers than intelligibility scores do. The virtual positions of the talkers in 3D are not a relevant factor.

4. For binaural and 3D presentation, the time required to correctly recognize a talker increases with the number of competing talkers. For two or more competing talkers, 3D presentation requires significantly less time than binaural presentation.

5. Absolute localization of a talker is relatively poor and becomes gradually more difficult as the number of competing talkers increases.

These studies concern multi-talker conditions, whereas our scheme uses a single talker, so the target stimulus in our scheme should be distracted more easily than in the multi-talker condition. We expected four or five locations to be the number at which individual differences in the responses appear.

4.1 Experiment II: Mixed Multi-Stimuli Task
To confirm whether the ability to find the auto-phonic production among multiple distracting audio stimuli differs between its talker and other listeners, an auditory experiment was carried out. Subjects were tasked to find the target talker's auto-phonic production among multiple air-conducted recordings of the same talker. A subject acts as the genuine prover when the stimuli are his own speech, and as a spoof prover otherwise. The number of directions (stimuli) was four or five, one of which was the auto-phonic production; this number was not told to the subjects.

In this experiment, we improved the method of generating the auto-phonic production stimulus over that of Experiment I (Sect. 3.1). Considering the propagation paths of the auto-phonic production, hearing of the bone- or muscle-conducted component is unaffected by the mouth aperture, whereas the air-conducted component is affected; a simple adjustment of the frequency response may therefore vary widely with the phonetic features of the utterance. In the experiment below, the auto-phonic production stimulus was instead generated by mixing the air- and muscle-conducted voices in a proportion that the talker tuned, word by word, until the mixture sounded like his auto-phonic production, as shown in Fig. 5. The measured proportions indeed varied widely across words and talkers.

The source stimuli are Japanese four-mora words picked from the group categorized as most familiar in a Japanese word-familiarity database; we selected 25 such words. Two young Japanese men, acquainted with each other, took part in the experiment. The stimuli were presented from the directions −90, −45, 0, +45, and +90 degrees in the five-stimulus condition, and −90, −30, +30, and +90 degrees in the four-stimulus condition, as shown in Fig. 6. The amplitudes of all presented stimuli were normalized by equalizing their standard deviations. Table 1 shows the rates of correct, wrong, and not-presented answers in the five- and four-stimulus conditions.
Fig. 5. Generation of the auto-phonic production stimulus: the air-conducted and body-conducted recordings are mixed, with α the amplitude of the air-conducted component in the mix
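A minimal sketch of this generation step, assuming the mixture is a simple weighted sum (the exact mixing rule is our assumption; as stated above, α was tuned by the talker per word until the mixture sounded like his auto-phonic production):

    import numpy as np

    def autophonic_stimulus(air, body, alpha):
        # Mix air- and body-conducted recordings of the same utterance.
        # alpha: talker-tuned amplitude of the air-conducted component
        #        (varies per word and per talker, as noted above).
        n = min(len(air), len(body))      # assume the recordings are aligned
        mix = alpha * air[:n] + body[:n]
        return mix / mix.std()            # equalize standard deviations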
Fig. 6. Presented directions of the stimuli (located virtually): five-stimulus condition with 45-degree spacing and four-stimulus condition with 60-degree spacing; one direction carries the auto-phonic production, the remaining directions air-conducted voices
Table 1. Answer rates (%) of correct, wrong, and not-presented words

                                      Five directions             Four directions
                 Talker  Listener  Correct  Wrong  Not-pres.  Correct  Wrong  Not-pres.
Genuine prover     01       01        4.0    66.0     30.0      16.7    75.0      8.3
                   02       02        8.0    74.0     18.0      10.4    64.6     25.0
                 total                6.0    70.0     24.0      13.6    69.8     16.7
Spoof prover       01       02       16.0    40.0     44.0      25.0    47.9     27.1
                   02       01        6.0    60.0     34.0       6.3    77.0     16.7
                 total               11.0    50.0     39.0      15.6    62.5     21.9
Unexpectedly, neither the genuine prover (listener equals talker) nor the spoof prover (listener spoofing the talker) could reliably find the target word among the distracting words: their correct answer rates were below the chance rates (25.0 % and 20.0 % in the four- and five-stimulus conditions, respectively). Moreover, no difference in answer rates between the genuine and spoof provers was observed this time. We should study conditions with fewer stimuli and other combinations of stimuli.
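For reference, the chance rates above are simply 1/n for n presented directions (1/4 = 25.0 %, 1/5 = 20.0 %). The paper does not specify the verifier's decision rule; one hypothetical choice is to accept only when the number of correct answers over repeated rounds is improbable under chance-level guessing:

    from math import comb

    def accept(correct, rounds, p_chance=0.2, alpha=0.01):
        # Accept the prover if scoring `correct` out of `rounds` is
        # sufficiently unlikely under chance guessing (binomial tail).
        # p_chance = 1/n for n directions; alpha is an illustrative level.
        p_value = sum(comb(rounds, k) * p_chance ** k
                      * (1.0 - p_chance) ** (rounds - k)
                      for k in range(correct, rounds + 1))
        return p_value < alpha

    # e.g. accept(7, 10) is True: 7 of 10 correct at chance rate 0.2
    # has a p-value below 0.001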
5 Conclusion

In this paper, we proposed a new approach to user authentication using distinctive auditory characteristics. The scheme exploits the possibility that humans have individual sensitivities to certain kinds of audio stimuli, much like individual biometrics. Auto-phonic production speech was adopted as such an inducing stimulus. The first experiment indicated a difference between the voice principal and strangers in sensitivity to the auto-phonic production voice versus the air-conducted voice. In the second experiment, we implemented a prototype authentication system using our scheme; so far, however, its feasibility has not been demonstrated experimentally. As future work, we plan to measure the response times needed to identify the target audio stimulus among the distracting stimuli, where asymmetries between the genuine prover and imposters are expected to exist.
Acknowledgement

This study was supported in part by a Grant-in-Aid for Young Scientists (B) (#19700123) from the Japanese Ministry of Education and Science.
References

1. von Ahn, L., Blum, M., Hopper, N., Langford, J.: CAPTCHA: Using Hard AI Problems for Security. In: Biham, E. (ed.) EUROCRYPT 2003. LNCS, vol. 2656, pp. 294–311. Springer, Heidelberg (2003)
2. The CAPTCHA Project: Gimpy, http://www.captcha.net/chaptchas/gimpy/
3. Nishigaki, M., Arai, D.: A User Authentication Using Human Reflex. Transactions of Information Processing Society of Japan 47(8), 2582–2593 (2006)
4. Nakayama, I.: Voice timbre in autophonic production compared with that in extraphonic production. Journal of the Acoustical Society of Japan (E) 18(2), 67–71 (1997)
5. Sakamoto, S., Suzuki, Y., Amano, S., Ozawa, K., Kondo, K., Sone, T.: New lists for word intelligibility test based on word familiarity and phonetic balance. Journal of the Acoustical Society of Japan 54(12), 842–849 (1998)
6. Kashino, M., Hirahara, T.: Judging the number of concurrent talkers for sentence stimuli. In: Proceedings of the Spring Meeting of the Acoustical Society of Japan, III-3-3 (1997) (in Japanese)
7. Drullman, R., Bronkhorst, A.W.: Multichannel speech intelligibility and talker recognition using monaural, binaural, and three-dimensional auditory presentation. Journal of the Acoustical Society of America 107(4), 2224–2235 (2000)