Speaker Identification using Power Distribution in ... - Semantic Scholar

1 downloads 48 Views 331KB Size Report
MPSTME, NMIMS, Mumbai - 400056 ..... (System Identification) from IIT Bombay in. 1970. He has ... Engineering, SVKM's NMIMS University, Vile. Parle(w) ...
Journal of Sci., Engg. & Tech. Mgt. Vol 2 (1), January 2010

Speaker Identification using Power Distribution in Frequency Spectrum Dr. H. B. Kekre1, Vaishali Kulkarni2 Senior Professor, Department of Computer Engineering 2 Assistant Professor, Department of Electronics MPSTME, NMIMS, Mumbai - 400056 Email: [email protected], [email protected] 1

Abstract This paper presents a brief overview of the Speaker recognition process, its trends and applications. Further a simple technique based on the Euclidean distance comparison is proposed. The technique is applied for both text-dependent as well as text independent identification. Text dependent identification gives excellent results whereas text independent identification gives almost 80% matching accuracy. Keywords: Speaker recognition, speaker Identification, Speaker Verification, Spectrogram, Euclidean distance Introduction Human beings can identify a speaker based on his voice with a fairly good precision. With a large number of applications like voice dialing, phone banking, teleshopping, database access services, information services, voice mail, security systems and remote access to computers etc., the automated systems need to perform as well or even better, than humans. A lot of work in this regard has been done. But still there is lack of understanding of the characteristics of the speech signal that can uniquely identify a speaker. The speech signal gives various levels of information. Firstly it conveys the words or message being spoken; also on a secondary level it gives us information about the identity of the speaker. The goal of speaker recognition is to extract the identity of the person speaking. Speaker Recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech signals. It can be divided into Speaker Identification and Speaker Verification. Speaker identification determines which registered speaker provides a given utterance from amongst a set of known speakers (also known as closed set identification). Speaker verification accepts or rejects the identity claim of a speaker (also known as open set identification). Speaker identification task can be further classified into text-dependent or textindependent task. In the former case, the utterance presented to the system is known beforehand.

Fig.1. Classification of Speaker Recognition systems. In the latter case, no assumption about the text being spoken is made, but the system must model the general underlying properties of the speaker’s vocal spectrum. In general, textdependent systems are more accurate, since both the content and voice can be compared. Work on automatic Speaker recognition started in the 1960’s. Pruzansky at Bell Labs [1] was among the first to initiate the research using filter banks and correlating the two digital spectrograms for a similarity measure. Doddington at Texas instruments [2] replaced filter banks by formant analysis. For text independent methods, various parameters were extracted by averaging over a long enough duration or by extracting statistical or predictive parameters like averaged auto-correlation [3], instantaneous spectra covariance matrix [4],

43

Journal of Sci., Engg. & Tech. Mgt. Vol 2 (1), January 2010

spectrum and fundamental frequency histograms [5], linear prediction coefficients [6] and long term averaged spectra [7]. As the performance of text-independent systems was limited various text-dependent methods, [8, 9] were also implanted in the 1970’s. Hidden Markov Model (HMM) and Vector Quantization based methods were developed in the 1980’s. Text dependent speaker recognition systems based on HMM architecture generally used multi word phrases for the training phase and stored the models for the entire phrase [10]. VQ/HMM based method was developed for text-independent identification. A set of short-time training feature vectors of a speaker can be efficiently compressed to a small set of representative points, a VQ codebook [11, 12 and 21]. Rose et al. [13] proposed a single-state HMM, which is now called Gaussian mixture model (GMM), as a robust parametric model. In the 1990’s robust text-prompted methods were developed. Matsui et al. [14] proposed a text-prompted speaker recognition method, in which key sentences are completely changed every time the system is used. For reducing the intra-speaker variation, likelihood ratio and posteriori probability-based techniques were investigated [15, 16, and 17]. Methods based on score normalization have been recently introduced in the 2000’s [18].Various high level features like word idiolect; pronunciations, phone usage, prosody, etc. have been successfully used in textindependent speaker verification [19]. Recognition systems have been developed for a wide range of applications. Although many new techniques were invented and developed, there are still a number of practical limitations because of which widespread deployment of applications and services is not possible. Still it is very true that humans can recognize speech and speaker more efficiently than machines [20]. There is now an increasing interest in finding ways to reduce this performance gap.

Fig 2 Speaker identification system the properties of speech that can separate different speakers. Front-end processing is performed both in training- and recognition phases. Speaker modeling - this part performs a reduction of feature data by modeling the distributions of the feature vectors. Speaker database - the speaker models are stored here. Decision logic - makes the final decision about the identity of the speaker by comparing unknown feature vectors to all models in the database and selecting the best matching model. Basics of Speech signal The speech samples used in this work are recorded using Sound Forge 4.5. The sampling frequency is 8000 Hz (8 bit, mono PCM samples). Table 1 shows the database description. All the samples are scaled to the same time scale. The samples are collected from different speakers. Samples are taken from each speaker in two sessions so that training model and testing data can be created. Also 4 different texts are recorded so that both text-independent and text-dependent speaker identification can be done. Fig 3 shows the speech signal for sample 1. Fig 4 shows the different features of the sample 1 speech signal like spectrogram, intensity, pitch and formants.

The recognition Process The Speaker identification system is composed of the following modules: Front-end processing - the "signal processing" part, which converts the sampled speech signal into set of feature vectors, which characterize

44

Journal of Sci., Engg. & Tech. Mgt. Vol 2 (1), January 2010

Table 1 Database Description processed and matched with the feature vectors stored in the database. The stored feature vector which gives the minimum Euclidean distance with the input sample feature vector is declared as the match (speaker identified). Fig 5 (a) shows the FFT of the sample 1 shown in Fig 3. Fig 5 (b) shows the feature vector for sample 1obtained by grouping 250 samples and selecting first 16 as feature vectors exploiting the symmetry of DFT.

Parameter

Sample characteristics Language English No. of Speakers 26 Speech type Read speech Recording Normal. (A silent conditions room) Sampling frequency 8000 Hz Resolution 8 bps Training speech 6 sec, 8 sec (scaled) Evaluation speech 6 sec, 8 sec (scaled)

Fig 3 Speech signal for the sample1

Fig 5 (a) FFT of sample 1 (b) sum of the magnitudes of FFT for 16 feature vectors

Experimental For creating the database, all the time scaled samples of either 6 sec or 8 sec are considered. The FFT of the samples is found and the sum of the magnitude of FFT for different groupings is found. This formed the feature vector. The feature vectors were stored in the database. For the identification of the speaker, the input sample is similarly

Results and Discussions In this section the results obtained by applying the technique discussed in the previous section on a sample set of 26 speakers of varying age groups (between 10 and 70 years of age) is presented for both text-dependent and textindependent speaker identification. Fig 6 shows the curves obtained for textdependent as well as text-independent identification by varying the number of feature vectors for a sample set of 26 speakers. As seen from the figure, for text-dependent samples, about 80 feature vectors are sufficient to get 100% accuracy. But with text-independent speech, the maximum accuracy is about 84.61%. Fig 7 shows the curve obtained for textindependent identification by varying the number of speakers in the database. It can be seen that as the number of speakers increases the accuracy decreases as expected but it is still above 80%.

Fig 4 Various parameters of sample 1. Spectrogram, pitch, intensity, formants

45

Journal of Sci., Engg. & Tech. Mgt. Vol 2 (1), January 2010

[5]B. Beek, et. al., “Automatic speaker recognition system”, Rome Air Development Center Report, 1971. [6] M.R.Sambur, Speaker recognition and verification using linear prediction analysis, Ph. D. Dissert., M.I.T., 1972. [7] S. Furui, et. al., “Talker recognition by long time averaged speech spectrum”, Electronics and Communications in Japan, 55-A. pp. 54-61, 1972. [8]S. Furui, “Cepstral analysis technique for automatic speaker verification”, IEEE Trans. Acoustic, Speech, Signal Processing, ASSP-29, pp. 254-272, 1981. [9]A. E. Rosenberg and M. R. Sambur, “New Techniques for automatic speaker verification”, IEEE Trans. Acoustics, Speech, Signal Proc., ASSP-23, 2, pp. 169-176, 1975. [10] J. M. Naik, et. al., “Speaker verification over long distance telephone lines”, Proc. ICASSP, pp.524-527, 1989. [11]F. K. Soong, et. al., “A vector quantization approach to speaker recognition”, At & T Technical Journal, 66, pp. 14-26, 1987. [12] A. E. Rosenberg and F. K. Soong, “Evaluation of a vector quantization talker recognition system in text independent and text dependent models”, Computer Speech and Language 22, pp. 143-157, 1987. [13] R. Rose and R. A. Reynolds, “Text independent speaker identification using automatic acoustic segmentation”, Proc. ICASSP, pp. 293-296, 1990. [14]T. Matsui and S. Furui, “Concatenated phoneme models for text variable speaker recognition”, Proc. ICASSP, pp. II-391-394, 1993. [15] A. Higgins, et. al., “Speaker verification using randomized phrase prompting”, Digital Signal Processing, 1, pp. 89-106, 1991. [16]T. Matsui and S. Furui, “Similarity normalization method for speaker verification based on a posteriori probability”, Proc. ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 59-62, 1994. [17] D. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models”, Proc. ESCA Workshop on Automatic Speaker recognition, Identification and verification, pp. 27-30, 1994.

Accuracy %

matching percentage 100 80 60 40 20 0

text dependent text independent

16

20

40

80

200

400

no. of feature vectors

Fig 6 Variation in the number of feature vectors

Accuracy (%)

Variation in no. of speakers 120 100 80 60 40 20 0

Variation in no. of speakers

8

12

16

20

24

26

No. of speakers

Fig 7 Variation in the number of speakers Conclusion A very simple technique based on power distribution in frequency spectrum has been introduced. This technique gives very good results for both text-dependent and textindependent systems. The results also show that accuracy increases as the number of feature vectors in the database for each sample increases. The present study is still ongoing, which may involve different techniques to find the feature vectors and their comparison. References [1] S. Pruzansky, “Pattern-matching procedure for automatic talker recognition”, J.A.S.A., 35, pp. 354-358, 1963. [2]G.R.Doddington, “A method of speaker verification”, J.A.S.A., 49,139 (A), 1971. [3] P.D. Bricker, et. al., ”Statistical techniques for talker identification”, B.S.T.J., 50, pp. 14271454, 1971. [4]K.P.Li, et. al., “Experimental studies in speaker verification using a adaptive system”, J.A.S.A., 40, pp. 966-978, 1966.

46

Journal of Sci., Engg. & Tech. Mgt. Vol 2 (1), January 2010

[18] F. J. Bimbot, et. al., “A tutorial on textindependent speaker verification”, EURASIP Journ. on Applied Signal Processing, pp. 430451, 2004. [19] G.R. Doddington, “Speaker recognition based on idiolectal differences between speakers”, Proc. Eurospeech, pp. 2521-2524, 2001. [20] S Furui, “50 years of progress in speech and speaker recognition research”, ECTI Transactions on Computer and Information Technology, Vol. 1, No.2, November 2005. [21] Marco Grimaldi and Fred Cummins, “Speaker Identification using Instantaneous Frequencies”, IEEE Transactions on Audio, Speech, and Language Processing, vol., 16, no. 6, August 2008. [21] H. B. Kekre, Tanuja K. Sarode, “Speech Data Compression using Vector Quantization”, WASET International Journal of Computer and Information Science and Engineering (IJCISE), Fall 2008, Volume 2, Number 4, pp.: 251-254, 2008. http://www.waset.org/ijcise.

working under his guidance have received best paper awards. Currently he is guiding ten Ph.D. students. Vaishali Kulkarni has received B.E in Electronics Engg. from Mumbai University in 1997, M.Tech (Electronics and Telecom) from Mumbai University in 2006. Presently she is pursuing Ph. D from NMIMS University. She has a tezching experience of more than 7 years. She is Assistant Professor in telecom Department in MPSTME, NMIMS University. Her area of interest include Speech processing: Speech and Speaker Recognition

Author Biographies Dr. H. B. Kekre has received B.E. (Hons.) in Telecomm. Engg. from Jabalpur University in 1958, M.Tech (Industrial Electronics) from IIT Bombay in 1960, M.S.Engg. (Electrical Engg.) from University of Ottawa in 1965 and Ph.D. (System Identification) from IIT Bombay in 1970. He has worked Over 35 years as Faculty of Electrical Engineering and then HOD Computer Science and Engg. at IIT Bombay. For last 13 years worked as a Professor in Department of Computer Engg. at Thadomal Shahani Engineering College, Mumbai. He is currently Senior Professor working with Mukesh Patel School of Technology Management and Engineering, SVKM’s NMIMS University, Vile Parle(w), Mumbai, INDIA. He has guided 17 Ph.D.s, 150 M.E./M.Tech Projects and several B.E./B.Tech Projects. His areas of interest are Digital Signal processing, Image Processing and Computer Networks. He has more than 250 papers in National / International Conferences / Journals to his credit. Recently six students

47

Suggest Documents