2006 International Joint Conference on Neural Networks Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada July 16-21, 2006
Improving Speaker Identification Rate Using Fractals
Fulufhelo V. Nelwamondo, Unathi Mahola and Tshilidzi Marwala, Member, IEEE
Abstract: This paper reports on a text-dependent speaker identification system that combines Mel-frequency cepstral coefficients with non-linear turbulence information extracted using the Multi-Scale Fractal Dimension (MFD). The MFD is estimated using the Box-Counting and Minkowski-Bouligand dimensions. The proposed framework is implemented in conjunction with a sub-band based speaker identification system. Results show that the proposed framework with Box-Counting feature extraction improves the identification rate of the classical wideband approach by up to 10%. It is further observed that the proposed framework yields an improved Bhattacharyya distance between the speech distributions of impostors and speakers.
I. INTRODUCTION
The objective of a speaker recognition task is to identify the person who gave an utterance. Speaker recognition is sub-divided into speaker identification and speaker verification. Speaker Identification (SID) tries to determine which speaker generated a speech signal, whereas speaker verification confirms whether a portion of speech belongs to the individual who claims it. Speaker identification can be further sub-divided into open-set and closed-set identification [1]. In the closed-set case, a test speaker is assigned the identity of the speaker whose model best fits the test utterance. In the open-set case, an additional test checks whether the matching score is above a predefined threshold, and the decision is accepted or rejected based on this threshold. This paper focuses on open-set speaker identification.
Speech has non-linear and non-stationary characteristics. Experimental evidence indicates that the speech signal is produced by a non-linear dynamical system that often generates a small or large degree of turbulence [2]. This paper studies the non-linear behavior of speech using the fractal dimension. The effect of using these features with the sub-band based non-linear speaker modeling technique, rather than with conventional wideband speaker modeling, has not been previously investigated. For this reason, this paper implements and reports on a speaker identification system using this combination.
The two major issues that arise in the design of an SID system are feature extraction and speaker modeling. Firstly,
Manuscript received November 29, 2005. F. V. Nelwamondo is with the School of Electrical and Information Engineering at the University of the Witwatersrand, Johannesburg, South Africa (e-mail:
[email protected]). U. Mahola is with the School of Electrical and Information Engineering at the University of the Witwatersrand, Johannesburg, South Africa (e-mail:
[email protected]). T. Marwala is a professor of Electrical Engineering at the School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, South Africa (phone: +27 11 717 7217; fax: +27 11 403 1929; e-mail:
[email protected]).
0-7803-9490-9/06/$20.00/©2006 IEEE
extraction of speaker-dependent features from the speech signal is the fundamental problem of a speaker recognition system. The standard methodology for solving this problem uses Linear Prediction Coding (LPC) and Mel-Frequency Cepstral Coefficients (MFCC) [3]. Both of these methods are linear procedures based on the assumption that speaker features have properties caused by the vocal tract resonances. These features form the basic spectral structure of the speech signal [4]. However, the non-linear information in speech signals is not easily extracted by the present feature extraction methodologies. Previous studies have identified evidence of non-linear behavior of speech sound in the form of vocal-fold non-plane wave propagation, higher-order statistics and, most importantly, turbulent airflow [2]. As a result, there has been growing interest in non-linear feature extraction methodologies that can be applied to speaker recognition. For example, a theoretical framework was previously developed for measuring non-linear speech turbulence using the multi-scale fractal dimension [5]. Despite growing theoretical support for the importance of non-linear acoustic features, few speaker identification systems have exploited the non-linear behavior of the speech signal in detail, owing to the high computational complexity involved [4]. This paper investigates and implements a speaker identification system using both traditional MFCC and non-linear multi-scale fractal dimension feature extraction.
Modeling techniques such as Gaussian Mixture Models (GMM), Artificial Neural Networks (ANN) and Hidden Markov Models (HMM) have been used by conventional speaker identification systems with varying success. Fletcher [6] suggested that linguistic messages in humans are decoded independently in different frequency bands and that the final decision is based on combining the decisions from the sub-bands.
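To make the sub-band idea concrete, the following is a minimal Python sketch, not the implementation used in this paper. It assumes that each sub-band recognizer returns a log-likelihood for every enrolled speaker and that the sub-band decisions are combined by summing these log-likelihoods. The fusion rule, the function names and the example scores are all illustrative assumptions. An optional threshold mimics the open-set accept/reject test described earlier.

```python
def fuse_subband_scores(band_log_likelihoods):
    """Combine per-band scores into one score per speaker.

    Assumption: fusion is a plain sum of log-likelihoods across sub-bands.
    The paper's text does not specify the combination rule at this point.
    """
    return {spk: sum(scores) for spk, scores in band_log_likelihoods.items()}


def identify(band_log_likelihoods, threshold=None):
    """Return the best-matching speaker, or None if rejected (open-set)."""
    fused = fuse_subband_scores(band_log_likelihoods)
    # Closed-set decision: the speaker whose models best fit the utterance.
    best = max(fused, key=fused.get)
    # Open-set decision: additionally require the score to reach a threshold.
    if threshold is not None and fused[best] < threshold:
        return None
    return best


# Hypothetical scores: four sub-bands, two enrolled speakers.
scores = {
    "speaker_a": [-10.0, -12.5, -11.0, -9.5],
    "speaker_b": [-14.0, -13.0, -15.5, -12.0],
}
closed_set_result = identify(scores)
open_set_result = identify(scores, threshold=-40.0)
```

In this sketch the closed-set call always names an enrolled speaker, whereas the open-set call may reject the utterance when even the best fused score falls below the threshold.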
Fletcher's work marked the beginning of sub-band analysis of speech signals. The sub-band speaker modeling technique has proved to be more effective than the conventional wideband approach and, for this reason, non-linear feature extraction combined with sub-band based speaker modeling is investigated here. The combined model, which pairs fractal-based non-linear feature extraction with sub-band modeling, is expected to significantly improve the identification rate. Since HMMs have been reported to outperform GMMs and ANNs in many speech-related projects, this paper investigates the performance of the new combined model using only an HMM-based sub-band speaker model [7].
In this paper, we first present the relevant mathematical background. Thereafter, the speaker identification framework is discussed, followed by a reliable confidence measure. Finally, the results and
conclusions of the sub-band based speaker identification with non-linear multi-scale fractal dimension feature extraction are presented.
II. MATHEMATICAL BACKGROUND
A. Hidden Markov Models
Both the conventional approach and the sub-band based recognizer use HMM recognizers. The left-to-right model is the one most commonly used in speech and speaker recognition and is also used in this investigation. HMM recognizers are based on Bayesian inference and use Markov Chain Monte Carlo (MCMC) to estimate the predictive distribution. MCMC combines a Markov chain with Monte Carlo integration [8]. A Markov chain is a computational model consisting of a finite number of states. The probability of speech feature vectors being generated by an HMM is computed using the transition probabilities between states and the observation probabilities of the feature vectors in a given state. Monte Carlo integration evaluates the posterior distribution of the observed sequence using the Markov chain. This distribution is the object of all Bayesian inference, and expectations with respect to it are estimated by [7]
$$E[f(s)] \approx \frac{1}{n} \sum_{t=1}^{n} f(S_t) \tag{1}$$
where $E[f(s)]$ is the expectation of $f(s)$, $S_t$ is the $t$-th observed state and $n$ is the total number of observed states. Three central HMM issues arise in solving the Markov chain in this paper. The first is evaluation, which entails finding the probability that the model generated the sequence of visible states [9]. The second is decoding, which finds the state sequence that maximizes the probability of the observation sequence. The third is training, which adjusts the model parameters to maximize the probability of the observed sequence. This last step amounts to determining the reference speaker model for each speaker [9].
B. Fractal Dimension
It has been shown that multi-scale structures in turbulence can be accurately modeled using fractal models [10]. This is done by using the short-time fractal dimension of speech sounds as an additional feature, thereby approximating the degree of turbulence in each speaker’s speech sounds. If $s(t)$, $0 \le t \le T$, represents a continuous real-valued short-time speech signal with graph
$$F = \{(t, s(t)) \in \mathbb{R}^2 : 0 \le t \le T\}, \tag{2}$$
where $t$ represents time and $\mathbb{R}$ is the set of real numbers, then the fractal dimension of $F$ is equal to its Hausdorff dimension, $D_H$, which generally lies in the range 1