Institute of Electronics Engineers of the Philippines (IECEP) Journal Vol. 3, No. 1, pp. 1 – 5, May 2014. ISSN 2244 – 2146 (Print)

A DSP BASED VECTOR QUANTIZED MEL FREQUENCY CEPSTRUM COEFFICIENTS FOR SPEECH RECOGNITION

Ronnel Ivan A. Casil, Ericson D. Dimaunahan, Maria Elena C. Manamparan, Bahareh Ghorban Nia, and Angelo A. Beltran, Jr.

Abstract—This paper presents a speech recognition method that uses mel frequency cepstrum coefficients (MFCC) together with vector quantization (VQ). The system consists of two phases: a training phase, in which a database of known speakers is built, and a testing phase, in which speech is recognized and the speaker identified. MFCC processing is used for feature extraction and VQ is used for feature matching. Simulation studies were carried out in MATLAB, using DSP-based MFCC and VQ, to verify the effectiveness of the proposed scheme. The results show that the proposed method is remarkably effective for speech recognition, while remaining simple, easy to realize, feasible, and accurate.

Index Terms—Mel frequency cepstrum coefficients, Speech recognition, Digital signal processing, Vector quantization

I. INTRODUCTION

SPEAKER RECOGNITION technology makes it possible to use a speaker's voice to control access to restricted services, for example, giving commands to a computer, telephone access to banking, database services, shopping or voice mail, and access to secure equipment [1]. Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in the speech signal. It can be divided into speaker identification and speaker verification. Speaker identification determines which registered speaker provided a given utterance from amongst a set of known speakers, while speaker verification accepts or rejects the identity claim of a speaker [1].

Ronnel Ivan A. Casil earned his B.S. ECE from De La Salle University, Taft, Manila, Philippines (e-mail: [email protected]) Ericson D. Dimaunahan is a Cisco Systems Lecturer at the Mapua Institute of Technology, Manila, Philippines (e-mail: [email protected]) Ma. Elena C. Manamparan is with the Interface Core Team, Test Equipment Engineering Department, Analog Devices, General Trias, Cavite, Philippines (e-mail: [email protected]) Bahareh Ghorban Nia earned her B.S. EE from Sajjad University, Masshad, Iran (e-mail: [email protected]) Angelo A. Beltran, Jr. (e-mail: [email protected])

The speaker identification task can be further classified into text-dependent and text-independent tasks. In the former case, the utterance presented to the system is known beforehand. In the latter case, no assumption about the text being spoken is made, and the system must model the general underlying properties of the speaker's vocal spectrum. In general, text-dependent systems are more reliable and accurate, since both the content and the voice can be compared.

This paper is organized as follows. Section II presents the discussion of MFCC and VQ. Simulation results are shown and discussed in Section III. A conclusion is finally given in Section IV.

II. MEL FREQUENCY CEPSTRUM COEFFICIENTS AND VECTOR QUANTIZATION

A. Mel Frequency Cepstrum Coefficients (MFCC)

The first step in any DSP-based signal conversion is to extract features, that is, to identify the components of the audio signal that are useful for identifying the linguistic content while discarding everything else that carries information such as background noise and emotion [2]. The main point to understand about speech is that the sounds generated by a human are filtered by the shape of the vocal tract, including the tongue, teeth, etc. This shape determines what sound comes out, and if the shape can be determined accurately, it gives an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCC is to accurately represent this envelope. The procedure for MFCC is as follows [2]:

(1) Frame the signal into short frames. An audio signal is constantly changing; hence, to simplify things, it is assumed that on short time scales the audio signal does not change much. This is why the signal is framed into 20 - 40 ms frames. If the frame is much shorter, there are not enough samples to get a reliable spectral estimate; if it is longer, the signal changes too much throughout the frame.

(2) Calculate the power spectrum of each frame. This is motivated by the human cochlea (an organ in the ear), which vibrates at different spots depending on the frequency of the incoming sounds. Depending on the location in the cochlea that vibrates (which wobbles small hairs), different nerves fire, informing the brain that certain frequencies are present. The periodogram estimate performs a similar job in this paper, identifying which frequencies are present in the frame.

(3) Apply the mel filterbank to the power spectra. The periodogram spectral estimate still contains a lot of information that is not required for speech recognition. In particular, the cochlea cannot discern the difference between two closely spaced frequencies, and this effect becomes more pronounced as the frequencies increase. For this reason, clumps of periodogram bins are taken and summed up to get an idea of how much energy exists in various frequency regions. This is performed by the mel filter bank: the first filter is very narrow and gives an indication of how much energy exists near 0 Hertz. As the frequencies get higher, the filters get wider, as we become less concerned about variations; we are only interested in roughly how much energy occurs at each spot. The mel scale tells us exactly how to space the filter banks and how wide to make them.

(4) Take the logarithm of all filter bank energies. This is also motivated by human hearing: we do not hear loudness on a linear scale. Generally, to double the perceived volume of a sound we need to put about eight times as much energy into it. This means that large variations in energy may not sound all that different if the sound is loud to begin with. This compression operation makes the features match more closely what humans actually hear.

(5) Take the discrete cosine transform (DCT) and keep a subset of the coefficients. There are two main reasons why this is performed. Because the filter banks all overlap, the filter bank energies are quite correlated with each other; the DCT decorrelates the energies, which means diagonal covariance matrices can be used to model the features (e.g., in an HMM classifier). Notice, however, that only 12 of the 26 DCT coefficients are kept. This is because the higher DCT coefficients represent fast changes in the filter bank energies, and it turns out that these fast changes actually degrade speech recognition performance, so a small improvement is obtained by dropping them. A code sketch of these five steps is given below.

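The pipeline can be illustrated with a short script. The following is a minimal NumPy sketch of the five steps above, not the authors' MATLAB implementation; the 16 kHz sample rate, 25 ms frame length, 10 ms frame step, and 512-point FFT size are illustrative assumptions, while the 26 filters and the 12 retained DCT coefficients follow the text.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sample_rate=16000, frame_len=0.025, frame_step=0.010,
         nfft=512, nfilt=26, ncep=12):
    """Minimal MFCC sketch following steps (1)-(5): frame, periodogram,
    mel filterbank, log, DCT (keep the first `ncep` coefficients)."""
    # (1) Frame the signal into short overlapping frames.
    flen = int(round(frame_len * sample_rate))
    fstep = int(round(frame_step * sample_rate))
    nframes = 1 + max(0, (len(signal) - flen) // fstep)
    idx = np.arange(flen)[None, :] + fstep * np.arange(nframes)[:, None]
    frames = signal[idx] * np.hamming(flen)

    # (2) Periodogram estimate of the power spectrum of each frame.
    pspec = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft

    # (3) Apply a bank of `nfilt` triangular mel-spaced filters.
    def hz2mel(f):   # Eq. (1)
        return 1125.0 * np.log(1.0 + f / 700.0)
    def mel2hz(m):   # Eq. (2)
        return 700.0 * (np.exp(m / 1125.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sample_rate / 2.0), nfilt + 2)
    bins = np.floor((nfft + 1) * mel2hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for m in range(1, nfilt + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge of triangle m
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of triangle m
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    energies = pspec @ fbank.T

    # (4) Take the log of the filterbank energies.
    energies = np.log(np.maximum(energies, np.finfo(float).eps))

    # (5) DCT and keep only the first `ncep` coefficients.
    return dct(energies, type=2, axis=1, norm='ortho')[:, :ncep]

# Example: one second of synthetic audio -> matrix of acoustic vectors.
rng = np.random.default_rng(0)
features = mfcc(rng.standard_normal(16000))
print(features.shape)   # (number of frames, 12)
```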
The problem of speaker recognition belongs to a much broader topic in science and engineering called pattern recognition [2]. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns and, in our case, are sequences of acoustic vectors extracted from an input speech signal using the techniques described in this paper. The classes here refer to the individual speakers. Since the classification procedure in our case is applied to extracted features, it can also be referred to as feature matching. Furthermore, if there exists a set of patterns whose individual classes are already known, then one has a problem in supervised pattern recognition. These patterns comprise the training set and are used to derive a classification algorithm. The remaining patterns are then used to test the classification algorithm; these patterns are collectively referred to as the test set. If the correct classes of the individual patterns in the test set are also known, then the performance of the algorithm can be evaluated.

In speech signal processing, the MFCC is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency [2]. MFCC are commonly derived as follows: (1) Take the Fourier transform of a windowed excerpt of a signal. (2) Map the powers of the spectrum obtained above onto the mel scale using triangular overlapping windows. The mel scale relates the perceived frequency, or pitch, of a pure tone to its actual measured frequency. Humans are much better at discerning small changes in pitch at low frequencies than at high frequencies, and incorporating this scale makes the features match more closely what humans hear. The equation for converting from frequency to the mel scale is

$M(f) = 1125 \ln\!\left(1 + \frac{f}{700}\right)$  (1)

Then, to go from the mel scale back to frequency,

$M^{-1}(m) = 700\left(e^{m/1125} - 1\right)$  (2)

(3) Take the logarithm of the powers at each of the mel frequencies. (4) Take the discrete cosine transform of the list of mel log powers, as if it were a signal.

To take the discrete Fourier transform (DFT) of a frame, we compute

$S_i(k) = \sum_{n=1}^{N} s_i(n)\, h(n)\, e^{-j 2\pi k n / N}, \qquad 1 \le k \le K$  (3)

where $h(n)$ is an $N$-sample-long analysis window (e.g., a Hamming window) and $K$ is the length of the DFT. The periodogram-based power spectral estimate for the speech frame $s_i(n)$ is given by

$P_i(k) = \frac{1}{N} \left| S_i(k) \right|^2$  (4)

This is called the periodogram estimate of the power spectrum.

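As a concrete reading of (3) and (4), the snippet below applies a Hamming analysis window to a single synthetic frame and computes its periodogram; it is a minimal sketch in which the 16 kHz sample rate, 400-sample frame, and 1 kHz test tone are assumptions made only for illustration.

```python
import numpy as np

N = 400                                        # frame length in samples (assumed)
n = np.arange(N)
s_i = np.sin(2 * np.pi * 1000 * n / 16000)     # toy frame: 1 kHz tone at 16 kHz
h = np.hamming(N)                              # N-sample analysis window

# Eq. (3): S_i(k) = sum_n s_i(n) h(n) exp(-j 2 pi k n / N), with K = N here.
S_i = np.fft.fft(s_i * h)

# Eq. (4): P_i(k) = |S_i(k)|^2 / N, the periodogram power spectral estimate.
P_i = (np.abs(S_i) ** 2) / N

peak_bin = int(np.argmax(P_i[:N // 2]))
print(peak_bin * 16000 / N)                    # ~1000 Hz, the tone frequency
```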
Now we create our filter banks. The first filter bank starts at the first point, reaches its peak at the second point, and returns to zero at the third point. The second filter bank starts at the second point, reaches its maximum at the third point, and is zero at the fourth, and so on. An equation for calculating these triangular filters is

$H_m(k) = \begin{cases} 0 & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)} & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)} & f(m) \le k \le f(m+1) \\ 0 & k > f(m+1) \end{cases}$  (5)

where $M$ is the number of filters we want and $f(\cdot)$ is the list of $M + 2$ mel-spaced frequencies.

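The filterbank construction just described can be isolated into a standalone routine. This is a sketch, not code from the paper: it uses the Hertz-to-mel mapping of (1) and its inverse (2), and assumes a 16 kHz sample rate, a 512-point FFT, and 26 filters purely for illustration.

```python
import numpy as np

def hz2mel(f):            # Eq. (1): M(f) = 1125 ln(1 + f/700)
    return 1125.0 * np.log(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel2hz(m):            # Eq. (2): inverse mapping back to Hertz
    return 700.0 * (np.exp(np.asarray(m, dtype=float) / 1125.0) - 1.0)

def mel_filterbank(nfilt=26, nfft=512, sample_rate=16000):
    """Triangular filters per Eq. (5): filter m rises from point f(m-1)
    to a peak at f(m) and falls back to zero at f(m+1)."""
    # M + 2 mel-spaced points, mapped back to FFT bin indices.
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sample_rate / 2.0), nfilt + 2)
    f = np.floor((nfft + 1) * mel2hz(mel_pts) / sample_rate).astype(int)
    H = np.zeros((nfilt, nfft // 2 + 1))
    for m in range(1, nfilt + 1):
        for k in range(f[m - 1], f[m]):          # rising edge
            H[m - 1, k] = (k - f[m - 1]) / max(f[m] - f[m - 1], 1)
        for k in range(f[m], f[m + 1]):          # falling edge
            H[m - 1, k] = (f[m + 1] - k) / max(f[m + 1] - f[m], 1)
    return H

H = mel_filterbank()
print(H.shape)                  # (26, 257)
print(float(hz2mel(1000.0)))    # ~998; 1000 Hz maps to roughly 1000 mel
```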
B. Vector Quantization (VQ)

Vector quantization is a technique for mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a code word. The collection of all code words is called a codebook [4]. Feature matching techniques used in speaker recognition include dynamic time warping (DTW), hidden Markov models (HMM), and VQ [4]. In this paper, the VQ approach is used because of its ease of implementation and good accuracy.

Fig. 1. Conceptual diagram illustrating vector quantization codebook formation. One speaker can be discriminated from another based upon the location of centroids [5].

Fig. 1 illustrates the conceptual diagram of the speech recognition process. Only two speakers and two dimensions of the acoustic space are shown. The circles refer to the acoustic vectors from speaker 1, while the triangles are from speaker 2. In the training phase, using the clustering algorithm described later in this paper, a speaker-specific VQ codebook is generated for each known speaker by clustering his or her training acoustic vectors.

The resulting code words (centroids) are shown in Fig. 1 by the black circles and black triangles for speakers 1 and 2, respectively. The distance from a vector to the closest code word of a codebook is called the VQ distortion. In the recognition phase, an input sound from an unknown voice is 'vector-quantized' using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified as the speaker of the input sound.

After feature extraction, the acoustic vectors extracted from the input speech of each speaker provide a set of training vectors for that speaker. The next important step is to build a speaker-specific VQ codebook for each speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm [3], for clustering a set of L training vectors into a set of M codebook vectors. The algorithm is formally implemented by the following recursive procedure [4]:

(1) Design a one-vector codebook; this is the centroid of the entire set of training vectors, so no iteration is required here.

(2) Double the size of the codebook by splitting each current code word $y_n$ according to the rule

$y_n^{+} = y_n(1 + \epsilon), \qquad y_n^{-} = y_n(1 - \epsilon)$  (6)

where n varies from 1 to the current size of the codebook and $\epsilon$ is a splitting parameter (we choose $\epsilon = 0.01$).

(3) Nearest-neighbor search: for each training vector, find the code word in the current codebook that is closest (in terms of the similarity measurement) and assign that vector to the corresponding cell.

(4) Centroid update: update the code word in each cell using the centroid of the training vectors assigned to that cell.

(5) Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.

(6) Iteration 2: repeat steps 2, 3, and 4 until a codebook of size M is designed. (A code sketch of this procedure is given below.)

Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts by designing a one-vector codebook, then uses a splitting technique on the code words to initialize the search for a two-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained. Fig. 2 illustrates, in a flow diagram, the detailed steps of the LBG algorithm. The 'cluster vectors' step is the nearest-neighbor search procedure, which assigns each training vector to the cluster associated with the closest code word. 'Find centroids' is the centroid update procedure. 'Compute D (distortion)' sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged.

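The LBG procedure above can be written compactly. The following is a minimal NumPy version rather than the authors' implementation: Euclidean distance is assumed as the similarity measure, and the inner iteration stops when the average distortion no longer improves by more than a small tolerance, an illustrative stand-in for the preset threshold of step (5).

```python
import numpy as np

def lbg(training, M=16, eps=0.01, tol=1e-3):
    """LBG codebook design: split (2), nearest-neighbor search (3),
    centroid update (4), iterate until M code words are reached."""
    # (1) One-vector codebook: centroid of all training vectors.
    codebook = training.mean(axis=0, keepdims=True)
    while codebook.shape[0] < M:
        # (2) Double the codebook: y -> y(1 + eps) and y(1 - eps), Eq. (6).
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev_dist = np.inf
        while True:
            # (3) Nearest-neighbor search (Euclidean distance assumed).
            d = np.linalg.norm(training[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            # (4) Centroid update for every non-empty cell.
            for j in range(codebook.shape[0]):
                members = training[nearest == j]
                if len(members) > 0:
                    codebook[j] = members.mean(axis=0)
            # (5) Repeat (3)-(4) until the average distortion stops
            # improving (assumed stopping rule).
            dist = d[np.arange(len(training)), nearest].mean()
            if prev_dist - dist < tol:
                break
            prev_dist = dist
    return codebook

# Example: cluster 500 random 12-dimensional acoustic vectors.
rng = np.random.default_rng(1)
cb = lbg(rng.standard_normal((500, 12)), M=16)
print(cb.shape)   # (16, 12)
```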

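In the recognition phase described earlier, the unknown utterance is quantized against each enrolled speaker's codebook, and the codebook with the smallest total VQ distortion identifies the speaker. A minimal sketch of that decision rule follows; the speaker labels, codebook sizes, and toy data are assumptions made only for illustration.

```python
import numpy as np

def total_vq_distortion(vectors, codebook):
    """Sum of distances from each acoustic vector to its nearest code word."""
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).sum()

def identify(test_vectors, codebooks):
    """Return the speaker whose codebook gives the smallest total distortion."""
    scores = {spk: total_vq_distortion(test_vectors, cb)
              for spk, cb in codebooks.items()}
    return min(scores, key=scores.get), scores

# Toy example: two enrolled speakers with well-separated codebooks.
rng = np.random.default_rng(2)
codebooks = {
    "speaker1": rng.standard_normal((16, 12)) + 5.0,
    "speaker2": rng.standard_normal((16, 12)) - 5.0,
}
test = rng.standard_normal((200, 12)) + 5.0   # utterance resembling speaker1
who, scores = identify(test, codebooks)
print(who)   # speaker1
```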
Fig. 2. Flow diagram of the LBG algorithm.
