Language Identification using Warping and the Shifted Delta Cepstrum

Felicity Allen, Eliathamby Ambikairajah
School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, Australia

Julien Epps

Interfaces, Machines and Graphic Environments, National ICT Australia, Eveleigh 1430, Australia

Abstract- This paper proposes the novel use of feature warping for automatic language identification, in combination with the Shifted Delta Cepstrum (SDC) and Perceptual Linear Predictive coefficients in a Gaussian Mixture Model (GMM) based system. Experimental results on various configurations of front-end techniques reported herein demonstrate that, besides providing robustness against channel mismatch and noise as found in existing literature, feature warping is useful more generally as a technique for pre-mapping data for improved compatibility with a GMM back-end. The configuration reported in this paper provides a language identification performance of 76.4% using the OGI/NIST database, a 46.5% relative reduction in error rate when compared with a benchmark system employing Mel frequency cepstral coefficients and the SDC.

Key Words - Language Identification, Feature Warping, Shifted Delta Cepstrum

I. INTRODUCTION

Traditionally, research into language identification (LID) has favoured phonetic approaches, due to early research indicating the greater efficacy of these techniques [1]. The alternative acoustic approach, based on Gaussian Mixture Models (GMM), was initially shown to produce much poorer results. However, the increased computing power now available, the introduction of improved training algorithms by Wong et al. [2] and the replacement of the usual delta and acceleration coefficients by the Shifted Delta Cepstrum (SDC) proposed by Bielefeld [3] have combined to produce acoustic LID systems capable of achieving results comparable with or superior to their phonetic counterparts [4]. The main advantages of the acoustic GMM approach over the phonetic approach are the significantly lower computational costs and the elimination of the need for expensive phonetic transcriptions of language data. Thus LID systems based on the acoustic approach are likely to be cheaper and more suitable for real-time applications.

This paper proposes the use of feature warping, a technique not used in acoustic language identification systems to date. Recently used by Pelecanos et al. [5] for speaker identification and by Torre et al. [6] for robust speech recognition, feature warping maps the short-term distribution of each feature stream to a standardized distribution. This paper demonstrates that feature warping is effective not only as a technique for compensating for channel mismatch and noise but also as a method for pre-mapping data to create greater compatibility with the GMM back-end.

II. EXISTING FRONT-END TECHNIQUES

1) Feature Warping

Feature warping, also known in the image processing literature as histogram equalization and in the speech processing literature as cumulative distribution mapping, was suggested as a method to provide robustness against channel mismatch and non-linear noise effects [5],[6]. The technique maps each feature vector stream to a standardized distribution over a specified time interval. Speech recognition [6] and speaker verification [5] tasks using this technique were found to exhibit superior performance to those using other methods, including Cepstral Mean Subtraction (CMS) and mean and variance normalization.

According to the methodology described in [5], a sliding window of N samples is used for each feature stream. The warped feature value is calculated for the cepstral feature in the centre of the sliding window as follows. The ranking R is calculated as the index of the central cepstral feature when the features in the window are sorted in descending order. R is then used as an index into a pre-calculated lookup table with the values given by m in

$$\frac{N + \frac{1}{2} - R}{N} = \int_{z=-\infty}^{m} h(z)\,dz \tag{1}$$
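
As an illustration, the ranking and mapping steps described above can be sketched in Python. This is a minimal sketch rather than the implementation used in [5]: the function name `warp_stream`, the choice of a standard normal target for h(z), the truncation of the window at the stream edges and the handling of tied values are all assumptions, and the inverse cumulative distribution is evaluated directly instead of through a pre-calculated lookup table.

```python
from statistics import NormalDist


def warp_stream(stream, window):
    """Warp one feature stream towards a standard normal target (hypothetical helper)."""
    target = NormalDist()  # assumed target h(z): zero mean, unit variance
    half = window // 2
    warped = []
    for t, value in enumerate(stream):
        # Sliding window centred on frame t, truncated at the stream edges
        # (the edge handling is an assumption of this sketch).
        lo = max(0, t - half)
        hi = min(len(stream), t + half + 1)
        frame = stream[lo:hi]
        n = len(frame)
        # Rank R of the central value when the window is sorted in descending
        # order (1 = largest). Tied values all receive the best rank of the tie.
        rank = 1 + sum(1 for x in frame if x > value)
        # Solve (n + 1/2 - R) / n = integral_{-inf}^{m} h(z) dz for m.
        warped.append(target.inv_cdf((n + 0.5 - rank) / n))
    return warped


if __name__ == "__main__":
    print(warp_stream([5.0, 1.0, 3.0], window=7))
```

For example, with `window=7` every frame of the three-sample stream above sees the whole stream, so the middle-ranked value 3.0 has R = 2 and is mapped to 0, since (3 + 1/2 - 2)/3 = 0.5 is the median of the assumed target distribution.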

In (1), h(z) is the target probability distribution.

2) Shifted Delta Cepstrum

Typically, language and speaker recognition tasks use feature vectors containing cepstra together with delta and acceleration cepstra. Recently, however, the Shifted Delta Cepstrum (SDC) has been found to exhibit superior performance to the delta and acceleration cepstra in a number of language identification studies [3],[4],[7] due to its ability to incorporate additional temporal information, spanning multiple frames, into the feature vector. First proposed by Bielefeld [3], SDCs are obtained by concatenating the delta cepstra computed across multiple frames of speech. Four parameters (N, d, P and k) specify the computation of the SDCs as follows [7]: N cepstral coefficients are computed each frame, then for each of these cepstral streams the final vector at time t is given by the concatenation of the Δc(t+iP) for all 0 < i