BIOMETRICS ON THE INTERNET

Overview of compression and packet loss effects in speech biometrics

L. Besacier, P. Mayorga, J.-F. Bonastre, C. Fredouille and S. Meignier

Abstract: An overview is presented of compression and packet loss effects in speech biometrics. These problems appear particularly in recent applications of biometrics over mobile or Internet networks. The influence of speech compression on speaker recognition performance in mobile networks is investigated. In a first experiment, it is found that GSM coding degrades the performance. In a second experiment, the features for the speaker recognition system are calculated directly from the information available in the encoded bit stream. It is found that the low LPC order used in GSM coding is responsible for most of the performance degradation. A speaker recognition system was obtained which is equivalent in performance to the original one, which decodes and reanalyses speech before performing recognition. The joint packet loss and compression effects over IP networks are also studied. It is experimentally demonstrated that the adverse effects of packet loss alone are negligible, while the encoding of speech, particularly at a low bit rate, coupled with packet loss, can reduce the verification accuracy considerably.

1 Introduction

The deployment of biometrics on the Internet is a multidisciplinary task. It involves person authentication techniques based on signal processing, statistical modelling and mathematical fusion methods, as well as data communications, computer networks, communication protocols, and online data security. The necessity for the latter discipline is due to the fact that an online robust biometric authentication strategy would be of little or no value if, for instance, hackers could break into the personal identification server to control the verification of their pretended identities, or could access personal identification data transmitted over the network. Over IP networks, both speech- and image-based biometrics are viable approaches to verification. Focusing on speech biometrics, some predictions for the year 2005 show that 10% of voice traffic will be over IP, which means that speaker verification technology will have to face new problems. The most evident architecture seems to be client-server based, where a distant speaker verification server is remotely accessed by the client for authentication. In this scenario, the speech signal is transmitted from the client terminal to a remote speaker verification server. Compression of the speech signal is then generally necessary to reduce transmission delays and to respect bandwidth constraints. Many problems can appear with this

kind of architecture, particularly when the transmission is made via the Internet:

- First, transcoding (the process of coding and decoding) modifies the spectral characteristics of the speech signal, and can thereby adversely affect the speaker verification performance.
- Secondly, transmission errors can occur on the transmission line, and thus data packets can be lost (for example with the UDP transport protocol, which does not implement any error recovery).
- Thirdly, the time response of the system is increased by coding, transmission and possible error recovery processes. This delay (termed 'jitter' in the domain of computer networks) can be potentially very disturbing. For example, in some applications (e.g. man-machine dialogue), speaker verification is only one subsystem amongst a number of other subsystems. In such cases, the effective operation of the whole system depends heavily on the response time of the individual subsystems.
- Finally, speech packets (or other personal information) transmitted over IP networks could be intercepted and captured by impostors, and subsequently used, for instance, for fraudulent access authorisation.

This paper presents an overview of the issues and problems raised by the first two points: speaker verification robustness to speech compression in mobile networks (Section 2) and to joint compression and packet loss over IP networks (Section 3). The work of Section 3 was conducted in the framework of COST Action 275 [1].

© IEE, 2003
IEE Proceedings online no. 20031033
doi: 10.1049/ip-vis:20031033
IEE Proc.-Vis. Image Signal Process., Vol. 150, No. 6, December 2003
Paper first received 28th April and in revised form 18th September 2003
L. Besacier and P. Mayorga are with CLIPS/IMAG, UJF, BP 53, 38041 Grenoble cedex 9, France
J.-F. Bonastre, C. Fredouille and S. Meignier are with LIA, University of Avignon, 339 ch. des Meinajaries, BP 1228, 84911 Avignon Cedex 9, France

2 Speech compression effects in mobile networks

2.1 Introduction

The first main part of this paper proposes to evaluate the influence of speech compression on text-independent speaker recognition performance, in the framework of

mobile networks (GSM compression). To our knowledge, few contributions [2-5] have been made on this subject, whereas the effect of perturbations in a mobile framework has been more extensively studied for automatic speech recognition, where we can cite among others [6-8]. Recently, a study was also reported on distributed speaker recognition [9], where ETSI AURORA front-ends were applied to the speaker recognition task to avoid speech coding effects. The three existing GSM speech coders are referred to as the full-rate (13 kbit/s), half-rate (5.6 kbit/s) and enhanced full-rate (12.2 kbit/s) GSM coders. A well-known speech database was passed through these coders, obtaining three transcoded versions, as explained in Section 2.2. Two different experiments were carried out. In the first experiment (Section 2.3), the speaker verification performance degradation due to the utilisation of the three GSM speech coders was assessed. In the second experiment (Section 2.4), the features for the speaker recognition system were calculated from the information available in the encoded bit stream. This experiment allows a measurement of the degradation introduced by different aspects of the coder, and gives some guidelines for better and faster use of the information available in the bit stream for speaker recognition purposes.
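The 16 kHz to 8 kHz downsampling described in Section 2.2 relies on a linear-phase FIR half-band filter. Below is a rough, hypothetical sketch of such a filter (not the one actually used in the experiments: a Hamming-windowed sinc design of this length cannot reach the 97 dB stopband attenuation quoted in Section 2.2, which would require e.g. a Kaiser or equiripple design); function names are illustrative.

```python
import numpy as np

def halfband_lowpass(order=158, cutoff=0.25):
    """Linear-phase FIR low-pass by windowed-sinc design (Hamming window).
    cutoff is in cycles/sample: 0.25 of a 16 kHz rate corresponds to 4 kHz."""
    n = np.arange(order + 1) - order / 2.0
    h = 2.0 * cutoff * np.sinc(2.0 * cutoff * n)
    h *= np.hamming(order + 1)
    return h / h.sum()  # normalise for unity gain at DC

def downsample_16k_to_8k(x):
    """Low-pass at 4 kHz, then keep every second sample."""
    y = np.convolve(x, halfband_lowpass(), mode="same")
    return y[::2]
```

The half-band structure keeps the 0-4 kHz content intact before every second sample is discarded, which is exactly the property the paper relies on when comparing TIMIT16k with TIMIT8k.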

2.2 GSM transcoding of an existing database

The whole TIMIT [10] database was downsampled from 16 kHz to 8 kHz, using a 158th-order linear-phase FIR half-band filter, with a very steep transition band (150 Hz), a very flat passband (passband ripple < 0.1 dB), and more than 97 dB of attenuation in the stop band. Thus, the downsampled speech files contain all the frequencies of the original TIMIT in the 0-4 kHz range. Hereafter, the downsampled database will be referred to as TIMIT8k, while the original will be referred to as TIMIT16k. We are aware that the actual antialiasing low-pass filter of a mobile phone may not have such ideal characteristics. TIMIT8k was transcoded using the three GSM speech coders.

2.3 First experiment

2.3.1 Experimental framework: A well-known protocol, the 'long training/short test protocol' [11], was used on TIMIT for speaker identification and verification. 430 speakers (147 women and 283 men) of the database are used, and the whole test set thus consists of 430 × 5 = 2150 test patterns of 3.2 s each, on average. The remaining 200 speakers of the database are used to train the background model needed for the speaker verification experiments. 2150 client accesses and 2150 impostor accesses are made (for each client access, an impostor speaker is randomly chosen among the 429 remaining speakers). All the experiments were carried out under matched conditions (i.e. training and testing are both made using the same database). The speech analysis module extracts, every 10 ms, 16 cepstral coefficients calculated on a frame of length 30 ms. A GMM classifier [12] of N = 16 mixtures with diagonal covariance matrices was tested. The number of Gaussian components is small, but preliminary experiments had shown that increasing N does not significantly improve performance on a relatively 'easy' database like TIMIT. During recognition, the verification score for an utterance is the log-likelihood ratio, computed as the difference between the log-likelihoods of the claimant model and the background model.

2.3.2 Results: Table 1 shows the speaker verification results (% equal error rate) obtained on TIMIT16k, TIMIT8k and the GSM-transcoded TIMIT (FR, HR and EFR).

2.3.3 Comments: The results show a significant performance degradation when using the GSM-transcoded databases, compared to the normal and downsampled versions of TIMIT. The results correspond with the perceptual speech quality of each coder: the higher the speech quality, the higher the measured recognition performance. These results are equivalent to those obtained in [4], whereas [2] and [3] suggest that GSM coding does not introduce major degradations. The following Section is devoted to studying the source of the degradation observed with GSM-transcoded speech (only the FR coder is studied). The possibility of performing recognition directly using codec parameters, rather than parameters extracted from the resynthesised speech, is also investigated.

2.4 Second experiment

The purpose of this experiment is twofold: to find out which portions of the encoder are responsible for the major degradations, and to improve the performance with respect to the results obtained by extracting the features from resynthesised speech. The idea is to extract the features directly from the encoded bitstream instead of from the resynthesised speech. Since speech analysis (extraction of features) is a process that occurs not only for speech coding, but also for recognition, one can avoid doing this analysis job twice by taking adequate features for recognition directly from the encoded bitstream data. The advantage of this is twofold: first, it should allow the use of 'clean' coefficients extracted on the client terminal side; moreover, since decoding and analysis steps are removed on the server side, the time response of the overall system should be reduced. This idea of extracting features from the bitstream was already proposed for speech recognition in [13] and [14], but never for speaker recognition, which is a 'purely' acoustic task likely to be much more affected by GSM coding distortions than ASR. This hypothesis is based on the following: speaker recognition technology uses only acoustic models, which can be degraded by compression, whereas speech recognition technology (ASR) uses acoustic models and also language models. Thus, degradation of the speech recogniser acoustic model can be recovered by the language model, especially if the language model is well adapted to the test data (which was actually observed in a former paper [8]). Results obtained during this experiment are reported in Table 2.

Table 1: Speaker verification results (%EER) for original and GSM-transcoded speech

  Original:        TIMIT16k  1.1%    TIMIT8k  5.1%
  GSM transcoded:  FR  7.3%    HR  7.8%    EFR  6.6%

Table 2: Speaker verification results for the second experiment (Section 2.4)

  Line  Coefficients                           EER (%)
  1     Baseline: resynthesised speech (FR)    7.3
  2     LPC8 → c0-c15                          7.0
  3     LPC8 → c1-c15                          7.8
  4     LPC12 → c0-c15                         5.5
  5     FR (no Q) → c1-c15                     7.5
  6     FR (no Q) → c1-c16                     7.5
  7     FR (with Q) → c1-c15                   8.4
  8     FR (with Q) → c0-c15                   7.0
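The conversion from LPC to cepstral coefficients used throughout Table 2 (e.g. LPC8 → c1-c15) is the well-known recursion for minimum-phase signals. A minimal sketch, with `lpc_to_cepstrum` as an illustrative name:

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c1..c_n of the minimum-phase all-pole model
    1/A(z), with A(z) = 1 + sum_k a[k-1] z^{-k}.
    For n <= p:  c_n = -a_n - (1/n) sum_{k=1}^{n-1} k c_k a_{n-k}
    For n > p :  c_n =       -(1/n) sum over k with n-k <= p."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = -a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```

Note that the recursion produces as many cepstral coefficients as requested even when n exceeds the LPC order p, which is how c1-c15 can be derived from an 8th-order LPC (line 2 of Table 2).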

Line 1 in Table 2, corresponding to the baseline, lists the value reported for the TIMIT FR experiment in Table 1. All the processes (lines 2-8) were carried out using TIMIT8k, but the feature extraction was made compatible with the FR coder characteristics: 20 ms segmentation, calculation of an 8th-order LPC (LPC8), calculation of cepstral coefficients c1-c15 from the LPC using the well-known recursion for minimum-phase signals, and calculation of c0 using log(E), where E is the energy of the LPC residual. The results obtained with this feature extraction are given in line 2 of Table 2. For lines 2-4, the feature extraction is done with a C program using double-precision floating-point arithmetic: line 3 uses only cepstral coefficients c1-c15 (no energy term c0); line 4 uses an LPC model order of 12 instead of 8. Feature extraction for lines 5-8 is carried out using the FR C program, which uses a simulated 16-bit fixed-point arithmetic: line 5 uses c1-c15, from LPC before quantisation (LPC coding-decoding); line 6 uses c1-c16, from LPC before quantisation; line 7 uses c1-c15, from LPC after quantisation; line 8 uses c1-c15, from LPC after quantisation, and c0, which is calculated using log(Ê), where Ê is the energy of the reconstructed LPC residual.

Comments from pairwise comparisons of the data in Table 2 can be made:

Lines 1-2: The use of the new feature extraction (more compatible with the FR characteristics) does not introduce significant distortion.
Lines 2-3: The use of c0 (more laborious to calculate from the bit-stream) seems important for performance.
Lines 2-4: The low LPC order in GSM FR coding (LPC8) is responsible for most of the performance degradation. Better results are likely to be obtained in experiments using the EFR, which has a 10th-order LPC.
This result (high degradation for the speaker verification task), compared to results obtained with the same coder on the speech recognition task [8] (where only a slight degradation is observed for ASR), may also suggest that speaker-specific information is present more in the higher-order LPC coefficients than linguistic information.
Lines 5-6: No performance improvement is expected by retaining cepstral coefficients beyond c15 without increasing the LPC order.
Lines 5-7: LPC quantisation in the FR coder decreases the verification performance.
Lines 7-8: The c0 calculated from the reconstructed residual significantly improves the performance.
Conclusions from lines 1-8: By extracting the features directly from the information in the encoded bit-stream, we have managed to obtain a speaker verification system which is equivalent to the baseline. The performance is even slightly (but not significantly) improved, and unnecessary processing was removed.

3 Joint packet loss and compression effects over IP networks

This section proposes a methodology for evaluating speaker verification performance over IP networks. The idea is to duplicate an existing and well-known database used for speaker verification (XM2VTS) by passing its speech signals through different coders and different network conditions representative of what can occur over the Internet. Section 3.1 is dedicated to the database description and the degradation methodology adopted, whereas Section 3.2 presents the speaker verification system (which is different from that of Section 2) and some results obtained with this IP-degraded version of XM2VTS.

3.1 Database used and degradation methodology

3.1.1 XM2VTS database: In acquiring the XM2VTS [15] database, 295 volunteers from the University of Surrey visited a recording studio four times, at approximately one-month intervals. On each visit (session) two recordings (shots) were made: the first shot consisted of speech, while the second consisted of rotating head movements. This multimodal database is used by many partners of COST Action 275. The work described in this paper used its speech part, in which the subjects were asked to read three sentences twice. The three sentences remained the same throughout all four recording sessions, and a total of 7080 speech files were made available on four CD-ROMs. The audio, originally stored as mono, 16-bit, 32 kHz PCM wave files, was downsampled to 8 kHz, the input sampling frequency required by the speech codecs considered in this study.

3.1.2 Codecs used: H.323 is a standard commonly used for transmitting voice and video over IP networks. The audio codecs used in this standard are G711, G722, G723, G728 and G729. We use in our experiments the codec which has the lowest bit rate, G723.1 (6.4 and 5.3 kbit/s), and the one with the highest bit rate, G711 (64 kbit/s).
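G711 is essentially logarithmic companding of each sample, which is why it degrades the signal far less than the analysis-by-synthesis G723.1. As an illustrative sketch of the continuous mu-law companding curve underlying the codec (the real G711 additionally quantises the companded value to 8 bits, which this sketch omits; names are ours):

```python
import numpy as np

MU = 255.0  # mu-law constant used in the North American/Japanese G711 variant

def mulaw_compress(x):
    """Mu-law companding curve for samples normalised to [-1, 1]."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU
```

The curve expands small amplitudes before quantisation, so the effective quantisation noise stays roughly proportional to the signal level, which explains why G711 at 64 kbit/s is almost transparent for the verification results in Table 3.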

3.1.3 Packet loss simulation: There are two main transport protocols used on IP networks: UDP and TCP. While the UDP protocol does not allow any recovery of transmission errors, TCP includes error recovery processes. However, the transmission of speech via TCP connections is not very realistic, owing to the requirement for real-time (or near real-time) operation in most speech-related applications [16]. As a result, the choice is limited to the use of UDP, which involves packet loss problems. The process of audio packet loss can be simply characterised using a Gilbert model [17, 18] consisting of two states (Fig. 1). One of the states (state 1) represents a packet loss and the other state (state 0) represents the case where packets are correctly transmitted. The transition probabilities in this statistical model, as shown in Fig. 1, are represented by p and q: p is the probability of going from state 0 to state 1, and q is the probability of going from state 1 to state 0. Different values of p and q define different packet loss conditions that can occur on the Internet. The probability that n consecutive packets are lost is given by p(1 - q)^(n-1).

Fig. 1 Gilbert model

If (1 - q) > p, then the probability of losing a packet is greater after having already lost a packet than after having successfully received a packet [18]. This is generally the case in data transmission on the Internet, where packet losses occur in bursts. Note that p + q is not necessarily equal to 1. When the p and q parameters are fixed, the mean number of consecutive packets lost can be easily calculated as p/q^2. Of course, the larger this mean, the more severe is the degradation. Different values of p and q representing different network conditions were used in this study. Table 3 includes a summary of the different coders and simulated network conditions that were considered:

- Two degraded versions of XM2VTS were obtained by applying the G711 and G723 codecs alone, without any packet loss.
- Six further degraded versions of XM2VTS were obtained using simulated packet loss conditions: two conditions (average/bad) and three speech qualities (clean/G711/G723).

The simulated average and bad network conditions considered in this study corresponded to 9% and 30% speech packet losses, respectively. Each packet contained 30 ms of speech, consistent with the duration proposed in RTP (the real-time protocol used under H.323).
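The simulated loss conditions above can be reproduced with a few lines of code. A minimal sketch of the two-state Gilbert model (function name illustrative); the long-run loss rate of such a chain is p/(p+q):

```python
import random

def gilbert_loss(n_packets, p, q, seed=0):
    """Simulate packet loss with a two-state Gilbert model.
    p = P(received -> lost), q = P(lost -> received).
    Returns a list of booleans, True meaning the packet is lost."""
    rng = random.Random(seed)
    state = 0  # 0 = packet received, 1 = packet lost
    lost = []
    for _ in range(n_packets):
        if state == 0:
            state = 1 if rng.random() < p else 0
        else:
            state = 0 if rng.random() < q else 1
        lost.append(state == 1)
    return lost
```

With (1 - q) > p, a loss is more likely to be followed by another loss, reproducing the bursty behaviour observed on real Internet paths; the boolean mask can then be applied packet by packet to the coded speech stream.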

3.2 Speaker verification experiments with the ELISA system

The ELISA Consortium groups several public laboratories working on speaker recognition. One of the main objectives of the consortium is to emphasise assessment of performance. In particular, the consortium has developed a common speaker verification system, which has been used for participation in various NIST speaker verification evaluation campaigns [19, 20]. The ELISA system is a complete framework designed for speaker verification: a GMM-based system including audio parameterisation as well as score normalisation techniques. This system was presented at the NIST evaluations from 1998 to 2002 and showed state-of-the-art performance. ELISA is now collaborating with COST Action 275 on performance assessment of multimodal person authentication systems over the Internet. ELISA evaluated the speaker verification performance using the COST 275 dedicated database detailed in Section 3.1.

3.2.1 Speaker verification protocol on XM2VTS: For the purpose of this investigation, the Lausanne protocol (configuration 2), already defined for the XM2VTS database, is adopted. There are 199 clients in the XM2VTS database. The training of the client models is carried out using the full sessions 1 and 2 of the client part of XM2VTS. 398 client test accesses are obtained using the full session 4 (2 shots) of the client part. 111 440 impostor accesses are obtained using the impostor part of the database (70 impostors × 4 sessions × 2 shots × 199 clients = 111 440 impostor accesses). The 25 evaluation impostors of XM2VTS are used to develop a world model. The text-independent speaker verification experiments are conducted in matched conditions (same training/test conditions).
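All results in this paper are reported as percentage equal error rate (%EER), the operating point where the false rejection rate equals the false acceptance rate. A simple sketch of how it can be estimated from lists of client and impostor scores (a minimal threshold sweep; function name illustrative):

```python
import numpy as np

def equal_error_rate(client_scores, impostor_scores):
    """Estimate the EER by sweeping a decision threshold over all
    observed scores (a trial is accepted when score >= threshold)."""
    thresholds = np.sort(np.concatenate([client_scores, impostor_scores]))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        frr = np.mean(client_scores < t)     # false rejection rate
        far = np.mean(impostor_scores >= t)  # false acceptance rate
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2.0
    return eer
```

With finite score lists the two error rates rarely cross exactly, so the sweep returns the midpoint at the threshold where they come closest.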

3.2.2 ELISA system on XM2VTS: The ELISA system on XM2VTS is based on the LIA system presented at the NIST 2002 speaker recognition evaluation. The speaker verification system uses 32 parameters (16 LFCC + 16 delta-LFCC). Silence frame removal is applied before centring (cepstral mean subtraction) and variance normalisation of the vectors. For the world model, a 128-Gaussian-component GMM was trained using Switchboard II phase II data (8 kHz landline telephone) and then adapted (MAP [21], means only) on XM2VTS data (the 25 evaluation impostors set). The client models are 128-Gaussian-component GMMs derived by adapting (MAP, means only) the previous world model. The decision logic is based on the conventional log-likelihood ratio (LLR). No LLR normalisation, such as Znorm [22], Tnorm [23] or Dnorm [24], is applied before the decision process.
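The GMM-UBM decision described above can be sketched as follows. This is a schematic illustration of diagonal-covariance GMM scoring and the LLR decision score, not the actual ELISA/LIA implementation; all names and parameter layouts are illustrative:

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    X: (T, D) feature frames; weights: (M,); means, variances: (M, D)."""
    comps = []
    for w, mu, var in zip(weights, means, variances):
        diff = X - mu
        comps.append(np.log(w)
                     - 0.5 * np.sum(np.log(2.0 * np.pi * var))
                     - 0.5 * np.sum(diff ** 2 / var, axis=1))
    # log-sum-exp over the M mixture components, per frame
    return np.logaddexp.reduce(np.stack(comps), axis=0)

def llr_score(X, client, world):
    """Verification score: mean per-frame log-likelihood ratio between
    the claimant model and the world (background) model."""
    return float(np.mean(gmm_loglik(X, *client) - gmm_loglik(X, *world)))
```

Because the score is a sum over frames treated independently, dropping frames (e.g. through packet loss) only shrinks the evidence; it does not otherwise distort the score — the property invoked in Section 3.2.3 to explain the robustness to packet loss.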

3.2.3 Results: The speaker verification performance with the simulated degraded versions of XM2VTS is presented in Table 3. Based on these results, it can be concluded that, for text-independent speaker verification, the degradation due to packet loss is negligible compared with the degradation due to compression, even in bad network conditions. Comparing these results with those for speech recognition [25], it can be said that speaker verification performance is far less sensitive to packet loss. This is probably because the modelling is GMM-based, which considers every frame as an independent entity. GMMs are therefore not sensitive to the temporal discontinuities induced by packet loss, and the only consequence is a reduction in the amount of signal data available for taking a decision. We feel that the conclusions would be very different in a text-dependent mode, where temporal information is important. On the other hand, Table 3 shows that the speaker verification performance is adversely affected when the speech material is encoded at low bit rates (e.g. using G723.1). These results are in agreement with those in Section 2 of this paper, describing the performance of speaker verification in the framework of mobile devices with speech compression.

4 Conclusions

This paper has focused on the emerging need for vocal biometric user authentication over the Internet. More precisely, it has presented the constraints tied to the use of

Table 3: Results (%EER) of the experiments using degraded XM2VTS

  Condition                               Clean (128 kbit/s)   G711 (64 kbit/s)   G723 (5.3 kbit/s)
  No packet loss                          0.25%                0.25%              2.68%
  Average network (p = 0.1, q = 0.7)      0.25%                0.25%              6.28%
  Bad network (p = 0.25, q = 0.4)         0.50%                0.75%              9.00%

mobile or Internet transmission channels at the speech signal level. The experiments have shown that, for text-independent vocal person authentication, the degradation due to packet loss is negligible compared with the degradation due to compression (Section 3). This is probably due to the GMM models used, which consider every frame as an independent entity. This is in contrast with previous automatic speech recognition (ASR) experiments, where packet loss was found to reduce the accuracy significantly. However, a degradation of the performance is observed when low bit-rate speech compression is applied to the speech signal (the GSM codec in Section 2 and the G723.1 codec in Section 3). In this case, packet loss can increase the degradation (Section 3). A methodology for measuring the different sources of performance degradation within the FR GSM coder was also presented. It also highlights the prospect of directly exploiting the coder output parameters, instead of decoding and reanalysing speech, for robustness and optimisation purposes. This strategy may be an alternative to other effective architecture solutions such as DSR [9], which may not be applicable everywhere. For instance, one may want to access a speaker recognition service from a very basic terminal which is just able to capture and transmit speech (a mobile phone, for instance) and which does not implement an adequate DSR front-end.

5 References

1 http://www.fub.it/cost275/
2 Kuitert, M., and Boves, L.: 'Speaker verification with GSM coded telephone speech'. Proc. Eurospeech'97, Rhodes, Greece, 1997, vol. 2, pp. 975-978
3 El-Maliki, M., Renevey, P., and Drygajlo, A.: 'Speaker verification for noisy GSM quality speech'. Proc. COST 254 Workshop, Neuchâtel, Switzerland, 5-7 May 1999
4 Quatieri, T.F., Singer, E., Dunn, R.B., Reynolds, D.A., and Campbell, J.P.: 'Speaker and language recognition using speech codec parameters'. Proc. Eurospeech'99, Budapest, Hungary, 1999, vol. 2, pp. 787-790
5 Evans, N.W.D., Mason, J.S., Auckenthaler, R., and Stapert, R.: 'Assessment of speaker verification degradation due to packet loss in the context of wireless mobile devices'. Presented at COST 275 Workshop on Biometrics over the Internet, Rome, Italy, November 2002
6 Dufour, S., Glorion, C., and Lockwood, P.: 'Evaluation of root-normalised front-end (RN LFCC) for speech recognition in wireless GSM network environments'. Proc. ICASSP'96, Atlanta, GA, USA, 1996, vol. 1, pp. 77-80
7 Karray, L., Jelloun, A.B., and Mokbel, C.: 'Solutions for robust recognition over the GSM cellular network'. Proc. ICASSP'98, Seattle, WA, USA, 1998, vol. 1, pp. 261-264
8 Besacier, L., Bergamini, C., Vaufreydaz, D., and Castelli, E.: 'The effect of speech and audio compression on speech recognition performance'. Proc. IEEE Multimedia Signal Processing Workshop, Cannes, France, October 2001
9 Broun, C.C., Campbell, W.M., Pearce, D., and Kelleher, H.: 'Distributed speaker recognition using the ETSI distributed speech recognition standard'. Presented at 2001: A Speaker Odyssey, The Speaker Recognition Workshop, Crete, Greece, 18-22 June 2001
10 Fisher, W., Zue, V., Bernstein, J., and Pallet, D.: 'An acoustic-phonetic database', J. Acoust. Soc. Am., Suppl. A, 1986, 81, (S92)
11 Bimbot, F., Magrin-Chagnolleau, I., and Mathan, L.: 'Second-order statistical methods for text-independent speaker identification', Speech Commun., 1995, 17, (1-2), pp. 177-192
12 Reynolds, D.: 'Speaker identification and verification using Gaussian mixture speaker models'. Proc. Workshop on Automatic Speaker Recognition, Identification and Verification, Martigny, Switzerland, 5-7 April 1994, pp. 27-30
13 Huerta, J.M., and Stern, R.M.: 'Speech recognition from GSM codec parameters'. Proc. Int. Conf. on Spoken Language Processing (ICSLP), Sydney, Australia, 1998
14 Pelaez-Moreno, C., Gallardo-Antolin, A., and Diaz-de-Maria, F.: 'Recognizing voice over IP: a robust front-end for speech recognition on the World Wide Web', IEEE Trans. Multimed., June 2001, 3, pp. 209-218
15 http://www.ee.surrey.ac.uk/Research/VSSP/xm2vtsdb/
16 Black, U.: 'Voice over IP' (Prentice Hall, 2001)
17 Bolot, J.-C., and Fosse-Parisis, S.: 'Adaptive FEC-based error control for Internet telephony'. Proc. IEEE Infocom'99, New York, NY, USA, March 1999
18 Yajnik, M., Moon, S., Kurose, J., and Towsley, D.: 'Measurement and modelling of the temporal dependence in packet loss'. Proc. IEEE Infocom'99, New York, NY, USA, March 1999
19 The ELISA consortium: 'The ELISA systems for the NIST 99 evaluation in speaker detection and tracking', Digit. Signal Process., 2000, 10, (1-3), pp. 143-153
20 Magrin-Chagnolleau, I., Gravier, G., and Blouet, R., for the ELISA consortium: 'Overview of the ELISA consortium research activities'. Proc. 2001: A Speaker Odyssey, Crete, Greece, 18-22 June 2001, pp. 67-72
21 Gauvain, J.L., and Lee, C.-H.: 'Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains', IEEE Trans. Speech Audio Process., April 1994, 2, pp. 291-298
22 Li, K.P., and Porter, J.E.: 'Normalizations and selection of speech segments for speaker recognition scoring'. Proc. ICASSP, 1988, pp. 595-598
23 Auckenthaler, R., Carey, M., and Lloyd-Thomas, H.: 'Score normalization for text-independent speaker verification system', Digit. Signal Process., 2000, 10, (1-3)
24 Ben, M., Blouet, R., and Bimbot, F.: 'A Monte-Carlo method for score normalization in automatic speaker verification using Kullback-Leibler distances'. Proc. ICASSP, Orlando, FL, USA, 2002
25 Mayorga-Ortiz, P., Lamy, R., and Besacier, L.: 'Recovering of packet loss for distributed speech recognition'. Proc. Eusipco 2002, Toulouse, France, Sept. 2002
