Double-talk Detector Based on Speech Feature Extraction for Acoustic Echo Cancellation Mahfoud HAMIDIA, Abderrahmane AMROUCHE USTHB, Faculty of Electronics and Computer Science Speech Communication and Signal Processing Laboratory (LCPTS) P.O. Box 32, Bab Ezzouar, Algiers, Algeria {mhamidia, namrouche}@usthb.dz
Abstract—This paper presents a new Double-Talk Detection (DTD) method for acoustic echo cancellation. The main goal is to remove the undesirable acoustic echoes produced by the coupling between the loudspeaker and the microphone of the mobile station. An Acoustic Echo Canceller (AEC) based on adaptive filtering is an attractive solution. In this work, DTD is performed using discriminative speech features extracted from the far-end and the microphone speech signals. The main purpose is to discriminate between these signals in order to detect Double-Talk (DT) periods. To evaluate the performance, we use the NLMS algorithm to update the filter coefficients. Results obtained on the TIMIT database show that the proposed method performs significantly better than the Normalized Cross-Correlation (NCC) and Geigel methods.
Keywords—Acoustic echo canceller (AEC); double-talk detection (DTD); NCC; Geigel; speech feature extraction; NLMS.
I. INTRODUCTION

Removing undesirable acoustic echoes in communication systems is the role of the acoustic echo canceller [1]. The fundamental principle is to identify the acoustic echo path between the loudspeaker and the microphone of the mobile station, which is modeled by the impulse response of the local area. Generally, a Finite Impulse Response (FIR) filter is used for this task, as depicted in Fig 1; its coefficients $W(n)$ are updated by the adaptive filtering algorithm. The acoustic echo signal $y(n)$ results from filtering the far-end signal $x(n)$ through the room impulse response, as modeled by the following equation:

$$ y(n) = X^{T}(n)\,h \tag{1} $$

where

$$ h = [h_0, h_1, \ldots, h_{L-1}]^{T} \tag{2} $$

is the echo path impulse response, $L$ is the length of the echo path, the superscript $(\cdot)^{T}$ denotes the transpose of a vector, and

$$ X(n) = [x(n), x(n-1), \ldots, x(n-L+1)]^{T} \tag{3} $$

is the length-$L$ history of the received signal, or far-end signal, $x(n)$. The microphone signal $d(n)$ is the sum of the echo signal $y(n)$, the near-end speech signal $s(n)$ and the noise signal $b(n)$:

$$ d(n) = y(n) + s(n) + b(n) \tag{4} $$

[Block diagram: Far-end Speaker, Loudspeaker $x(n)$, Adaptive filter $W(n)$, DTD, Echo Path $h$, Microphone, Near-end Speaker and Noise $b(n)$, with the signals $\hat{y}(n)$, echo $y(n)$, $s(n)$, $d(n)$ and $e(n)$.]
Fig 1. A typical AEC system with a DTD.

The estimated echo signal $\hat{y}(n)$ generated by the adaptive filter, which is a linear combination of several inputs at time $n$, is subtracted from the microphone signal and is given by:

$$ \hat{y}(n) = X^{T}(n)\,W(n) \tag{5} $$

where

$$ W(n) = [w_0(n), w_1(n), \ldots, w_{L-1}(n)]^{T} \tag{6} $$

is the weight vector of the adaptive filter. The error signal $e(n)$ represents the near-end signal estimate if the near-end speaker is active:

$$ e(n) = d(n) - \hat{y}(n) \tag{7} $$

and

$$ e_r(n) = y(n) - \hat{y}(n) \tag{8} $$

is the residual error; minimizing it indicates a good estimate of the echo path. The filter coefficients are updated using an adaptive filtering algorithm. Various algorithms have been proposed in the literature. The most popular is the Normalized Least Mean Squares (NLMS) algorithm, defined as [2]:

$$ W(n+1) = W(n) + \frac{\mu}{\varepsilon + X^{T}(n)\,X(n)}\,X(n)\,e(n) \tag{9} $$
where $W(n+1)$ is the next tap-weight vector and $W(n)$ is the present tap-weight vector of the adaptive filter, $\mu$ is the step-size parameter used in the weight update, with $0 < \mu < 2$, and $\varepsilon > 0$ is a regularization constant that prevents division by a very small data norm.

During double-talk, when the far-end and near-end speech are simultaneously active, the adaptive filter coefficients remain affected by the near-end speech for some time. The common solution to this problem is to slow down or completely halt the filter adaptation when active near-end speech is detected [3], as shown in Fig 1. This is the task of the double-talk detector. Several double-talk detection methods have been proposed in the literature (see, e.g., [4]-[9]). In this paper we present a novel double-talk detection method based on feature extraction from the far-end and microphone speech signals, where a similarity measure is used to detect the double-talk periods.

The remainder of this paper is organized as follows: Section II introduces double-talk detection. Section III discusses the proposed method. Section IV presents the results and Section V concludes the paper.

II. DOUBLE-TALK DETECTION

When the near-end speaker is active, or when speech comes from both the far end and the near end, the tap coefficients of the adaptive filter diverge from their optimum values. The purpose of the double-talk detector (DTD) is to detect the segments of near-end speech and block the adaptation of the acoustic echo canceller [10]. Typically, the DTD calculates a detection statistic $\xi(n)$, and double-talk is declared when $\xi(n)$ is lower than some threshold value $T$ [9]. The optimum decision variable $\xi(n)$ for double-talk detection behaves as follows:
If $s(n) = 0$ (double-talk is not present), $\xi(n) \geq T$.
If $s(n) \neq 0$ (double-talk is present), $\xi(n) < T$.
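The control of the adaptive filter by the DTD is defined as:

$$ \text{Control} = \begin{cases} \xi(n) \geq T, & \text{DTD} = 0: \text{adaptation } (\mu \neq 0) \\ \xi(n) < T, & \text{DTD} = 1: \text{no adaptation } (\mu = 0) \end{cases} $$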
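As an illustration, the NLMS update (9) gated by the DTD control rule can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names `nlms_step` and `run_aec`, the three-tap toy echo path and the white-noise far-end signal are hypothetical choices made for the example, while the default values of $\mu$ and $\varepsilon$ are those listed in Table I.

```python
import random

def nlms_step(w, x_hist, d_n, mu=0.7, eps=2.2204e-16, dtd=0):
    """One NLMS iteration following (9); adaptation is skipped when dtd == 1."""
    y_hat = sum(wi * xi for wi, xi in zip(w, x_hist))  # echo estimate, (5)
    e_n = d_n - y_hat                                  # error signal, (7)
    if dtd == 0:                                       # adaptation allowed
        norm = eps + sum(xi * xi for xi in x_hist)
        w = [wi + mu * xi * e_n / norm for wi, xi in zip(w, x_hist)]
    return w, e_n                                      # dtd == 1 acts like mu = 0

def run_aec(x, d, L, dtd_flags, mu=0.7):
    """Run the echo canceller over far-end signal x and microphone signal d."""
    w = [0.0] * L
    x_hist = [0.0] * L                                 # X(n), newest sample first
    errors = []
    for n in range(len(x)):
        x_hist = [x[n]] + x_hist[:-1]
        w, e_n = nlms_step(w, x_hist, d[n], mu=mu, dtd=dtd_flags[n])
        errors.append(e_n)
    return w, errors

# Toy example (assumed data): noise-free echo through a 3-tap path.
rng = random.Random(0)
h_true = [0.5, -0.3, 0.1]
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
d = [sum(h_true[k] * x[n - k] for k in range(len(h_true)) if n - k >= 0)
     for n in range(len(x))]
w_adapt, _ = run_aec(x, d, L=3, dtd_flags=[0] * len(x))   # DTD = 0 everywhere
w_frozen, _ = run_aec(x, d, L=3, dtd_flags=[1] * len(x))  # DTD = 1 everywhere
```

With the DTD flag at 0 the weights move toward the toy echo path, and with the flag at 1 they stay at their initial values, which is the behavior the control rule is meant to enforce during double-talk.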
The best-known DTD method is the Geigel algorithm [3]. It compares the magnitude of the microphone signal $d(n)$ with the $L$ most recent samples of the far-end signal. Its decision statistic $\xi_G(n)$ is defined as:

$$ \xi_G(n) = \frac{\max\{|x(n)|, |x(n-1)|, \ldots, |x(n-L+1)|\}}{|d(n)|} \tag{10} $$
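The Geigel statistic (10) is simple enough to sketch directly. The snippet below is an illustrative sketch only: the function names are hypothetical, the small constant `tiny` is an assumption added to avoid division by zero when $d(n) = 0$, and the default threshold is the value $T_G = 0.72$ listed in Table I.

```python
def geigel_statistic(x_hist, d_n, tiny=1e-12):
    """Decision statistic of (10): largest recent far-end magnitude over |d(n)|."""
    # tiny is an assumed guard against division by zero, not part of (10)
    return max(abs(v) for v in x_hist) / (abs(d_n) + tiny)

def geigel_dtd(x_hist, d_n, threshold=0.72):
    """Return 1 (double-talk declared) when the statistic falls below the threshold."""
    return 1 if geigel_statistic(x_hist, d_n) < threshold else 0

# Example with assumed values: a weak and a strong microphone sample.
x_hist = [0.1, -0.5, 0.2]
flag_echo_only = geigel_dtd(x_hist, 0.25)   # statistic is about 2, no double-talk
flag_double_talk = geigel_dtd(x_hist, 1.0)  # statistic is about 0.5, double-talk
```

Only one comparison per sample is needed, which explains the popularity of the method, but the decision relies on magnitudes alone.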
A DTD method based on the cross-correlation between the far-end and error signals is proposed in [5], and approximate versions have followed, such as the Normalized Cross-Correlation (NCC) method in [7], [11]. A decision statistic $\xi_{MECC}(n)$ (Microphone Error Cross-Correlation decision) is defined by:

$$ \xi_{MECC}(n) = 1 - \frac{r_{de}}{\sigma_d^2} \tag{11} $$

where $r_{de} = E[d(n)\,e(n)]$ is the cross-correlation between $d(n)$ and $e(n)$, $E[\cdot]$ denotes the mathematical expectation and $\sigma_d^2$ is the variance of $d(n)$. This decision statistic is based on estimates of $r_{de}$ and $\sigma_d^2$, which are computed recursively:

$$ r_{de}(n) = \lambda\, r_{de}(n-1) + (1-\lambda)\, e(n)\, d^{T}(n) \tag{12} $$

$$ \sigma_d^2(n) = \lambda\, \sigma_d^2(n-1) + (1-\lambda)\, d(n)\, d^{T}(n) \tag{13} $$
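The recursions (12) and (13), combined with the statistic (11), can be sketched as a small stateful detector. This is a hedged sketch, not the reference implementation of [7] or [11]: the class name `MeccDetector` is hypothetical, $d(n)$ and $e(n)$ are treated as scalars, `lam` stands for the weighting factor $\lambda$, the constant `tiny` is an assumed guard against division by zero, and the defaults are the values of $\lambda$ and $T_{NCC}$ listed in Table I.

```python
class MeccDetector:
    """Recursive estimate of r_de and sigma_d^2, (12)-(13), and the statistic (11)."""

    def __init__(self, lam=0.95, threshold=0.92, tiny=1e-12):
        self.lam = lam              # exponential weighting factor (lambda)
        self.threshold = threshold  # decision threshold
        self.tiny = tiny            # assumed guard, not part of (11)
        self.r_de = 0.0
        self.var_d = 0.0

    def update(self, d_n, e_n):
        """Process one sample pair and return (statistic, DTD flag)."""
        self.r_de = self.lam * self.r_de + (1.0 - self.lam) * e_n * d_n
        self.var_d = self.lam * self.var_d + (1.0 - self.lam) * d_n * d_n
        xi = 1.0 - self.r_de / (self.var_d + self.tiny)
        return xi, (1 if xi < self.threshold else 0)

# Two limiting cases with assumed values.
converged = MeccDetector()
xi_echo, flag_echo = converged.update(d_n=0.5, e_n=0.0)  # echo fully cancelled
near_end = MeccDetector()
xi_dt, flag_dt = near_end.update(d_n=0.5, e_n=0.5)       # error equals microphone signal
```

When the echo is fully cancelled the error is uncorrelated with the microphone signal and the statistic stays at 1; when the error carries the whole microphone signal the statistic drops toward 0 and double-talk is declared.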
Here $\lambda$ is the exponential weighting factor, with $0.9 < \lambda < 1$.

III. PROPOSED METHOD

In this paper, we propose a new DTD method based on the extraction of speech characteristics from the far-end and the microphone speech signals. The main purpose is to discriminate between these signals in order to detect DT periods. The method has two stages: feature extraction and a similarity measure for the DT decision.

A. Feature extraction

We use a frame $f$ consisting of the $M$ most recent samples of the normalized far-end and microphone signals:

$$ X_M(n) = [x(n), x(n-1), \ldots, x(n-M+1)]^{T} \tag{14} $$

$$ D_M(n) = [d(n), d(n-1), \ldots, d(n-M+1)]^{T} \tag{15} $$
For each frame we calculate a characteristic vector $V(n)$ containing four speech features: the energy $E_n(\cdot)$, the standard deviation $\mathrm{std}(\cdot)$, the maximum value $\max(\cdot)$ and the log-energy $E_s(\cdot)$. It is defined as:

$$ V(n) = \begin{bmatrix} \alpha_1\, E_n(f) \\ \alpha_2\, \mathrm{std}(f) \\ \alpha_3\, \max(f) \\ \dfrac{\alpha_4 + E_s(f)}{\alpha_4} \end{bmatrix} \tag{16} $$

where $\alpha_1$, $\alpha_2$, $\alpha_3$, $\alpha_4$ are positive normalization parameters. In addition, the energy $E_n(n)$ of the $n$-th frame $f$ is given by:

$$ E_n(n) = \frac{1}{M} \sum_{i=1}^{M} f^2(i) \tag{17} $$

where $f(i)$ is the $i$-th sample of the frame $f$, and $M \leq L$. The other descriptor, the log-energy $E_s(n)$ of the $n$-th frame $f$, is defined as:

$$ E_s(n) = 10 \log_{10}\left( \epsilon + \frac{1}{M} \sum_{i=1}^{M} f^2(i) \right) \tag{18} $$

where $\epsilon$ is a small positive constant set to $10^{-5}$.

B. Double-talk decision

To detect the double-talk periods, a decision variable $\xi_{SFE}(n)$ (Speech Feature Extraction decision) is calculated at each iteration. It is defined from the Euclidean distance between the characteristic vectors $V_x(n)$ and $V_d(n)$ of the far-end frame and the microphone frame, respectively:

$$ \xi_{SFE}(n) = 1 - \| V_x(n) - V_d(n) \| - \beta\, \sigma_b^2 \tag{19} $$

where $\| V_x(n) - V_d(n) \|$ denotes the Euclidean distance between the two vectors $V_x(n)$ and $V_d(n)$, $\beta$ is a positive constant and $\sigma_b^2$ is the estimated variance of the background noise. Then, a dynamic threshold is used that depends on near-end voice activity, where a Voice Activity Detection (VAD) module based on the energy of the near-end frame is defined as:

$$ VAD = \begin{cases} 1, & \text{if } E_n(n) \geq T_{VAD} \\ 0, & \text{if } E_n(n) < T_{VAD} \end{cases} $$

where $T_{VAD}$ is the VAD threshold. The threshold $T_{SFE}$ for the DT decision is defined by:

$$ T_{SFE} = \begin{cases} T_{SFE1}, & \text{if } VAD = 1 \\ T_{SFE2}, & \text{if } VAD = 0 \end{cases} $$

where $T_{SFE1}$ and $T_{SFE2}$ are constant thresholds. The detection of double-talk by the proposed method is defined as:

$$ DTD = \begin{cases} 0, & \xi_{SFE}(n) \geq T_{SFE} \\ 1, & \xi_{SFE}(n) < T_{SFE} \end{cases} $$

IV. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed method, we use the NLMS algorithm to update the filter coefficients over a total of $N$ iterations. The frame $f$ has a length of 512 samples. The acoustic echo path is modeled by a real impulse response of a car cockpit, represented with $L$ points and sampled at 16 kHz. The proposed DTD method is compared with the Geigel and NCC algorithms, which were run with the parameter values listed in Table I. These parameters were determined by trial and error.

TABLE I. Optimal values of AEC with DTD parameters.
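| Parameter | Value | Parameter | Value |
|---|---|---|---|
| $\mu$ | 0.7 | $\alpha_1$ | 4 |
| $\varepsilon$ | $2.2204 \times 10^{-16}$ | $\alpha_2$ | 2 |
| $N$ | 40000 | $\alpha_3$ | 1 |
| $L$ | 1024 | $\alpha_4$ | 50 |
| $M$ | 512 | $\beta$ | 5 |
| $T_G$ | 0.72 | $T_{VAD}$ | 0.006 |
| $T_{NCC}$ | 0.92 | $T_{SFE1}$ | 0.7 |
| $\lambda$ | 0.95 | $T_{SFE2}$ | 0.9 |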
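With the parameter values of Table I, the feature extraction and decision stages of Section III can be sketched as follows. This is an illustrative sketch under several assumptions, not the authors' code: the function names are hypothetical; the decision variable (19) is assumed to have the form $1 - \|V_x - V_d\| - \beta\sigma_b^2$; the standard deviation is taken as the population standard deviation; the VAD energy is computed on the microphone frame, since the near-end speech is not observed separately; and $T_{SFE2} = 0.9$ is assumed for the second threshold.

```python
import math

ALPHA = (4.0, 2.0, 1.0, 50.0)   # alpha_1..alpha_4 (Table I)
BETA = 5.0                      # beta (Table I)
T_VAD = 0.006                   # VAD threshold (Table I)
T_SFE1, T_SFE2 = 0.7, 0.9       # thresholds; T_SFE2 = 0.9 is an assumption
EPS_LOG = 1e-5                  # epsilon in the log-energy

def energy(frame):
    """Frame energy, (17)."""
    return sum(v * v for v in frame) / len(frame)

def log_energy(frame):
    """Frame log-energy in dB, (18)."""
    return 10.0 * math.log10(EPS_LOG + energy(frame))

def std(frame):
    """Population standard deviation (assumed definition of std)."""
    mean = sum(frame) / len(frame)
    return math.sqrt(sum((v - mean) ** 2 for v in frame) / len(frame))

def feature_vector(frame):
    """Characteristic vector V(n), (16)."""
    a1, a2, a3, a4 = ALPHA
    return [a1 * energy(frame), a2 * std(frame), a3 * max(frame),
            (a4 + log_energy(frame)) / a4]

def sfe_statistic(x_frame, d_frame, noise_var):
    """Decision variable in the assumed form 1 - ||Vx - Vd|| - beta * sigma_b^2."""
    vx, vd = feature_vector(x_frame), feature_vector(d_frame)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vx, vd)))
    return 1.0 - dist - BETA * noise_var

def sfe_dtd(x_frame, d_frame, noise_var):
    """Return 1 when double-talk is declared, using the VAD-dependent threshold."""
    vad = 1 if energy(d_frame) >= T_VAD else 0   # assumed: VAD on microphone frame
    threshold = T_SFE1 if vad == 1 else T_SFE2
    return 1 if sfe_statistic(x_frame, d_frame, noise_var) < threshold else 0
```

When the far-end and microphone frames have identical features, the distance is zero and the statistic equals 1 in the noise-free case, so no double-talk is declared; a microphone frame that is much louder than the far-end frame moves the feature vectors apart and drives the statistic below the threshold.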
The test speech signals of the far-end and near-end speakers are taken from the TIMIT database [12] and are sampled at 16 kHz. The objective evaluation uses two criteria, the Echo Return Loss Enhancement (ERLE) and the misalignment, which are given as follows:

$$ ERLE\,(dB) = 10 \log_{10} \frac{E[y^2(n)]}{E[e^2(n)]} \tag{20} $$

$$ Misalignment\,(dB) = 10 \log_{10} \frac{\| W(n) - h \|^2}{\| h \|^2} \tag{21} $$

where $\| W(n) - h \|$ is the Euclidean distance between the adaptive coefficient vector and the true echo path vector, and $\| \cdot \|$ denotes the Euclidean norm of a vector. The experimental results are presented below.
Fig 5 compares the DTD outputs of the proposed method and the NCC algorithm; the proposed method reduces the missed detections of double-talk periods.
[Plot: far-end and near-end signals (amplitude) and misalignment (dB) versus iteration number ($\times 10^4$) for the Geigel algorithm, the NCC algorithm and the proposed method, with a change in the echo path marked.]
The misalignment curves in Fig 2 show that the proposed method improves the convergence of the filter coefficients toward their optimal values compared with the Geigel and NCC algorithms.

[Plot: ERLE (dB) versus iteration ($\times 64$) for the proposed method, the NCC algorithm and the Geigel algorithm.]
Fig 3. ERLE evaluation.
Fig 3 shows the ERLE, which is evaluated every 64 iterations. These results show that the proposed method achieves a larger ERLE than the other methods.
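The two objective measures (20) and (21) can be sketched as below. This is a minimal sketch under stated assumptions: the function names are hypothetical, the expectation in (20) is approximated by an average over non-overlapping blocks of 64 samples (matching the 64-iteration evaluation interval mentioned above), and the constant `tiny` is an assumed guard against taking the logarithm of zero.

```python
import math

def misalignment_db(w, h, tiny=1e-12):
    """Misalignment of (21) between adaptive weights w and true echo path h."""
    num = sum((wi - hi) ** 2 for wi, hi in zip(w, h))
    den = sum(hi * hi for hi in h)
    return 10.0 * math.log10((num + tiny) / den)   # tiny: assumed guard for w == h

def erle_db(y, e, block=64, tiny=1e-12):
    """ERLE of (20), with E[.] replaced by a per-block average (assumption)."""
    values = []
    for start in range(0, len(y) - block + 1, block):
        p_y = sum(v * v for v in y[start:start + block]) / block
        p_e = sum(v * v for v in e[start:start + block]) / block
        values.append(10.0 * math.log10((p_y + tiny) / (p_e + tiny)))
    return values

# Example with assumed values: the residual is ten times smaller than the echo.
erle_curve = erle_db([1.0] * 128, [0.1] * 128)       # two blocks of about 20 dB
mis = misalignment_db([0.9, 1.0], [1.0, 1.0])        # about -23 dB
```

A larger ERLE and a more negative misalignment both indicate a better echo path estimate, which is how the curves in Fig 2 and Fig 3 are read.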
[Plot: misalignment (dB) versus iteration number ($\times 10^4$) for the Geigel algorithm, the NCC algorithm and the proposed method at SNR = 50 dB, SNR = 20 dB and SNR = 10 dB, with a change in the echo path marked.]
Fig 4. Misalignment evaluation in a noisy environment.
White noise with zero mean is used as ambient noise at different Signal-to-Noise Ratios (SNR) to evaluate the proposed method in a noisy environment, as depicted in Fig 4. The proposed method also performs well in a noisy environment.
[Plot: near-end signal versus samples ($\times 10^4$).]
Fig 5. DTD signals comparison.
To evaluate the performance of the proposed DTD algorithm, we have also used the objective evaluation scheme proposed in [13]. The criteria used to evaluate DTD performance are as follows:
Probability of false alarm $P_f$: the probability of declaring detection when double-talk does not exist.
Probability of detection $P_d$: the probability of successful detection when double-talk does exist.
Probability of miss $P_m = 1 - P_d$: the probability of detection failure when double-talk is present.
The probability of false alarm at each threshold point is calculated with no near-end speech as

$$ P_f = \frac{v_f \cdot \phi}{N_f} \tag{22} $$
where $\phi$ is the DTD output, $v_f$ is the voice activity detector output for the far-end speech, and $N_f$ is the length of the entire far-end speech signal.

Fig 2. Misalignment evaluation in a clean environment.
[Plot panels of Fig 5: far-end signal, near-end signal, DTD output of the proposed method, DTD output of the NCC algorithm.]

Then, the near-end speech is applied, and the miss probability $P_m$ is calculated as

$$ P_m = 1 - \frac{v_f \cdot v_n \cdot \phi}{v_f \cdot v_n} \tag{23} $$
where $v_n$ is the voice activity detector output for the near-end signal. A good detection method should maximize $P_d$ while minimizing $P_f$, even at low SNR. In general, a higher $P_d$ is achieved at the cost of a higher $P_f$, so there is a tradeoff that depends on the penalty, or cost function, of a false alarm and of a miss. The evaluation takes the probability of miss $P_m$ as a function of the Near-end to Far-end speech Ratio (NFR) under a given $P_f$. This probability of false alarm is measured as the proportion of the far-end speech in which double-talk remains declared when there is no near-end speech [13]. For the test speech signals, the previous far-end signal is used. Four different sentences (two male, two female) were chosen for the near end, each about 40000 samples long. Different levels of NFR are generated by attenuating the near-end speech correspondingly.
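The probabilities (22) and (23) can be computed from binary sequences as sketched below. This is a hedged sketch rather than the evaluation code of [13]: the function names are hypothetical, and the products in (22) and (23) are assumed to denote sample-wise products of the binary sequences summed over the signal, with $N_f$ taken as the number of far-end samples.

```python
def false_alarm_probability(v_f, phi):
    """P_f of (22): DTD output phi counted where far-end speech is active (v_f = 1)."""
    return sum(a * b for a, b in zip(v_f, phi)) / len(v_f)   # N_f = len(v_f), assumed

def miss_probability(v_f, v_n, phi):
    """P_m of (23): share of true double-talk samples that the DTD fails to flag."""
    both_active = sum(a * b for a, b in zip(v_f, v_n))
    if both_active == 0:
        raise ValueError("no double-talk samples to evaluate")
    detected = sum(a * b * c for a, b, c in zip(v_f, v_n, phi))
    return 1.0 - detected / both_active

# Example with assumed binary sequences.
p_f = false_alarm_probability([1, 1, 1, 0], [1, 0, 0, 1])        # 1 of 4 samples
p_m = miss_probability([1, 1, 1, 0], [0, 1, 1, 1], [0, 1, 0, 1])  # 1 of 2 detected
```

In the scheme of [13], the detector threshold is first tuned without near-end speech to reach the target $P_f$, and $P_m$ is then measured at each NFR with that threshold held fixed.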
[Plot: $P_m$ versus NFR (dB) for the Geigel algorithm, the NCC algorithm and the proposed method.]
Fig 6. Comparison of 𝑃𝑚 in terms of NFR (SNR = 20dB, 𝑃𝑓 = 0.15).
Fig 6 shows $P_m$ under various NFRs in the range of -10 dB to 15 dB. The proposed method clearly gives a lower $P_m$ than the NCC and Geigel algorithms. The perceptual evaluation of speech quality (PESQ) [14] of the output speech is measured for each method, as reported in Table II. These results indicate that the proposed method enhances speech intelligibility.

TABLE II. Perceptual evaluation.

| Method | PESQ |
|---|---|
| No echo cancellation | 1.132 |
| No double-talk detection | 1.439 |
| Geigel | 1.742 |
| NCC | 1.751 |
| Proposed | 2.101 |

V. CONCLUSION
The goal of a DTD is to halt the update of the filter coefficients during double-talk periods. In this paper we propose a new DTD method based on speech feature extraction, which discriminates between the far-end and microphone speech signals in order to declare DT periods. The experimental results demonstrate the good performance of the proposed method in terms of minimizing the misalignment, enhancing the ERLE, giving a low probability of missed detection and improving the transmitted speech intelligibility compared with the Geigel and NCC algorithms. In future work, we plan to use a voiced/unvoiced classification of speech to improve the double-talk decision and to combine frequency-domain and temporal speech features to increase the discriminating ability.
REFERENCES
[1] M. Hamidia, A. Amrouche, “Influence of noisy channel on acoustic echo cancellation in mobile communication,” In proceeding of IEEE, International Conference on Microelectronics (ICM), pp. 1-4, 2012.
[2] S. Haykin, “Adaptive filter theory,” Third edition, Prentice Hall Inc, New York, 1996.
[3] T. A. Vu, H. Ding, M. Bouchard, “A survey of double-talk detection schemes for echo cancellation application,” Canadian acoustic journal, Vol. 32, N°. 3, pp. 144-145, 2004.
[4] D. Duttweiler, “A twelve-channel digital echo canceler,” IEEE Trans on Communications, Vol. 26, N°. 5, pp. 647-653, 1978.
[5] H. Ye and B.-X. Wu, “A new double-talk detection algorithm based on the orthogonality theorem,” IEEE Trans on Communications, Vol. 39, N°. 11, pp. 1542-1545, 1991.
[6] T. Gander, M. Hansson, C.-J. Ivarsson, G. Salomonsson, “A double-talk detector based on coherence,” IEEE Trans on Communications, Vol. 44, N°. 11, pp. 1421-1427, 1996.
[7] J. Benesty, D. R. Morgan, J. H. Cho, “A new class of doubletalk detectors based on cross-correlation,” IEEE Trans on speech and audio processing, Vol. 8, N°. 2, pp. 168-172, 2000.
[8] P. Åhgren, “Acoustic echo cancellation and double-talk detection using estimated loudspeaker impulse responses,” IEEE Trans on speech and audio processing, Vol. 13, N°. 6, pp. 1231-1237, 2005.
[9] C. Schüldt, F. Lindstrom, I. Claesson, “A delay-based double-talk detector,” IEEE Trans on speech and audio processing, Vol. 20, N°. 6, pp. 1725-1733, 2012.
[10] I. J. Tashev, “Coherence based double talk detector with soft decision,” In proceeding of IEEE, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 165-168, 2012.
[11] M. Iqbal, J. Stokes, and S. Grant, “Normalized double-talk detection based on microphone and AEC error cross correlation,” In proceeding of IEEE, International Conference on Multimedia and Expo (ICME), pp. 360-363, 2007.
[12] W. Fisher, V. Zue, J. Bernstein, D. Pallet, “An acoustic-phonetic data base,” Journal of the Acoustical Society of America, Vol. 81, N°. 1, pp. S92-S93, 1987.
[13] J. H. Cho, D. R. Morgan, and J. Benesty, “An objective technique for evaluating doubletalk detectors in acoustic echo cancellers,” IEEE Trans on Speech Audio Processing, Vol. 7, N°. 6, pp. 718-724, 1999.
[14] ITU-T P.862, “Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs,” 2002.