Speech Quality Assessment Based on Virtual Instrumentation
Radek Martinek, Radana Kahankova, Petr Bilik, Jan Nedoma, Marcel Fajkus, Petr Blaha
Faculty of Electrical Engineering and Computer Science, VSB - Technical University of Ostrava
17. listopadu 2172/15, 708 00 Ostrava-Poruba, Czech Republic
[email protected],
[email protected],
[email protected] ,
[email protected],
[email protected] ,
[email protected]
The structure of the paper is as follows. Section 2 describes the adaptive algorithms used in this paper. Section 3 introduces the method called dynamic time warping (DTW) [14], [15], [16] and [17], which helps with the recognition of the recordings and with the evaluation of the quality of filtration using adaptive filters. Finally, Section 4 describes the experiments carried out with the implemented speech recognition system.
ABSTRACT
This paper introduces a program for the objective and subjective evaluation of speech quality. Using this environment, a large number of speech recordings and various indoor and outdoor noises were processed. As a subjective speech evaluation method, the dynamic time warping (DTW) method was selected, with PARCOR coefficients chosen as feature vectors. To filter the noise from the recordings, adaptive filtering based on the LMS and RLS algorithms was used, and its performance was assessed. Similarity ranged from 70% to 95% for both algorithms. In terms of signal to noise ratio, the RLS algorithm ranged from 36 dB to 42 dB, while the LMS algorithm only varied from 20 dB to 29 dB.
2. ADAPTIVE ALGORITHMS
2.1 Standard LMS
The standard LMS algorithm performs the following operations to update the coefficients of an adaptive filter:
CCS Concepts • Software and its engineering → Software notations and tools → Software configuration management and version control systems.
Keywords Signal to Noise Ratio; Voice Activity Detector; Dynamic Time Warping (DTW); Partial Correlation Coefficients; Adaptive Filter; Least Mean Squares; Recursive Least Squares.
Calculates the output signal y(n) from the adaptive filter.
Calculates the error signal e(n) by using the following equation:
$$e(n) = d(n) - y(n) \tag{1}$$
where d(n) is the desired signal.
Updates the filter coefficients by using the following equation:
$$w(n+1) = w(n) + \mu\, e(n)\, x(n) \tag{2}$$
where μ is the step size of the adaptive filter, w(n) is the filter coefficients vector, and x(n) is the filter input vector. Step size is a crucial parameter that can improve the convergence speed of the adaptive filter. It determines both how quickly and how closely the adaptive filter converges to the filter solution.
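The three LMS operations above can be sketched in a few lines of Python with NumPy. This is a minimal illustration rather than the program described in the paper: the function name `lms_filter`, the system-identification demo and the coefficients `h` are assumptions made for the example, and the demo uses a larger step size than a speech application would so that convergence is visible on a short signal.

```python
import numpy as np


def lms_filter(x, d, order, mu):
    """Standard LMS: return output y, error e and the final coefficients w."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    w = np.zeros(order)                      # coefficient vector w(n)
    y = np.zeros(len(x))
    e = np.zeros(len(x))
    for n in range(order - 1, len(x)):
        xn = x[n - order + 1:n + 1][::-1]    # input vector x(n), newest sample first
        y[n] = w @ xn                        # output of the adaptive filter
        e[n] = d[n] - y[n]                   # error signal, equation (1)
        w = w + mu * e[n] * xn               # coefficient update, equation (2)
    return y, e, w


if __name__ == "__main__":
    # Hypothetical demo: identify a short FIR system from white noise.
    rng = np.random.default_rng(0)
    h = np.array([0.5, -0.3, 0.1])           # assumed "unknown" system
    x = rng.standard_normal(5000)
    d = np.convolve(x, h)[:len(x)]           # desired signal d(n)
    _, e, w = lms_filter(x, d, order=3, mu=0.01)
    print("estimated coefficients:", np.round(w, 4))
```

A smaller `mu` slows convergence but reduces the residual error when the desired signal is noisy, which is the trade-off the step size controls.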
1. INTRODUCTION
Speech processing, specifically computer analysis, is used to solve a wide range of communication technology problems, see [1], [2], and [3]. Examples include the detection of speech in a noisy environment and speech recognition with conversion to text, which could, for instance, help blind people create documents. Moreover, voice-operated home appliance control systems are still being developed and refined [4], [5], and [6].
2.2 Standard RLS
The standard RLS algorithm performs the following operations to update the coefficients of an adaptive filter:
The aim of the paper is to introduce our program, which can be used to capture speech and noise recordings [7], [8], from which the quality of speech is assessed using objective and subjective methods. To filter the noise, adaptive filters based on the LMS and RLS algorithms were used, see [9], [10], [11], [12], and [13]. Their efficiency is assessed in the paper.
Calculates the output signal of the adaptive filter according to the following formula:
$$y(n) = w^{T}(n)\, x(n) \tag{3}$$
Determines the error estimate e(n) using the equation:
$$e(n) = d(n) - y(n) \tag{4}$$
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICCMS 2018, January 8–10, 2018, Sydney, Australia © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-6339-6/18/01…$15.00
Updates the filter coefficients according to the following equation:
$$w(n+1) = w(n) + e(n)\, K(n) \tag{5}$$
where w(n) is the filter coefficient vector, λ is the forgetting factor, and K(n) is the gain vector, defined as:
$$K(n) = \frac{P(n)\, x(n)}{\lambda + x^{T}(n)\, P(n)\, x(n)} \tag{6}$$
where P(n) is the inverse correlation matrix of the input signal and its initial value P(0) has the following form:
$$P(0) = \delta^{-1} I = \delta^{-1} \begin{bmatrix} 1 & & 0 \\ & \ddots & \\ 0 & & 1 \end{bmatrix} \tag{7}$$
where δ is the regularization factor. The standard RLS algorithm uses the following equation to update the inverse correlation matrix:
$$P(n+1) = \lambda^{-1} P(n) - \lambda^{-1} K(n)\, x^{T}(n)\, P(n) \tag{8}$$
Repeats all of these steps for the next iteration.
The choice of the forgetting factor depends on the number of samples k according to the following relation:
(9)
If the signal being analyzed is stationary, the forgetting factor λ should be set to one. Otherwise, it should be less than one.
https://doi.org/10.1145/3177457.3177459
3. DYNAMIC TIME WARPING (DTW)
The Dynamic Time Warping (DTW) method uses the dynamic programming principle, see Figure 1. Although this method was developed a long time ago, it is still used because of its simplicity. It is used to recognize isolated words (signals) or short sections, mainly in speech recognition. In most cases, a reference recording serves as the pattern for recognition. Using the algorithm, it is possible to compare the pronounced word (or sequence of words) with a given reference recording and calculate the distance of the DTW path. For the best possible match, the distance should be as small as possible. This paper uses the method for the classification and evaluation of speech recordings that were processed by filters based on the LMS and RLS algorithms.
Figure 1. Dynamic Time-Warping on two 1D signals.
Each signal is expressed by a sequence of vectors. As the test signal we will consider the signal A:
$$A = \{a(1), a(2), a(3), \ldots, a(n), \ldots, a(N)\} \tag{10}$$
and the reference signal denoted as B:
$$B = \{b(1), b(2), b(3), \ldots, b(m), \ldots, b(M)\} \tag{11}$$
This means that a(n) denotes the n-th vector of the test signal A and b(m) is the m-th vector of the reference signal B; N and M are the numbers of vectors in A and B. DTW searches for the optimal path in the plane (n, m) defined as:
$$m = \psi(n) \tag{12}$$
This path minimizes the function D, defined as the distance between the vectors A and B:
$$D(A, B) = \sum_{n=1}^{N} \hat{d}\,[a(n), b(\psi(n))] \tag{13}$$
where $\hat{d}\,[a(n), b(\psi(n))]$ is the local distance between the n-th test vector and the reference word vector m. The time variable k is introduced. The time variables n and m are defined as:
$$n = i(k), \quad m = j(k) \tag{14}$$
If there are two signals with known start and end points, we can express the limitation of the DTW function by:
$$i(1) = 1, \quad j(1) = 1; \qquad i(K) = N, \quad j(K) = M \tag{15}$$
The individual elements of the matrix G are recursively calculated according to the formula:
(16)
where k = 1, ..., K and d[i(k), j(k)] is the appropriate local distance from matrix D. If a weighting function is used, then:
(17)
The final normalized distance between images A and B can be calculated according to:
(18)
If D is equal to 0, the images A and B are identical; in the opposite extreme, D is equal to 1 [17], [18].
4. EXPERIMENTS
To isolate individual words using the DTW method, recordings of a single speaker counting from one to ten in Czech were used. From these words, a library was created from which the user can load any specific word. During the calculations, the user can change the selected word from the library, and this word is then used as the reference.
Figure 2 shows the reference word ‘čtyři’ (i.e. four in Czech) and the same signal with additive noise. For the experiments, the detector based on the DTW method was used to compare the reference speech recordings with the recordings after filtration to evaluate their performance. Table 1 shows a comparison of the individual reference words with recordings that were filtered using adaptive filtering based on the LMS algorithm, with the convergence constant set to μ = 0.0003 and the filter order set to 5. The second part of the table compares the references with the speech filtered using the RLS algorithm with the forgetting factor λ set to 1. The optimal settings of the algorithms were determined experimentally according to the value of SNRout.
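For readers who want to experiment with the RLS recursion of Section 2.2, the following Python sketch implements the output, error, gain, coefficient and inverse correlation matrix updates with the forgetting factor set to 1, the value reported for the experiments. It is an illustration under stated assumptions, not the authors' implementation: the function name `rls_filter`, the regularization value `delta = 0.01`, the filter order and the synthetic system-identification demo are all chosen for the example.

```python
import numpy as np


def rls_filter(x, d, order, lam=1.0, delta=0.01):
    """Standard RLS: return output y, error e and the final coefficients w."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    w = np.zeros(order)                      # coefficient vector w(n)
    P = np.eye(order) / delta                # P(0) = delta^-1 * I (delta is an assumed value)
    y = np.zeros(len(x))
    e = np.zeros(len(x))
    for n in range(order - 1, len(x)):
        xn = x[n - order + 1:n + 1][::-1]    # input vector x(n), newest sample first
        y[n] = w @ xn                        # filter output
        e[n] = d[n] - y[n]                   # error estimate
        Px = P @ xn
        K = Px / (lam + xn @ Px)             # gain vector K(n)
        w = w + e[n] * K                     # coefficient update
        P = (P - np.outer(K, xn @ P)) / lam  # inverse correlation matrix update
    return y, e, w


if __name__ == "__main__":
    # Hypothetical demo: identify a short FIR system from white noise.
    rng = np.random.default_rng(0)
    h = np.array([0.5, -0.3, 0.1])           # assumed "unknown" system
    x = rng.standard_normal(500)
    d = np.convolve(x, h)[:len(x)]
    _, e, w = rls_filter(x, d, order=3, lam=1.0)
    print("estimated coefficients:", np.round(w, 4))
```

With `lam=1.0` all past samples are weighted equally, which suits stationary signals; values slightly below one let the filter track changing conditions at the cost of noisier estimates.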
Figure 2. The signals used for the experiments and the evaluation of the adaptive systems.
Table 1. Results for each algorithm.
Word     LMS                            RLS
         D (-)    D (%)    SSNR (dB)    D (-)    D (%)    SSNR (dB)
Jedna    0.2193   78.07    23.84        0.2722   72.78    41.40
Dva      0.2827   71.73    25.14        0.3416   65.84    36.86
Tři      0.0816   91.84    28.29        0.1042   89.58    39.79
Čtyři    0.2036   79.64    25.89        0.1631   83.69    40.41
Pět      0.2421   75.79    21.12        0.3027   69.73    36.39
Šest     0.1597   84.03    20.06        0.1416   85.84    36.34
Sedm     0.0511   94.89    27.03        0.1220   87.80    41.72
Osm      0.2145   78.55    23.96        0.2260   77.40    38.14
Devět    0.1649   83.51    28.62        0.2498   75.02    40.69
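The D columns of Table 1 are DTW distances, and the percentage column is consistent with 100·(1 − D) in every row (for example, 0.2193 gives 78.07). The sketch below shows how such a distance could be computed. Several choices are assumptions, because the paper's recursion and normalization formulas are not reproduced here: it uses the common symmetric step pattern, a Euclidean local distance, and normalization by N + M, so its values are not guaranteed to fall between 0 and 1 and will not match the table. The function names are hypothetical, and the paper uses PARCOR coefficient vectors where the demo uses plain numbers.

```python
import numpy as np


def dtw_distance(a, b):
    """Normalized DTW distance between two sequences of feature vectors."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)   # N vectors
    b = np.asarray(b, dtype=float).reshape(len(b), -1)   # M vectors
    n_a, n_b = len(a), len(b)
    # Local distance between every test vector a(n) and reference vector b(m).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # Accumulated distance matrix with one extra border row and column.
    g = np.full((n_a + 1, n_b + 1), np.inf)
    g[0, 0] = 0.0
    for n in range(1, n_a + 1):
        for m in range(1, n_b + 1):
            g[n, m] = d[n - 1, m - 1] + min(g[n - 1, m], g[n - 1, m - 1], g[n, m - 1])
    return g[n_a, n_b] / (n_a + n_b)         # assumed normalization by N + M


def similarity_percent(distance):
    """Convert a distance D in [0, 1] to the percentage used in Table 1."""
    return 100.0 * (1.0 - distance)


if __name__ == "__main__":
    reference = [0.0, 1.0, 2.0, 1.0, 0.0]
    stretched = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 0.0]   # same shape, slower
    print("DTW distance:", dtw_distance(stretched, reference))
    print("similarity for D = 0.2193:", round(similarity_percent(0.2193), 2))
```

Because DTW may repeat samples along the path, a time-stretched copy of the reference yields a distance of zero, which is the property that makes the method tolerant of differences in speaking rate.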
When evaluating the similarity of filtered speech, two adaptive filter algorithms were used. According to the results in Table 1, the output of the first, the LMS algorithm, was approximately 72% to 95% similar to the reference signal. The second was the RLS algorithm. Its similarity values were close to those of the LMS algorithm, but slightly lower. In terms of SSNR, the RLS algorithm, whose values went up by 20 dB, achieved better results. Figure 3 shows the output signals in the time domain for visual evaluation.
Figure 3. Comparison of the filtered signals with the reference.
Figure 4 shows the same signals in the frequency domain. In both domains, the noise was visibly suppressed after applying the LMS and RLS algorithms. However, these results show that the RLS algorithm was more effective in eliminating the noise component. That does not fully correspond to the information in Table 1, where the LMS algorithm outperformed the RLS algorithm according to the D value. On the other hand, according to the SSNR, the RLS algorithm was more effective. This disagreement was probably caused by the inaccuracy of the detector.
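The text does not state how the SSNR values were computed. Assuming SSNR denotes the segmental signal to noise ratio in its usual form (the SNR in decibels computed per frame and then averaged), a minimal Python sketch could look as follows. The function name, the frame length of 256 samples and the absence of any per-frame clipping are assumptions for illustration only.

```python
import numpy as np


def segmental_snr(clean, processed, frame_len=256, eps=1e-12):
    """Average per-frame SNR in dB between a clean signal and a processed one."""
    clean = np.asarray(clean, dtype=float)
    processed = np.asarray(processed, dtype=float)
    n_frames = min(len(clean), len(processed)) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        err = s - processed[i * frame_len:(i + 1) * frame_len]   # residual noise
        # eps guards against division by zero and log of zero in silent frames.
        snrs.append(10.0 * np.log10((np.sum(s ** 2) + eps) / (np.sum(err ** 2) + eps)))
    return float(np.mean(snrs))


if __name__ == "__main__":
    t = np.arange(4096) / 8000.0
    clean = np.sin(2 * np.pi * 440.0 * t)    # synthetic stand-in for speech
    processed = 1.1 * clean                  # residual error is 10% of the signal
    print("SSNR (dB):", round(segmental_snr(clean, processed), 2))
```

Averaging per frame gives quiet segments the same weight as loud ones, which is one reason a segmental measure can rank two filters differently from a distance-based measure such as D.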
Figure 4. The results of the comparison in the frequency domain.
5. CONCLUSION
In this article, a word recognition algorithm based on the dynamic time warping (DTW) method was tested. To assess similarity using this method, the speech detector and its correct setting are important to avoid recognition errors. In the experiments, the reference recordings were recognized and compared with recordings filtered using the LMS and RLS algorithms. This way, the performance of the adaptive filtering was assessed. Similarity ranged from 70% to 95% for both algorithms, which is sufficient to recognize words from the dictionary used. The results of the algorithms differed in the value of the signal to noise ratio, where the RLS algorithm ranged from 36 dB to 42 dB, while the LMS algorithm only varied from 20 dB to 29 dB.
6. ACKNOWLEDGMENTS
The research received financial support from the SGS grant No. SP2017/128, VSB-Technical University of Ostrava, Czech Republic.
7. REFERENCES
[1] Martinek, R., Kelnar, M., Vanus, J., Bilik, P., & Zidek, J. (2015). A robust approach for acoustic noise suppression in speech using anfis. Journal of Electrical Engineering, 66(6), 301-310.
[2] Martinek, R., & Zidek, J. (2010, September). Use of adaptive filtering for noise reduction in communications systems. In Applied Electronics (AE), 2010 International Conference on (pp. 1-6). IEEE.
[3] Massaro, D. W. (Ed.). (2014). Understanding language: An information-processing analysis of speech perception, reading, and psycholinguistics. Academic Press.
[4] Jeet, V., Dhillon, H. S., & Bhatia, S. (2015). Radio frequency home appliance control based on head tracking and voice control for disabled person. In Communication Systems and Network Technologies (CSNT), 2015 Fifth International Conference on. IEEE.
[5] Sen, S., et al. (2015). Design of an intelligent voice controlled home automation system. International Journal of Computer Applications, 121(15).
[6] Vanus, J., Smolon, M., Martinek, R., Koziorek, J., Zidek, J., & Bilik, P. (2015). Testing of the voice communication in smart home care. Human-Centric Computing and Information Sciences, 5(1), 1-22.
[7] Narayanan, A., & Wang, D. (2014, May). Joint noise adaptive training for robust automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on (pp. 2504-2508). IEEE.
[8] Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Le Roux, J., Hershey, J. R., & Schuller, B. (2015, August). Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In International Conference on Latent Variable Analysis and Signal Separation (pp. 91-99). Springer, Cham.
[9] Martinek, R., Kelnar, M., Vanus, J., Koudelka, P., Bilik, P., Koziorek, J., & Zidek, J. (2015, July). Adaptive noise suppression in voice communication using a neuro-fuzzy inference system. In Telecommunications and Signal Processing (TSP), 2015 38th International Conference on (pp. 382-386). IEEE.
[10] Kahankova, R., Martinek, R., & Bilik, P. (2017, May). Fetal ECG extraction from abdominal ECG using RLS based adaptive algorithms. In Carpathian Control Conference (ICCC), 2017 18th International (pp. 337-342). IEEE.
[11] Martinek, R., Vanus, J., Bilik, P., Stratil, T., & Zidek, J. (2016). An Efficient Control Method of Shunt Active Power Filter Using ADALINE. IFAC-PapersOnLine, 49(25), 352-357.
[12] Kahankova, R., Martinek, R., & Bilik, P. (2016, November). Non-invasive Fetal ECG Extraction from Maternal Abdominal ECG Using LMS and RLS Adaptive Algorithms. In International Afro-European Conference for Industrial Advancement (pp. 258-271). Springer, Cham.
[13] Martinek, R., Kahankova, R., Nazeran, H., Konecny, J., Jezewski, J., Janku, P., ... & Fajkus, M. (2017). Non-Invasive Fetal Monitoring: A Maternal Surface ECG Electrode Placement-Based Novel Approach for Optimization of Adaptive Filter Control Parameters Using the LMS and RLS Algorithms. Sensors, 17(5), 1154.
[14] Keogh, E. J., & Pazzani, M. J. (2001). Derivative dynamic time warping. In Proceedings of the 2001 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics.
[15] Keogh, E., & Ratanamahatana, C. A. (2005). Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3), 358-386.
[16] Keogh, E. J., & Pazzani, M. J. (2001, April). Derivative dynamic time warping. In Proceedings of the 2001 SIAM International Conference on Data Mining (pp. 1-11). Society for Industrial and Applied Mathematics.
[17] Salvador, S., & Chan, P. (2007). Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis, 11(5), 561-580.
[18] Berndt, D. J., & Clifford, J. (1994). Using dynamic time warping to find patterns in time series. In KDD Workshop (Vol. 10, No. 16).
[19] Narayanan, A., & Wang, D. (2014, May). Joint noise adaptive training for robust automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on (pp. 2504-2508). IEEE.
[20] Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Le Roux, J., Hershey, J. R., & Schuller, B. (2015, August). Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In International Conference on Latent Variable Analysis and Signal Separation (pp. 91-99). Springer, Cham.