IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 49, NO. 2, APRIL 2000

Speech/Voice-Band Data Classification for Data Traffic Measurements in Telephone-Type Systems

Luigino Benetazzo, Matteo Bertocco, P. Paglierani, and E. Rizzi

Abstract—A noise-insensitive technique for discriminating between speech and voice-band data transmissions over a telephone-type system is presented. The proposed procedure combines the evaluation of some meaningful parameters extracted from the observed speech or data communication signal with an analysis of the statistical moments associated with these parameters. In this way, the discrimination capability of the extracted parameters is reinforced by the moment analysis, which better exploits the time-varying properties of speech signals. Experimental results showing the good behavior of the classifier in terms of probability of error with respect to signal-to-noise ratio (SNR) are reported and discussed.

Index Terms—Communication system maintenance, communication system traffic, measurement, pattern classification, signal processing, telephony.

Manuscript received May 26, 1999; revised January 31, 2000. L. Benetazzo and M. Bertocco are with Dipartimento di Elettronica e Informatica, Università di Padova, A-35131 Padova, Italy. P. Paglierani is with Italtel, Settimo Milanese (MI), Italy. E. Rizzi is with Tektronix Padova S.p.A., 28-35127 Padova, Italy. Publisher Item Identifier S 0018-9456(00)03436-7.

I. INTRODUCTION

The widespread diffusion of digital technologies in communication systems and the increasing demand for high-quality links have brought the need for thorough and frequently performed characterizations of telecommunication networks. Recommendations issued by authoritative international organizations, such as the International Telecommunication Union (ITU), suggest assessing the degree of quality provided to the users when the lines are actually in use (nonintrusive measurements) [1], [2]. To this end, some significant performance parameters should be evaluated by suitably processing the speech signals observed during a telephone call [1], [2]. Such parameters have been designed by international committees with a speech signal in mind, in order to describe the speech signal itself or the physical impairments that disturb a human listener. For this reason, they describe very well the degree of quality of a voice communication, but they are not suitable for the case of data communication, where they provide little or even misleading information. An important issue is therefore the discrimination between speech and voice-band data transmissions.

A second fundamental application of speech/voice-band data discrimination arises when data traffic measurements are involved. In this case, telecommunication companies are interested in finding out the percentage of data communication out of the total. Such a measurement allows a better planning of new networks as well as the upgrade of already existing ones according to the actual traffic requirements. Other reasons make voice/data discrimination interesting to a telecommunication company. For instance, network providers may apply different tariff policies to voice-only or data-only links; in this case, the actual use of the network should be checked in order to invoice the user appropriately.

The problem of classifying telephone-type signals has been investigated by researchers and some techniques are available [3]–[5]. One should note that, in most cases, the accuracy of the quoted algorithms is strictly related to the time available for the decision and to the degree of distortion of the observed signal, caused by the typical impairments of the network (i.e., wide-band noise, impulsive noise, or echo). For instance, the real-time decision algorithm proposed in [3] exhibits a probability of wrong classification not better than 1% for a clean signal; other proposed techniques can reach a much better accuracy, but they are quite sensitive to the various impairments usually found in telecommunication lines. Unfortunately, the discrimination performance provided by the above techniques is unsatisfactory in many cases of practical interest. In particular, they can provide poor information when adopted to monitor long-distance communication circuits or modern cellular networks, because of the rather high degree of distortion that such systems can introduce.

In order to overcome these limitations, this paper presents a new measurement technique for the discrimination of a speech communication from a voice-band data communication under poor signal-to-noise ratio (SNR) conditions. The proposed procedure combines a pattern recognition approach, based on the evaluation of some parameters, with an analysis of the statistical moments associated with the parameters themselves.
In this way, the discrimination capability of the extracted parameters is reinforced by the moment analysis, which better exploits the time-varying properties of the voice signal. This technique has been shown to be insensitive to noise and to the other impairments found over the telecommunication network, and to provide good results in terms of probability of correct decision, even under low SNR conditions. In the next section, the algorithm for voice/data classification is presented, while in Section III experimental results are reported that show the performance of the algorithm.

II. VOICE/DATA CLASSIFICATION ALGORITHM

Modern telecommunication networks are designed in such a way as to easily provide test points at which a transmitted signal can be observed [10]. At these points, after very simple

0018–9456/00$10.00 © 2000 IEEE

Fig. 1. Block scheme for the speech/data discrimination algorithm.

manipulations, one can assume that a sampled sequence $x(n)$ is available [7]:

$$x(n) = s(n) + w(n) \tag{1}$$

where $s(n)$ is the information signal (speech or voice-band modulated data) and $w(n)$ represents a noise sequence, while the lag $n$ corresponds to the time $t_0 + nT$, where $T$ is the sampling period, equal to 125 µs, and $t_0$ is an arbitrary reference time that represents the beginning of the measurement.

The proposed classification procedure is summarized in Fig. 1. In the frame analysis block of Fig. 1, one record of the input signal having fixed duration is considered:

$$\mathbf{x}_m = [x(mN), x(mN+1), \ldots, x(mN+N-1)]^T \tag{2}$$

where the index $m$ denotes the $m$th analyzed frame, while $N$ has been chosen to be equal to 256 samples. This corresponds to an observation duration of 32 ms at a sampling period $T = 125$ µs. This duration ensures, for the case of a speech signal, a quasi-stationarity condition [8]. If the power of the considered $m$th frame is greater than a suitable low threshold, below which the frame is regarded as silence, then a set of parameters is extracted. The parameters are defined in Appendix A. They have been chosen from the ones described in [8], by selecting the subset that yields a good voice/data discrimination with a low reduction of information with respect to the full set [9].

In the moment analysis block of Fig. 1, the array of parameters $\mathbf{p}_m = [p_{1,m}, \ldots, p_{I,m}]^T$ obtained at the $m$th analysis frame is regarded as a realization of a random column vector $\mathbf{P} = [P_1, \ldots, P_I]^T$, where $P_i$ is the random variable associated with the $i$th parameter $p_i$. Thus, by using $M$ successive realizations of the random vector $\mathbf{P}$, i.e., by observing $M$ successive analysis frames, the central moments up to the order $K$ are estimated. Hence, the new vector $\mathbf{q}$ is formed:

$$\mathbf{q} = [\hat{\mu}_{1,1}, \ldots, \hat{\mu}_{1,K}, \ldots, \hat{\mu}_{I,1}, \ldots, \hat{\mu}_{I,K}]^T \tag{3}$$

where $\hat{\mu}_{i,k}$, $i = 1, \ldots, I$, $k = 1, \ldots, K$, is an estimate of the $k$th order central moment of the random variable $P_i$. It should be noticed that the vector $\mathbf{q}$ is evaluated once every $M$ frames of the input signal $x(n)$; this corresponds to a time lag of $M \cdot 32$ ms, which is equal to the time interval between two successive voice/data decisions. $M$ should be regarded as a design parameter. A greater value of $M$ yields a longer decision time and decreases the probability of wrong classification. A value of $M$ that has given good results at a reasonable time lag for the case of telephone-type communications corresponds to a measurement time of 6 s.
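The frame analysis and moment analysis blocks described above can be sketched as follows. This is only a minimal illustration under stated assumptions: frames are taken as non-overlapping, the silence threshold value is arbitrary, only three of the Appendix A parameters are computed (mean crossings of the signal and of its first difference, and the maximum magnitude), and the Allan variance is taken in its usual two-sample form. The function names `split_frames`, `frame_parameters`, `moment_vector`, and `analyse` are hypothetical and do not come from the paper.

```python
import numpy as np

FRAME_LEN = 256  # N: 256 samples, i.e. 32 ms at T = 125 us (from the text)


def split_frames(x, frame_len=FRAME_LEN):
    """Cut x into consecutive frames (non-overlap is an assumption)."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)


def frame_parameters(frame):
    """Toy parameter set: three of the quantities listed in Appendix A."""
    def mean_crossings(v):
        # Count sign changes of the signal around its sample mean.
        s = np.sign(v - v.mean())
        return int(np.count_nonzero(s[1:] * s[:-1] < 0))

    return np.array([mean_crossings(frame),
                     mean_crossings(np.diff(frame)),
                     np.max(np.abs(frame))], dtype=float)


def moment_vector(params):
    """params: (M, I) array, one row per frame.

    Returns mean, variance, skewness, kurtosis and Allan variance of each
    parameter, concatenated (the ordering is an assumption).
    """
    mean = params.mean(axis=0)
    centered = params - mean
    var = (centered ** 2).mean(axis=0)
    std = np.sqrt(np.where(var > 0, var, 1.0))  # avoid division by zero
    skew = (centered ** 3).mean(axis=0) / std ** 3
    kurt = (centered ** 4).mean(axis=0) / std ** 4
    # Two-sample (Allan) variance: half the mean squared successive difference.
    allan = 0.5 * (np.diff(params, axis=0) ** 2).mean(axis=0)
    return np.concatenate([mean, var, skew, kurt, allan])


def analyse(x, silence_power=1e-4):
    """Frame analysis followed by moment analysis on the non-silent frames."""
    frames = split_frames(x)
    active = frames[(frames ** 2).mean(axis=1) > silence_power]
    if len(active) < 2:
        raise ValueError("need at least two non-silent frames")
    params = np.array([frame_parameters(f) for f in active])
    return moment_vector(params)
```

With three parameters and five statistics per parameter, `analyse` returns a vector of 15 values for each observed stretch of signal.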

Fig. 2. Typical signals associated to (a) speech and (b) voice-band data, and respective third reflection coefficient tendency (c) and (d) after the frame analysis.
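Fig. 2(c) and (d) track the third reflection coefficient frame by frame. As a rough illustration of how such a coefficient can be obtained for one frame, the sketch below uses the Levinson-Durbin recursion, which is swapped in here for the Le Roux algorithm cited in Appendix A; both start from the autocorrelation of the frame and should yield the same coefficients up to numerical precision. The biased autocorrelation estimate, the sign convention, and the function names are assumptions made for this example.

```python
import numpy as np


def autocorrelation(frame, max_lag):
    """Biased short-time autocorrelation estimate r(0), ..., r(max_lag)."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([np.dot(frame[:n - lag], frame[lag:]) / n
                     for lag in range(max_lag + 1)])


def reflection_coefficients(frame, order=4):
    """Levinson-Durbin recursion; returns k_1, ..., k_order.

    Sign convention: prediction error e(n) = x(n) + sum_j a_j x(n - j).
    Other texts use the opposite sign for the coefficients.
    """
    r = autocorrelation(frame, order)
    a = np.zeros(order + 1)  # predictor polynomial, a[0] = 1
    a[0] = 1.0
    err = r[0]               # prediction error power
    k = np.zeros(order)
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: remaining k stay at 0
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k_i = -acc / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k_i * a_prev[i - 1:0:-1]
        a[i] = k_i
        err *= 1.0 - k_i ** 2
        k[i - 1] = k_i
    return k
```

Calling `reflection_coefficients(frame)[2]` on each 256-sample frame would give a sequence comparable in spirit to the traces of Fig. 2(c) and (d), one value every 32 ms.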

The third block of Fig. 1 (decision block) consists of a linear transformation followed by a threshold comparison. To design the voice/data classifier, a training set of telephone-type signals has been created that accounts for most of the types of signals encountered in practice. The training set has been built by recording utterances pronounced by different speakers, yielding a sufficiently large number of speech stretches representing a typical voice phone call. Moreover, all the main modulation techniques for data communication have been used, and for each of them different stretches of data signals have been considered. More precisely, the training set database contains:
• 1500 stretches of speech of different speakers and different languages, some recorded from the network (and thereby corrupted by typical impairments like noise and echo) and others recorded in a controlled environment (and therefore without impairments);
• 1500 stretches of data-type signals artificially generated or recorded from the network and having a bit rate between 2400 and 14 400 bit/s.

One should note that the total duration of the entire database is 6 h; this gives the reader a rough idea of the degree of significance of such a training set. Fig. 2(a) and (b) shows meaningful samples of signals associated with a voice communication and with a data communication, respectively; note in particular the different raw shape during the same time interval.

Once the above database of signals had been built, a large set of parameters (i.e., the ones found in [8]) was extracted from each frame, and the respective moments were evaluated, giving 3000 realizations of the vector $\mathbf{q}$. Then all the possible subsets of the extracted parameters (and thereby the subsets of the respective moments) were considered, and for each of them a Bayesian classifier was built according to the techniques found in [9].
The subset providing the best trade-off between probability of misclassification and number of extracted parameters has been selected. By using the same database described above, an evaluation of the moments associated with the chosen parameters has been performed. The 3000 realizations of the vector $\mathbf{q}$, obtained by the analysis of the training set, have been used for the design of a linear classifier. Standard techniques for such a design, found in Chapter 9 of [9] and in [10], have been applied. This has led to the determination of a linear transformation matrix $\mathbf{A}$ and a threshold $\gamma$ such that the scalar $d = \mathbf{A}\mathbf{q}$ and the vector $\mathbf{q}$ are equally effective in separating speech from data, and the probability of wrong classification $P_e$ of the classifier is minimized. Finally, in the decision block of Fig. 1 the index $d$ is compared with the threshold $\gamma$ obtained from the training set; this comparison gives the final voice/data decision [8].

One should note that the application of a Bayesian classifier that uses just the $p_i$ parameters instead of their moments would provide poor performance under low SNR conditions. On the other hand, a moment analysis applied directly to the input signal would provide poor results because of the complicated intrinsic structure of a speech or data signal under test. Only the use of both methods together has been seen to give a robust and accurate classification procedure. As a confirmation of this fact, Fig. 2(c) and (d) shows the typical values assumed by one of the parameters $p_i$, i.e., the third reflection coefficient, during the frame analysis for the case of voice and data, respectively. Recall that the $p_i$ are extracted from the observation of $N$ samples of the input signal $x(n)$, and hence the sampling interval of $p_i$ is 32 ms. By comparison, it can be easily seen that a better separability between the two classes of signals is achieved after the moment analysis, as reported in Table I.

TABLE I
MOMENT ANALYSIS OF THE SIGNALS REPORTED IN FIG. 2(a) AND (b)
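The decision block, a linear transformation followed by a threshold comparison, can be illustrated with a small sketch. The paper refers to standard design techniques from [9] and [10] without giving details, so the Fisher linear discriminant used below is an assumption chosen for illustration, as are the midpoint threshold (which presumes equal priors and costs), the ridge term, and the names `fit_fisher` and `classify`.

```python
import numpy as np


def fit_fisher(q_speech, q_data, ridge=1e-6):
    """Fit a projection vector a and a threshold gamma.

    q_speech, q_data: (n, d) arrays of moment vectors for the two classes.
    Fisher's discriminant is a stand-in for the design procedure of the paper.
    """
    m_s = q_speech.mean(axis=0)
    m_d = q_data.mean(axis=0)
    # Pooled within-class scatter, with a small ridge term for stability.
    s_w = np.cov(q_speech, rowvar=False) + np.cov(q_data, rowvar=False)
    s_w = np.atleast_2d(s_w) + ridge * np.eye(len(m_s))
    a = np.linalg.solve(s_w, m_s - m_d)
    # Midpoint of the projected class means (assumes equal priors and costs).
    gamma = 0.5 * (a @ m_s + a @ m_d)
    return a, gamma


def classify(q, a, gamma):
    """Compare the scalar index d = a . q with the threshold gamma."""
    return "speech" if float(a @ q) > gamma else "data"
```

In this sketch the projection is oriented so that speech vectors fall above the threshold; a design that minimizes the probability of wrong classification, as described in the text, would generally place the threshold using the estimated class distributions rather than the midpoint.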
For a better understanding, Table I provides the numerical values associated with the coefficients $\hat{\mu}_{i,k}$, where the index $i$ refers to the third reflection coefficient only and $k = 1, \ldots, K$, for the case of the telephone-type signals reported in Fig. 2(a) and (b). One should note that the $j$th stretch of Table I refers to the $j$th observation interval of Fig. 2(a) and (b).

III. EXPERIMENTAL RESULTS

Several specific tests have been performed in order to verify the significance of the chosen parameters and the behavior of the algorithm so tuned. To this end, a new test set of telephone-type signals has been formed, with the same characteristics as the training set described in Section II but uncorrupted by noise. Both computer-generated wide-band noise and sampled stretches of noise collected from the network have been successively added to the test signals. In this way it was possible to verify the performance of the procedure with respect to the amount of noise [8], [11]. In this context, the SNR is defined as the ratio between the mean power of a clean signal and the mean power of the superimposed distortion. In particular, the power level of the speech signals has been evaluated according to international recommendations [12], while for the data signals the power associated with the actual constellation has been used [6].

The voice/data classification procedure has then been verified under different SNR conditions, and the probability of wrong classification $P_e$ has been evaluated. In more than 10 decisions performed on the test set, with SNR in the range of 0–40 dB, the algorithm never gave a wrong speech decision when a data communication signal was transmitted, while cases of wrong data decision, when a speech signal was observed, have been noticed.

Fig. 3. Estimated PDF associated to the scalar index $d$.

Fig. 3 shows the estimated probability density functions (PDFs) associated with the scalar index $d$ for the considered test set of signals at an SNR value of 36 dB, which is a typical value for a telephone-type link. It should be noted that the two distributions are well separated and unimodal, confirming the capability of the index $d$ to represent the type of the considered signal and to give a correct decision with a low error probability.

Fig. 4. Performance of the classifier with respect to SNR.

Fig. 4 reports the behavior of $P_e$ as a function of SNR; it is seen that $P_e$ increases as the value of the SNR becomes lower.

This behavior can be justified by recalling that the noise added to the clean voice signal has characteristics similar to a data-type signal, in terms of stationarity and predictability, so that, when the noise level becomes comparable to the speech level itself, a wrong decision may be taken with a nonnegligible probability. One should notice that for SNR values lower than 20 dB a network link should be considered useless [13], which confirms the good behavior of the classifier. In any case, if one would like to obtain lower values of the probability of wrong classification $P_e$ at very bad SNR values, a rough estimation of the SNR [14] can be performed and the threshold adapted accordingly. Some further experiments indicated that $P_e$ could be decreased down to 6% under such conditions. Furthermore, it is recalled that by increasing the measurement duration (i.e., the parameter $M$), better results can be easily achieved.
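The noise-injection step used in the tests of this section, in which noise is added to a clean signal at a prescribed SNR, can be sketched as follows. The sketch takes the SNR as the plain ratio of mean powers, as defined in the text; it does not implement the active speech level measurement of [12], and the function names are hypothetical.

```python
import numpy as np


def add_noise_at_snr(clean, noise, snr_db):
    """Scale noise so that mean(clean^2) / mean(scaled noise^2) equals snr_db."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(clean):
        raise ValueError("noise record is shorter than the clean signal")
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


def measured_snr_db(clean, noisy):
    """SNR in dB between a clean signal and its noisy version."""
    clean = np.asarray(clean, dtype=float)
    err = np.asarray(noisy, dtype=float) - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(err ** 2))
```

Sweeping `snr_db` over the 0–40 dB range and counting wrong decisions at each value would produce an error curve of the kind reported in Fig. 4.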

IV. FINAL REMARKS

The presented algorithm has been implemented in a digital signal processor (DSP) based instrument. A 33 Mflop processor has been used, equipped with 1 megabyte of static RAM and 4 megabytes of dynamic RAM. Fewer than 20 kilobytes of RAM are required for storing signal records and other variables, while the code occupies approximately 15 kilobytes. With the adopted processor, up to six channels can be simultaneously monitored. This instrument is currently used by several telephone companies for the measurement of the percentage of voice and data communications over the total number of communications.

APPENDIX A
SET OF THE PARAMETERS ADOPTED IN THE FRAME ANALYSIS BLOCK

The parameters extracted by the frame analysis block of Fig. 1, and obtained after the observation of the $m$th frame of signal (1), are

$$\mathbf{p}_m = [k_3, k_4, r_2, r_3, z_x, z_{\Delta x}, x_{\max}]^T \tag{A1}$$

where $k_3$ and $k_4$ are, respectively, the third and fourth reflection coefficients provided by the Le Roux algorithm [16]; $r_2$ and $r_3$ are the estimates of the second and third coefficients of the autocorrelation function, which for the $m$th frame is defined as

(A2)

$z_x$ and $z_{\Delta x}$ are, respectively, the number of times that the signal and its first difference cross their sample mean value during the observed frame; and $x_{\max}$ is the maximum magnitude of the signal during the frame.

APPENDIX B
DESCRIPTION OF THE EVALUATED MOMENTS

By denoting with $p_{i,m}$ the $i$th element of the array $\mathbf{p}_m$ pertaining to the $m$th analysis frame, the ensemble moments of (3) are defined as

(B1)

(B2)

(B3)

where the estimated quantities are, respectively, the mean, variance, skewness, and kurtosis associated with the $i$th parameter, together with the Allan variance [11], [15], and $M$ is the number of analysis frames considered.

ACKNOWLEDGMENT

The authors wish to thank Tektronix Padova S.p.A., Monitoring and Protocol Test Product Line, for supporting this work.

REFERENCES

[1] "In-Service, Nonintrusive Measurement Device—Voice Service Measurements," International Telecommunication Union, Geneva, ITU-T Recommendation P.561, 1996.
[2] American National Standard for Telecommunication In-Service, Non-Intrusive Measurement Device (INMD) Voice Service Measurements.
[3] N. Benvenuto, "A speech/voiceband data discriminator," IEEE Trans. Commun., vol. 41, pp. 539–543, Apr. 1993.
[4] D. R. Irvin, "Voice/data detector and discriminator for use in transform speech coders," IBM Tech. Disclosure Bull., vol. 26, pp. 363–365, June 1983.
[5] C. Roberge and J. P. Adoul, "Fast on-line speech/voiceband-data discrimination for statistical multiplexing of data with telephone conversation," IEEE Trans. Commun., vol. COM-34, pp. 744–751, Aug. 1986.
[6] K. Feher, Telecommunications Measurements, Analysis, and Instrumentation. Englewood Cliffs, NJ: Prentice-Hall, 1987.
[7] "Pulse Code Modulation of Voice Frequencies," International Telecommunication Union, Geneva, ITU-T Recommendation G.711, 1993.
[8] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall, 1978.
[9] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1972.
[10] L. Benetazzo, M. Bertocco, C. Offelli, and P. Paglierani, "Speech/voice-band data classifier for data traffic measurement in telephone-type systems," in Imeko World Congr., vol. 1, June 1997, pp. 92–96.
[11] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 3rd ed. New York: McGraw-Hill, 1991.
[12] "Objective Measurements of Active Speech Level," International Telecommunication Union, Geneva, ITU-T Recommendation P.56, 1993.
[13] M. B. Carey, H. T. Chen, A. Descloux, J. F. Ingle, and K. I. Park, "1982/83 End office connection study: Analog voice and voice-band data transmission performance characterization of the public switched network," AT&T Bell Lab. Tech. J., vol. 63, no. 9, Nov. 1984.
[14] "Telephone Transmission Quality—Vocabulary and Effects of Transmission Parameters on Customer Opinion of Transmission Quality and their Assessment," International Telecommunication Union, Geneva, ITU-T Recommendation P.11, 1993.
[15] Calibration: Philosophy in Practice, 2nd ed. Fluke.
[16] J. Le Roux and C. Gueguen, "A fixed point computation of partial correlation coefficient," IEEE Trans. Acoustic, Speech, and Signal Processing, pp. 257–259, June 1977.
[17] J. Makhoul, "Linear prediction: A tutorial review," Proc. IEEE, vol. 63, pp. 561–580, Apr. 1975.

Luigino Benetazzo was born in 1938. He received the Laurea degree in electronic engineering (cum laude) in 1962 from Padova University, Padova, Italy. Since 1976, he has been teaching at the Faculty of Engineering as Full Professor of Electronic Measurement at Padova University. He has authored 100 papers. As President (1992–1997) of CSELT, the telecommunications research laboratory of Telecom Italia, Turin, he was involved in national and international research programs on advanced telecommunication services, in cooperation with leading telecommunication research centers all over the world. He was previously President (1987–1992) of NECSY, a company that designs and manufactures products for the telecommunication market in a joint venture with Hewlett-Packard Company. He is now responsible for a national research project on "Measurement systems based on complex architectures: Instruments, characterization, and qualification of measurement process."

Matteo Bertocco was born in Padova, Italy, in 1962. He received the Laurea degree in electronics engineering from the University of Padova, where he subsequently received the Ph.D. degree in electronics engineering in 1991. Since 1994, he has been a Researcher at the Department of Electronics and Informatics, University of Padova, becoming an Associate Professor of Electronic Instrumentation and Measurement in 1998. His research interests are in digital signal processing, estimation, automated instrumentation, and electromagnetic compatibility.

P. Paglierani was born in Pordenone, Italy, in 1965. He received the "Laurea" degree in electronic engineering and the Ph.D. degree in electronic instrumentation from the University of Padova, in 1992 and 1998, respectively. He is currently with Italtel S.p.A., Milan, Italy, where he is involved in the development of speech processing applications for telecommunication systems. Previously, he was with Necsy and with Tektronix Padova, where he designed digital instrumentation for use in telecommunication systems.

E. Rizzi was born in Cavizzana, Italy, in 1972. He received the "Laurea" degree in electronic engineering from the University of Padova, Padova, Italy, in 1997, and since January 1998, he has been pursuing the Ph.D. degree in electronic instrumentation and measurements. Recently, he joined Tektronix Padova, where he is involved in the development of digital signal processing applications and in the design of measurement equipment for quality assessment in telecommunication networks.
