A NOVEL OUTPUT-BASED OBJECTIVE SPEECH QUALITY MEASURE FOR WIRELESS COMMUNICATION

O.C. Au*, K.H. Lam

Department of Electrical and Electronic Engineering
Hong Kong University of Science and Technology
Clear Water Bay, Kowloon, Hong Kong, China
Tel: +852 2358-7053 Fax: +852 2358-1485 Email: [email protected]

ABSTRACT

In our previous papers, we studied many input-to-output objective speech quality measures, some of which achieved high correlation when used to predict subjective mean opinion scores (MOS) of real cellular phone speech samples. Input-to-output measures have two problems: the input must be available, which is almost never the case in the cellular phone situation, and the input must be accurately synchronized with the output. Output-based measures, which do not need the input, are thus highly desirable. In this paper, we propose two output-based objective speech measures which are based on visual features of the spectrogram. In our experiment, one measure, OBM1, achieves a correlation of 0.65, which is higher than most input-to-output measures and is close to the 0.73 achieved by the best input-to-output measure.

1. INTRODUCTION

Objective measures have been the focus of much research for over two decades. Most of the existing objective measures are input-to-output measures, which estimate speech quality by measuring the distortion between an "input" and an "output" speech record and mapping the distortion values to the estimated quality. A little-explored problem is the estimation of transmission quality using only the output speech, without access to the "input". This class of measures is called output-based measures (OBM).

While most work on objective speech quality measures has focused on speech coding applications, this work is primarily focused on objective speech quality measures for the cellular phone environment. For cellular phone network operators, there is always a great need to assess regularly the speech quality of the cellular phone network, which is gradually changing due to new subscribers and changes in existing user patterns.

In our previous work [2], we studied the performance of many input-to-output measures such as signal-to-noise ratios, LPC-based measures, spectral distance measures, and other psychoacoustic speech measures. Some measures, such as Bark spectral distance and Mel spectral distance, were found to be better than others when used to estimate the mean opinion score (MOS) of the speech quality of the cellular network. One problem we encountered was that, in these input-to-output measures, time alignment between the input and output speech vectors is of paramount importance in order to compute a meaningful distance measure. In reality, perfect synchronization is difficult to achieve due to fading or bursty errors in the wireless environment. Manual synchronization was used in [2], but it was extremely slow and costly. Automatic synchronization must be used to be realistic, but would inevitably lower the correlation, that is, the effectiveness of the measures in estimating the MOS. Output-based measures, on the other hand, have the advantage that no synchronization is needed. In addition, consumers can make judgements on the speech quality of the "output" speech without knowing the "input" speech. It is thus more natural to use an OBM rather than an input-to-output measure to estimate the subjective speech quality.

In this paper two simple and effective OBMs are proposed based on the visual effect of the spectrogram of the distorted speech. A spectrogram is a two-dimensional representation of the time-varying spectrum in which the vertical axis represents frequency and the horizontal axis represents time. The spectrum magnitude is represented by the darkness of the marking on the graph. The spectrogram contains rich acoustic and phonetic information about the speech signal. According to early work done on speech spectrograms, it has been established that most of the underlying phonetic information can be recovered by visually inspecting the spectrogram. An experienced spectrogram reader can correctly identify close to 90% of phonetic segments by visual examination [4]. Using spectrograms, one could work in the image domain rather than in the conventional audio domain. The spectrogram is thus a potentially good tool for objective speech quality measurement. The advantage of the proposed OBM will be demonstrated by comparing it with input-to-output measures using automatic synchronization.

2. CONSTRUCTION OF OBM

Here are some observations on the spectrograms of typical "good" and "bad" output sentences collected from a wireless phone. Here "good" means that the MOS (Mean Opinion Score) of the sentence is close to 5, and "bad" means that its MOS is close to 1:

• Usually, the spectrogram of "good" speech exhibits a fine, sharp and disjoint line structure which represents the harmonics in the speech.

• The spectrogram of "bad" speech is just the opposite. It lacks periodicity in the frequency dimension. The spectrogram of the part which is contaminated by noise or in heavy Rayleigh fading exhibits a uniform texture with no line features. This is expected because the harmonics are now corrupted by noise or fading and become invisible. The uniform texture is due to noise, which affects all frequencies.

Two speech samples with MOS of 4.14 and 1.07 are shown in Figures 1 and 2.

Figure 1: Example from a Cantonese sentence, MOS=4.15
[Waveform and spectrogram panels; axes: Time (s), Frequency]

Figure 2: Example from a Cantonese sentence, MOS=1.07
[Waveform and spectrogram panels; axes: Time (s), Frequency]

Having made such observations, the problem now is how to convert these features into a number which represents the objective speech quality. Inspired by the idea of block-by-block processing in digital image processing, the spectrogram is partitioned into blocks. Two parameters, variance and dynamic range, are computed from each block. The final result is obtained by averaging over all the blocks. The proposed output-based measures (OBM) are:

\[ \mathrm{OBM1} = \frac{1}{N} \sum_{i=1}^{N} \left[ \max(\mathrm{block}_i) - \min(\mathrm{block}_i) \right] \tag{1} \]

\[ \mathrm{OBM2} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Variance}(\mathrm{block}_i) \tag{2} \]

where N is the total number of blocks partitioned from the spectrogram. The dynamic range and variance are observed to be large in the blocks of "good" sentences, which have discrete and dominant features in the spectrogram. The dynamic range and variance are usually small in the blocks of "bad" sentences, which contain noise-like features resulting in a somewhat uniform distribution of energy throughout a given block.

The choice of the size of the block is crucial. If the size is too small, each block will display a uniform pattern and the discrete feature is lost. A reasonable choice of size is made based on the following assumptions:

• Speech is quasi-stationary within a period of 10 to 30 ms. The features in that interval of time can be viewed as stationary. Thus a choice of the horizontal size of the block in this range is reasonable, so that the main feature is retained in each block.

• On the frequency axis, the important features are the harmonics, which appear as equally spaced lines. These harmonics are due to periodic vibration of the vocal cords. Human pitch varies from 50 to 500 Hz. Therefore the vertical size of the block should be in this range, so that each block will probably contain one harmonic.

Blocks of different sizes were tested. A block size which produces good correlation with MOS is 15 ms by 468 Hz. One advantage of this measure is that it does not require intensive computation to evaluate. In addition, no synchronization is needed. In this paper, speech is sampled at 8 kHz with 16-bit resolution. The spectrogram is obtained by dividing the speech signal into non-overlapping frames of length 256 and applying a 256-point FFT.
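The block-based measures described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the spectrogram is taken as an already-computed matrix `spec[frame][bin]` of magnitudes, blocks are non-overlapping tiles, incomplete tiles at the edges are dropped, and the population variance is used. None of these details is specified in the text. With 256-point frames at 8 kHz each frequency bin is 31.25 Hz wide, so a 468 Hz block height is about 15 bins. How the 15 ms block width maps to frame columns is not stated, so the block width in frames is left as a parameter. The function names are hypothetical.

```python
from statistics import pvariance


def partition_blocks(spec, block_frames, block_bins):
    """Split spec[frame][bin] into non-overlapping tiles.

    Each tile is returned as a flat list of its values. Partial tiles at
    the right and top edges are discarded (an assumption of this sketch).
    """
    n_frames = len(spec)
    n_bins = len(spec[0]) if spec else 0
    blocks = []
    for t0 in range(0, n_frames - block_frames + 1, block_frames):
        for f0 in range(0, n_bins - block_bins + 1, block_bins):
            blocks.append([spec[t][f]
                           for t in range(t0, t0 + block_frames)
                           for f in range(f0, f0 + block_bins)])
    return blocks


def obm1(blocks):
    """Average dynamic range (max minus min) over all blocks."""
    return sum(max(b) - min(b) for b in blocks) / len(blocks)


def obm2(blocks):
    """Average variance over all blocks (population variance assumed)."""
    return sum(pvariance(b) for b in blocks) / len(blocks)


# Toy spectrograms (made-up values): one with sharp, equally spaced
# "harmonic" lines, one with a uniform texture.
harmonic = [[10.0 if f % 4 == 0 else 0.0 for f in range(8)] for _ in range(4)]
flat = [[5.0] * 8 for _ in range(4)]

harmonic_blocks = partition_blocks(harmonic, block_frames=2, block_bins=4)
flat_blocks = partition_blocks(flat, block_frames=2, block_bins=4)

print(obm1(harmonic_blocks), obm2(harmonic_blocks))
print(obm1(flat_blocks), obm2(flat_blocks))
```

The toy input with line structure scores higher on both measures than the uniform one, which matches the qualitative behaviour described for "good" and "bad" sentences.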

3. EXPERIMENTAL RESULTS

As shown in Fig. 3, the original undistorted speech signal is played from a DAT walkman to a cellular phone via a car kit interface box. The whole setup is installed in a van which moves along designated routes. Simultaneously, in the laboratory, the distorted speech signal is recorded from a normal telephone to a DAT deck via an interface circuit. Four routes are selected to yield various signal distortions due to the wireless channel, including low power, high power, multipath fading and slow/fast fading.

After the recording, the distorted speech signal is transferred to a computer through an A/D converter with 16-bit resolution and a sampling frequency of 8 kHz. For detailed study, 160 distorted sentences were selected with a mixture of good and bad quality. This experiment was done in Hong Kong, a city where the majority of the population speaks Cantonese, a Chinese dialect. As a result, the speech database used in the experiment consists of phonetically balanced Cantonese sentences with the four basic Cantonese syllable types V, C1V, C1VC2 and VC2, where C is a consonant and V is a vowel or diphthong. Each sentence is about 20 syllables long and encompasses all nine tones of Cantonese.

A survey was conducted to collect the subjective MOS for the 160 sentences. Meanwhile, the proposed OBM1 and OBM2 were computed for each of the 160 distorted sentences. The previously studied input-to-output measures were also computed, with the input and output waveforms synchronized automatically by aligning at the point with the highest cross-correlation. Statistical correlation analysis between the MOS and the objective speech measures was performed. The results are shown in Table 1.

OBM1 and OBM2 achieve correlations of 0.65 and 0.41 respectively, which suggests that OBM1 is better than OBM2. Compared with the input-to-output measures, the correlation of OBM1 is higher than that of most of them. It is remarkably close to the 0.73 of Bark spectral distance (BSD) and the 0.71 of Mel spectral distance (MSD). This suggests that the OBM1 objective measure, based on measuring the dynamic range of blocks within the spectrogram, is effective in reflecting the MOS of the speech quality in the cellular phone environment. It also has the advantage of a low computation requirement. Most importantly, it requires neither the input signal nor synchronization between the input and output signals.

4. CONCLUSION

In this paper, we propose two output-based objective speech quality measures which are based on the visual features of the spectrogram of the output signal. In our experiment, one of the output-based measures, OBM1, which computes the dynamic range of the blocks within the spectrogram, was found to have outstanding performance. It achieves a correlation of 0.65, which is higher than most input-to-output measures and is very close to the highest correlations of 0.73 and 0.71, achieved by BSD and MSD respectively.

5. REFERENCES

[1] S.R. Quackenbush, T.P. Barnwell, M.A. Clements, "Objective Measures of Speech Quality", Prentice Hall, 1988.
[2] K.H. Lam, O.C. Au, C.C. Chan, K.F. Hui and S.F. Lau, "Objective Speech Quality Measure for Cellular Phone", ICASSP, May 1996.
[3] P.C. Ching, et al., "From Phonology and Acoustic Properties to Automatic Recognition of Cantonese", ISSIPNN, pp. 127-132, Apr. 1994.
[4] Mathew J. Palakal and Michael J. Zoran, "Feature Extraction from Speech Spectrograms using Multi-Layered Network Models", IEEE International Workshop on Tools for Artificial Intelligence, Architectures, Languages and Algorithms, pp. 224-230, 23-25 Oct. 1989.
[5] Jin Liang and Robert Kubichek, "Output-based objective speech quality", 1994 IEEE 44th Vehicular Technology Conference, pp. 1719-23, vol. 3.

[Block diagram. Data Collection: DAT Walkman (Original Sentences), Car Kit, Cellular Phone, Base Station, Telephone, DAT Deck (Distorted Sentences). Data Processing: TDT A/D; Synchronize, Scaling and Objective Measures; Survey and MOS; Correlation Coefficient.]

Figure 3: Block Diagram for the experiment
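The "Synchronize" stage in Figure 3 applies only to the input-to-output measures, which were aligned at the point of highest cross-correlation. The following Python sketch shows one way such an alignment could be done. It is an illustration under assumptions, not the procedure used in the experiment: it uses the raw (unnormalized) cross-correlation, searches a bounded lag range given by the hypothetical parameter `max_lag`, and trims both signals to their overlapping part. The function names are also hypothetical.

```python
def best_lag(reference, degraded, max_lag):
    """Return the delay (in samples) of `degraded` relative to `reference`
    that maximizes the raw cross-correlation, searched over
    -max_lag..+max_lag. Ties resolve to the smallest lag."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(reference):
            m = n + lag
            if 0 <= m < len(degraded):
                score += x * degraded[m]
        if score > best_score:
            best, best_score = lag, score
    return best


def align(reference, degraded, lag):
    """Trim both signals so that sample k of one corresponds to sample k
    of the other, given the delay found by best_lag."""
    if lag >= 0:
        ref, deg = reference, degraded[lag:]
    else:
        ref, deg = reference[-lag:], degraded
    n = min(len(ref), len(deg))
    return ref[:n], deg[:n]


# Toy signals (made-up values): the output is the input delayed by 3 samples.
reference = [0, 0, 1, 2, 3, 0, 0, 0]
degraded = [0, 0, 0, 0, 0, 1, 2, 3, 0, 0]

lag = best_lag(reference, degraded, max_lag=5)
ref_aligned, deg_aligned = align(reference, degraded, lag)
print(lag, ref_aligned, deg_aligned)
```

A single global lag like this cannot follow the time-varying misalignment caused by fading or bursty errors, which is one reason automatic synchronization is expected to lower the correlation of input-to-output measures.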

Table 1: Some of the input-output based objective measures and the proposed output based measure

Objective Speech Measures                               Correlation with MOS

SNR-based measure
  SNR                                                   0.15
  Segmental SNR                                         0.52
  Frequency Variant SNR                                 0.45
  Frequency Variant Segmental SNR                       0.61

LPC-based measure
  LPC                                                   0.32
  Log LPC                                               0.47
  Area Ratio                                            0.33
  Log Area Ratio                                        0.35
  PARCOR                                                0.51
  Log PARCOR                                            0.21
  Log Likelihood Ratio                                  0.49
  Weighted Likelihood Ratio                             0.24
  Line Spectral Pair                                    0.47
  Cepstral Distance                                     0.51
  Combination of CD and weighted Likelihood Ratio       0.51

Spectral Distance Based Measure
  Spectral Distance                                     0.50
  Energy Weighted Spectral Distance                     0.43
  Frequency Weighted Spectral Distance                  0.46
  Auditory Frequency Weighted Spectral Distance         0.55
  Log Spectral Distance                                 0.60
  Energy Weighted Log Spectral Distance                 0.57
  Frequency Weighted Log Spectral Distance              0.65
  Auditory Frequency Weighted Log Spectral Distance     0.59
  Inverse Spectral Distance                             0.51
  Inverse Log Spectral Distance                         0.56

Psychoacoustically Motivated Measure
  Mel Spectral Distance                                 0.71
  Bark Spectral Distance                                0.73

Other Measure
  Information Index                                     0.64
  Coherence Function                                    0.62

Output Based Measure
  OBM1                                                  0.65
  OBM2                                                  0.41
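Each value in Table 1 summarizes how well one objective measure tracks the per-sentence MOS. The paper does not name the statistic beyond "correlation coefficient", so the sketch below assumes the Pearson coefficient. It also returns the signed value, whereas the paper does not say whether the table reports the sign or the magnitude. The function name and the sample data are made up for illustration and are not taken from the experiment.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical per-sentence scores, NOT data from the paper.
mos = [1.0, 2.0, 3.0, 4.0, 5.0]
quality_like = [2.0, 4.0, 6.0, 8.0, 10.0]   # rises with MOS
distance_like = [5.0, 4.0, 3.0, 2.0, 1.0]   # falls as MOS rises

r_quality = pearson(mos, quality_like)
r_distance = pearson(mos, distance_like)
print(r_quality, r_distance)
```

A distance-type measure is expected to fall as MOS rises, so its signed coefficient would be negative. Comparing magnitudes, as the all-positive entries of Table 1 suggest, puts distance-type and quality-type measures on the same footing.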
