Wavelet Based Speech Coding Using Orthogonal Matching Pursuit
Ramin Rezaiifar, Hamid Jafarkhani
Department of Electrical Engineering and Institute for Systems Research
University of Maryland, College Park, MD 20742
emails: ramin,[email protected]

ABSTRACT

A novel waveform approximation method called Orthogonal Matching Pursuit (OMP) [1] is exploited to represent a speech waveform with a set of complex coefficients. The OMP algorithm proves to be quite effective in approximating waveforms whose energy is concentrated in short ranges of frequency. The spectrum of a speech signal is usually well localized in frequency, and therefore OMP is a suitable tool for speech coding. The algorithm has been applied to a standard test signal.

I Introduction

Speech coding methods can be categorized into two major groups: those based on reconstructing the speech waveform (waveform coders), and those that build a speech production model to reproduce the speech (voice coders, or vocoders). Waveform coders, in turn, can be divided into two groups: (i) time-domain and (ii) spectral-domain waveform coders. The former takes advantage of the fact that speech is a semi-periodic signal, meaning that it is periodic over short time intervals, whereas the latter uses the localized properties of the speech signal in the frequency domain to code the waveform. In conventional waveform coding schemes, a unitary linear transform is used to convert a vector of source samples into another vector, often called the transform coefficients, and these coefficients are subsequently quantized. Almost all transform coding schemes try to achieve perfect reconstruction (disregarding the quantization error). Considering that the coefficients representing the waveform have to be quantized prior to transmission, and that quantization is by nature a lossy operation, it seems reasonable to relinquish perfect reconstruction in the transformation phase in exchange for further compression. In other words, since quantization will distort the reconstructed speech waveform, there is no reason to insist on a set of coefficients that leads to a perfect reconstruction of the original speech waveform. In a recent paper, Vetterli and Kalker [6] proposed a motion-compensated video coding scheme based on matching pursuit. In their method, at each iteration a vector is picked from the dictionary such that it optimizes the rate-distortion function.

The organization of the paper is as follows: Section II describes the overall proposed method, including a brief review of the OMP algorithm. Section III describes the quantizer operation and, finally, Section IV includes some of the results obtained by applying the method to a standard test signal.

II System Description

As with most existing speech coding algorithms, the procedure starts by splitting the signal into overlapping segments called frames. The length of each frame clearly has an effect on the processing delay and the general performance of the algorithm. Subsequently, each frame is multiplied by the trapezoidal window shown in Fig. 1.

[Figure 1. The trapezoidal filter used for windowing.]

The purpose of the windowing operation is to alleviate the discontinuity that may result from the concatenation of consecutive frames in the reconstruction phase. Since the energy of the speech signal is usually concentrated in a relatively small range of frequencies, it is appropriate to perform the coding procedure in the frequency domain, where the spectrum of the signal has compact support. In particular, if the sampling rate is 8 kHz, there is no meaningful frequency content above 4 kHz. The idea behind the proposed compression algorithm is to approximate each frame with a linear combination of
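The segmentation and windowing step can be sketched as follows. The 256-sample frames with 25% overlap match the parameters in Table 1, but the exact ramp length of the trapezoidal window is an assumption; it is chosen here equal to the overlap so that overlapping windows sum to one, giving distortion-free overlap-add reconstruction:

```python
import numpy as np

def trapezoidal_window(n, ramp):
    """Trapezoidal window: linear ramps of `ramp` samples on each side, flat middle."""
    w = np.ones(n)
    r = np.arange(1, ramp + 1) / (ramp + 1)  # chosen so up-ramp + down-ramp == 1
    w[:ramp] = r
    w[-ramp:] = r[::-1]
    return w

def split_into_frames(x, frame_len=256, overlap=0.25):
    """Split a signal into overlapping, windowed frames (values from Table 1)."""
    hop = int(frame_len * (1.0 - overlap))   # 192 samples for 25% overlap
    ramp = frame_len - hop                   # ramp length equals the overlap region
    w = trapezoidal_window(frame_len, ramp)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] * w for i in range(n_frames)])
```

With this ramp choice, the tail of one window and the head of the next sum exactly to one over the overlap region, so concatenating reconstructed frames introduces no amplitude ripple.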

fbM =

M X k=1

c k xn ;

(II.1)

k

0.8 0.7 0.6 0.5 0.4 0.3

is an approximation of f . The number M in (II.1), ranging from zero to N , is picked such that it results an approximation error less than a prede ned tolerance . That is, n

1 0.9

Modulus

prede ned building blocks picked from a set D := fxk gNk=1 , called the Dictionary. Let f denote the Fourier transform of a given frame of the original waveform. Using the orthogonal matching pursuit (OMP) algorithm [1], we represent f with a set of complex numbers fck ; k = 1; : : : ; M g such that fbM de ned by the equation

o

M = min n j kf ? fbn k   :

(II.2)

Note that the dictionary D does not comprise a basis for L2 and its elements are chosen such that the number of elements contributing in the approximation, fbM , is as small as possible. In this work we do not o er a systematic way for designing the dictionary D, and instead we choose the elements in D by inspecting the spectrum of some selected speech signals. Nevertheless, lack of having a systematic method to nd the \best" dictionary is not crucial to the OMP algorithm as long as D is designed to be \large" enough and possibly redundant. The OMP algorithm picks up the best (in the sense that the norm of the error is minimized) set of indices fnk ; ck ; k = 1; : : : ; M g such that the inequality (II.2) is satis ed. Considering the fact that speech is inherently a localized signal in the frequency domain, wavelet would be a good candidate for the dictionary elements. The entire approximation is being done in the frequency domain, therefore a building block, xk , is a waveform in frequency domain. If (!) is the mother wavelet, the wavelet m;n (!) is de ned by the following formula 2 m (II.3) m;n (!) = am= 0 (a0 ! ? nb0 ); where a0 and b0 are dilation and translation step sizes, respectively, and m and n are the so{called dilation and translation levels. The magnitude of some selected elements of the dictionary are illustrated in Fig. 2. By inspecting the spectrum of an arbitrary frame of speech, one realizes that the modulus of the spectrum at higher frequencies is smaller than the modulus of the spectrum at lower frequencies. The high frequency components, however, happen to have a signi cant contribution in the quality of speech and its intelligibility (especially for female voices). Therefore, a simple strategy is to amplify the signal in those higher frequencies before feeding the spectrum to the OMP block. This is essentially equivalent to changing the l2 norm that the OMP uses, into a weighted norm where higher frequencies have larger weighs. 
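As a concrete illustration of (II.3), the sketch below builds a dictionary of dilated and translated wavelets sampled on a frequency grid. The Mexican-hat mother wavelet and the parameter values (`a0`, `b0`, the level ranges) are assumptions made here for illustration; the paper selects its dictionary by inspecting speech spectra and does not specify a mother wavelet:

```python
import numpy as np

def mexican_hat(w):
    # Mexican-hat (Ricker) mother wavelet; a stand-in, since the paper
    # does not name its mother wavelet.
    return (1.0 - w**2) * np.exp(-w**2 / 2.0)

def build_dictionary(omega, a0=2.0, b0=1.0, m_levels=range(0, 5), n_per_level=16):
    """Dictionary of dilated/translated wavelets per (II.3):
       psi_{m,n}(w) = a0**(-m/2) * psi(a0**(-m) * w - n * b0)."""
    atoms = []
    for m in m_levels:
        for n in range(n_per_level):
            atom = a0 ** (-m / 2.0) * mexican_hat(a0 ** (-m) * omega - n * b0)
            norm = np.linalg.norm(atom)
            if norm > 1e-12:              # drop atoms with no support on the grid
                atoms.append(atom / norm)  # unit-norm atoms, as OMP expects
    return np.stack(atoms)
```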
The spectral shaping of the quantization noise can be used to take advantage of the noise perception of the human auditory system for speech. Fig. 3 illustrates the effect of the pre-emphasis.

[Figure 2. Modulus of some selected elements of the dictionary.]

[Figure 3. Pre-emphasizing puts more weight on higher frequencies (upper panel: without pre-emphasis; lower panel: with pre-emphasis).]

The block diagram of the proposed method is depicted in Fig. 4. After the segmentation and windowing operations, the Fourier transform of the signal is taken and then passed through a pre-emphasis block that boosts the higher-frequency components. The resulting spectrum constitutes the input to the OMP algorithm. The OMP block generates two sets of numbers for each frame: {c_k}, the waveform coefficients, and {n_k}, the corresponding indices. The coefficients are then normalized according to the energy of the frame and fed to the input of the quantizer block. Since it is crucial to transmit the indices {n_k} in a lossless manner, we use a Huffman coding scheme to compress this set of numbers. The coefficients {c_k} are compressed using a DPCM coding scheme. Frames with lower energy are less important and, therefore, may be coded with a higher error. Thus, we adaptively select the tolerance level in the OMP block according to the frame energy to further reduce the transmission bit-rate.

[Figure 4. Block diagram of the proposed method: speech is segmented, windowed, and transformed (FFT); the pre-emphasized spectrum f(ω) is fed to OMP; the coefficients {c_k} are normalized and quantized, the indices {n_k} are Huffman coded, and the 8-bit quantized frame energy is sent as side information over the channel.]

For completeness, we present the OMP algorithm in detail here. There are some computational details that make the implementation of the algorithm more efficient; we do not repeat them, as they are discussed in [1].
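The front end of the encoder (windowed frame, FFT, pre-emphasis) can be sketched as below. The linear gain ramp above a knee frequency is an assumed pre-emphasis shape, since the paper describes the high-frequency boost only qualitatively; `f_knee` and `max_gain` are hypothetical parameters:

```python
import numpy as np

def preemphasize(spectrum, freqs, f_knee=2000.0, max_gain=3.0):
    """Boost high-frequency bins before OMP; equivalent to running OMP under a
       weighted l2 norm. The linear ramp above f_knee is an assumed shape."""
    gain = 1.0 + (max_gain - 1.0) * np.clip(
        (freqs - f_knee) / (freqs[-1] - f_knee), 0.0, 1.0)
    return spectrum * gain

def analyze_frame(frame, fs=8000.0):
    """Encoder front end: one-sided FFT of a windowed frame, then pre-emphasis."""
    spectrum = np.fft.rfft(frame)                 # 0..4 kHz at fs = 8 kHz
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return preemphasize(spectrum, freqs), freqs
```

At the decoder, the inverse weighting would be applied before the inverse FFT and overlap-add.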

2.1. The OMP Algorithm

Before presenting the algorithm and the main convergence theorem, we need to introduce some notation. Let V be the space spanned by the elements of the dictionary D, and let V^{\perp} be the orthogonal complement of V. At the nth iteration, OMP has picked n elements from the dictionary. With a slight abuse of notation, we represent the span of the selected elements at the nth iteration by V_n. Note that V_n does not necessarily represent the span of the first n elements of the dictionary because, as will be shown later in this section, OMP may reorder the dictionary. Furthermore, let P_V f denote the projection of f \in L^2 onto the subspace V. The following theorem states the main property of OMP.

Theorem II.1 [1]. For f \in L^2, let R^k f be the difference between f and the approximation obtained by OMP at the kth iteration; that is, R^k f = f - f_k. Then
(1) ||R^N f - P_{V^{\perp}} f|| = 0;
(2) f_n = P_{V_n} f, n = 0, 1, 2, ....

It is worth mentioning that the major difference between OMP and the Matching Pursuit (MP) algorithm proposed by Mallat and Zhang in [7] is that OMP converges after N steps (recall that N is the number of elements in the dictionary), whereas MP only guarantees convergence after an infinite number of steps. That is, for MP we have

    \lim_{k \to \infty} ||R^k f - P_{V^{\perp}} f|| = 0.

The OMP algorithm is as follows:

Initialization: f_0 = 0, R^0 f = f, D_0 = {}, x_0 = 0, a_0^0 = 0, k = 0.

(I) Compute { <R^k f, x_n> : x_n \in D - D_k }.

(II) Find x_{n_{k+1}} \in D - D_k such that

    |<R^k f, x_{n_{k+1}}>| \ge \alpha \sup_j |<R^k f, x_j>|,  0 < \alpha \le 1.

(III) If |<R^k f, x_{n_{k+1}}>| < \delta (\delta > 0), then stop.

(IV) Reorder the dictionary D by applying the permutation k+1 <-> n_{k+1}.

(V) Compute {b_n^k}_{n=1}^{k} such that

    x_{k+1} = \sum_{n=1}^{k} b_n^k x_n + \gamma_k,  and  <\gamma_k, x_n> = 0, n = 1, ..., k.

(VI) Set

    a_{k+1}^{k+1} = \alpha_k = ||\gamma_k||^{-2} <R^k f, x_{k+1}>,
    a_n^{k+1} = a_n^k - \alpha_k b_n^k,  n = 1, ..., k,

and update the model:

    f_{k+1} = \sum_{n=1}^{k+1} a_n^{k+1} x_n,  R^{k+1} f = f - f_{k+1},  D_{k+1} = D_k \cup {x_{k+1}}.

(VII) Set k <- k + 1 and repeat (I)-(VII).
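The steps above can be sketched in a few lines of code. For clarity, this sketch replaces the recursive coefficient update of steps (V)-(VI) with a direct least-squares projection onto the selected atoms, which computes the same f_k = P_{V_k} f but less efficiently than the update in [1]; it also fixes \alpha = 1 and exposes the stopping threshold \delta as `tol`:

```python
import numpy as np

def omp(f, D, tol=1e-6, max_atoms=None):
    """Orthogonal Matching Pursuit sketch.
       D: (N, L) array of unit-norm atoms; f: length-L target (real or complex)."""
    residual = f.astype(complex)
    selected, coeffs = [], None
    max_atoms = max_atoms or D.shape[0]
    while len(selected) < max_atoms:
        inner = D.conj() @ residual                     # step (I): <R^k f, x_n>
        inner[selected] = 0.0                           # restrict to D - D_k
        n_next = int(np.argmax(np.abs(inner)))          # step (II), with alpha = 1
        if np.abs(inner[n_next]) < tol:                 # step (III): stopping rule
            break
        selected.append(n_next)                         # step (IV): record the atom
        A = D[selected].T                               # atoms chosen so far
        coeffs, *_ = np.linalg.lstsq(A, f, rcond=None)  # steps (V)-(VI): f_k = P_{V_k} f
        residual = f - A @ coeffs                       # R^{k+1} f = f - f_{k+1}
    return selected, coeffs, residual
```

Because each iteration re-projects f onto the span of all selected atoms, the residual stays orthogonal to every chosen atom, which is exactly property (2) of Theorem II.1.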

III Quantization

Fig. 5 illustrates a general block diagram for a DPCM system. We have used an L-level uniform threshold quantizer (UTQ) in which the threshold levels are chosen uniformly, as illustrated in Fig. 6. The quantizer is designed for a unit-variance Gaussian source, and the centroid of each interval is used as its reconstruction level. It is shown in [2] that the performance of the UTQ followed by an entropy coder is almost the same as that of the optimum scalar quantizer when there is no constraint on the number of quantization levels. In practice, a fixed number (depending on the rate) can be chosen for L without a major degradation in performance. Another fact shown in [2] is that L should be an odd number in order to achieve rates of less than one bit/sample. Also, it is proven in [3] that, under certain mild constraints on the source pdf, the following hold in the high bit-rate region:

    d_i(r_i) = k_i 2^{-2 r_i},    (III.1)
    r_i = h_i - \log_2 \Delta_i,    (III.2)

where h_i is the first-order differential entropy of the source, k_i is a source-dependent constant, r_i is the rate, \Delta_i is the quantizer step size, and d_i(r_i) is the resulting distortion.
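A sketch of such a UTQ for a unit-variance Gaussian source: thresholds are uniformly spaced about zero, and each reconstruction value is the centroid E[X | a < X <= b] of its cell under N(0,1), which has the closed form (phi(a) - phi(b)) / (Phi(b) - Phi(a)). The step size is left as a free parameter:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def _phi(x):   # standard normal pdf
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)

def _Phi(x):   # standard normal cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def utq(levels, step):
    """Uniform threshold quantizer for a unit-variance Gaussian source.
       `levels` should be odd (as in [2]); the two outer cells are unbounded."""
    assert levels % 2 == 1
    t = (np.arange(levels - 1) - (levels - 2) / 2.0) * step   # inner thresholds
    edges = np.concatenate(([-np.inf], t, [np.inf]))
    recon = np.array([(_phi(a) - _phi(b)) / (_Phi(b) - _Phi(a))
                      for a, b in zip(edges[:-1], edges[1:])])  # cell centroids
    def quantize(x):
        return recon[np.searchsorted(t, x, side="right")]
    return quantize, recon
```

Since `levels` is odd, the middle cell straddles zero and its centroid is zero, which is what makes sub-1-bit/sample entropy-coded rates reachable.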

Assuming the same distribution for all K coefficients, the overall distortion is

    D = \sum_{i=1}^{K} d_i(r_i) = \sum_{i=1}^{K} k_i 2^{-2 r_i}.    (III.3)

The objective is to minimize D subject to the constraint that the total number of bits is R; that is, \sum_{i=1}^{K} r_i = R. To solve the problem at hand, we use a Lagrange multiplier as follows. First, we define

    J(\lambda) = \sum_{i=1}^{K} k_i 2^{-2 r_i} + \lambda \sum_{i=1}^{K} r_i.    (III.4)

Taking the derivative with respect to r_j, or equivalently with respect to \Delta_j, and setting it to zero yields

    2 k'_j \Delta_j - \lambda (\log_2 e) / \Delta_j = 0  for all j,    (III.5)

where k'_j := k_j 2^{-2 h_j}. Assuming the same distribution for all coefficients, (III.5) results in

    \Delta_j = const.  for all j.    (III.6)

This means that the same step size should be used for all coefficients.
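Equation (III.6) also follows from the convexity of 2^{-2r}: for identical sources, any unequal rate allocation with the same total R yields at least as much distortion as the uniform one. A quick numeric check of this claim (the budget R = 12 bits over K = 4 coefficients is illustrative):

```python
import numpy as np

def total_distortion(rates, k=1.0):
    # D = sum_i k * 2**(-2 r_i), per (III.3), with identical sources (k_i = k)
    return float(np.sum(k * 2.0 ** (-2.0 * np.asarray(rates))))

R, K = 12.0, 4
uniform = np.full(K, R / K)
rng = np.random.default_rng(1)
for _ in range(1000):
    perturbed = uniform + 0.5 * rng.standard_normal(K)
    perturbed += (R - perturbed.sum()) / K   # re-impose the constraint sum r_i = R
    # the uniform allocation is never beaten (Jensen's inequality)
    assert total_distortion(uniform) <= total_distortion(perturbed) + 1e-12
```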

[Figure 5. A general block diagram for DPCM: the prediction error is quantized and entropy coded; the decoder entropy decodes and adds back the prediction.]

[Figure 6. The threshold levels T0, ..., T5 of a 5-level UTQ.]

IV Results

We applied our speech coding algorithm to a standard test signal that contains PCM recordings of "phoneme-specific" sentences [4]. The original recordings were made in a quiet room using a linear electret microphone. They were low-pass filtered at 3.9 kHz at the time of recording. Subsequently, they were digitally filtered to conform with the fsvs analog filter specifications. The specifications of the filter and some other information about the original signal are as follows:

Flat response between 125 and 3500 Hz; above 3500 Hz, roll-off of -5 dB/100 Hz
Sampling frequency: 8 kHz
Resolution: 16 bits
Equipment: DSC-200
Source: Indian Hill

To assess the effectiveness of a speech coding algorithm, one should take three factors into account: (1) bit rate (kbits/s); (2) complexity (average number of operations required to code one second of speech); (3) quality of speech. Measuring the quality of the reconstructed speech is a subjective matter and hard to quantify. A popular measure of speech quality is the mean opinion score (MOS), which is a number between zero and five. Speech quality is usually classified into the following four groups [5]: (1) broadcast quality: high-quality speech; (2) toll quality: public telephone quality; (3) communication quality: highly intelligible speech with relatively low quality; (4) synthetic quality: artificial-sounding speech. The result of the proposed method, as with most waveform coding schemes, fits into the second category. A spectrogram is a way to visualize the frequency content of a signal over time; it represents the magnitude of the short-term Fourier transform of the speech. Figures 7 and 8 depict the time-domain waveforms and the spectrograms of the original and reconstructed signals. By inspecting the spectrograms, we see that the algorithm successfully rebuilds those areas of the original spectrogram that have the most energy (the darker islands in Fig. 8). The resulting bit-rate (after compression), together with the parameters used for the simulation, is shown in Table 1. A bit-rate of 16 kbits/s is considered a lower limit for the existing standards that achieve toll quality, whereas our algorithm achieves a bit-rate of 6.56 kbits/s with the parameter settings of Table 1.

V Conclusion

We used a successive approximation method called Orthogonal Matching Pursuit to represent a speech waveform as a linear combination of a collection of prototype waveforms. Since the spectrum of the speech signal has compact support, it seems reasonable to perform the approximation in the frequency domain. Moreover, because the speech waveform has most of its energy concentrated around short ranges of frequency (it is localized in the frequency domain), wavelets can serve as a suitable set of prototype waveforms from which to build a dictionary. Once the approximation is completed, a set of coefficients and their corresponding indices (which identify elements of the dictionary) can be transmitted through the channel to the receiver. The algorithm changes the approximation tolerance adaptively and reduces the bit-rate for portions of the speech that are less significant in terms of their energy contents. This paper only reflects preliminary results, and work is in progress to further increase the efficiency of the method. Also, fine tuning of the parameters is likely to improve the performance.

[Figure 7. The reconstructed signal and the original signal.]

[Figure 8. Spectrogram of the reconstructed signal (lower) together with that of the original signal (upper).]

Table 1. Simulation Parameters and Results.
Sampling Rate: 8 kHz
Frame Length: 32 ms
No. of samples per frame: 256
Overlap between frames: 25%
Size of Dictionary: 1024
No. of Bits per Index: 5.11
No. of Bits per Coeff.: 5.52
Avg. No. of OMP Coeff.'s per sec.: 350.5
Operating Bit Rate: 6.56 kbits/s
Original Bit Rate: 96 kbits/s
Compression Ratio: 14.6

References

[1] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, "Orthogonal Matching Pursuit: Recursive Function Approximation with Applications to Wavelet Decomposition," Proceedings of the 27th Annual Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Nov. 1993.
[2] N. Farvardin and J. W. Modestino, "Optimum Quantizer Performance for a Class of Non-Gaussian Memoryless Sources," IEEE Trans. Inform. Theory, Vol. IT-30, No. 3, pp. 485-497, May 1984.
[3] H. Gish and J. N. Pierce, "Asymptotically Efficient Quantizing," IEEE Trans. Inform. Theory, Vol. IT-14, No. 6, pp. 676-683, Sep. 1968.
[4] Huggins and Nickerson, "Speech Quality Evaluation Using 'Phoneme-Specific' Sentences," JASA, Vol. 77, No. 5, pp. 1896-1906, May 1985.
[5] M. Yacoub, Foundations of Mobile Radio Engineering, CRC Press, 1993.
[6] M. Vetterli and T. Kalker, "Matching Pursuit for Compression and Application to Motion Compensated Video Coding," Proceedings of the IEEE International Conference on Image Processing, Austin, TX, pp. 725-729, Nov. 1994.
[7] S. Mallat and Z. Zhang, "Matching Pursuits with Time-Frequency Dictionaries," IEEE Trans. on Signal Processing, Vol. 41, No. 12, pp. 3397-3415, Dec. 1993.