
VARIABLE RATE VECTOR QUANTIZATION OF THE SPEECH SPECTRAL ENVELOPE

Stan McClellan
Electrical & Computer Engineering, Univ. Alabama-Birmingham

Jerry D. Gibson
Dept. of Electrical Engineering, Texas A&M University

Abstract - Effective rate variation during active speech is a necessary component of sophisticated variable rate speech compression schemes. Here, we use open-loop estimates of spectral shape to roughly determine signal bandwidth and bit allocations for variable rate encoding of spectral parameters. We analyze the application of the relative entropy functional to sets of Line-Spectrum Pairs (LSPs) and the transform-based generalized spectral distributions of [1]. We present experimental results demonstrating that the relative entropy of these quantities can be used to good advantage in developing variable rate vector quantization schemes for the spectral envelope of speech signals.

INTRODUCTION

Recently, spectral entropy has been proposed as a different indicator of spectral information content and coefficient rate [1]. Here, we combine previous results which use subband spectral flatness measures for time-domain speech segmentation [2,3] with a different application of the concept of spectral distance. This approach produces encoding cues which allow for the efficient allocation of rate in both the time and frequency domains.

The information theoretic functional relative entropy [4] is a convenient indicator of distance, since it produces a measure of the difference between a target distribution and a source distribution. The usual entropy functional can be derived as a special case of relative entropy where the source distribution is assumed to be uniform, and this case is of particular interest in waveform segmentation [1-3,5]. Thus, the use of relative entropy on appropriately normalized spectral data can be helpful in describing the flatness of the spectrum with respect to an average energy level, or in determining the evolution of nonstationary spectral representations. For example, by dividing the normalized spectrum into upper and lower halfbands and applying the entropy functional to each halfband, we can derive an instantaneous indicator of useful bandwidth. This technique can be applied recursively to further refine the estimate. These halfband indications can be used to reduce encoding rate in the context of a scalar coder by dynamically changing the sampling rate of the signal [3] and in the context of a vector coder by changing the allocation of rate for spectral parameters.

The domain on which the (relative) entropy functional operates can be motivated by several considerations. We have used the entropy functional on normalized DCT coefficients of the splitband speech signal (“spectral entropy”) [2,3]. Others have used a similar approach on DCT coefficients of orthogonal elementary waveforms [5]. Here, we also apply the entropy functional to the set of line spectral frequencies [6].

This work was supported, in part, by NSF Grant No. 9303805, by the Texas Advanced Technology Program under Project No. 999903-017, and by a UAB Faculty Research Grant.

LINE SPECTRAL PAIRS

Efficient representation of the short-term spectral envelope is an important part of most speech coding architectures, and it is an area in which variable rate speech coding techniques can be applied with good success. A common parametric approach to envelope modelling is the use of autoregressive spectral estimation, or Linear Predictive Coding (LPC) techniques. The LPC parameters have been shown to be subjectively meaningful for speech coding in many practical systems and theoretical developments [7]. For a given model order ($m$), linear predictive analysis of a time series results in a model which can be described by an all-pole filter,

$$H(z) = \frac{1}{A_m(z)} = \frac{1}{1 + \sum_{i=1}^{m} a_i z^{-i}} \qquad (1)$$

where the parameters $a_i$ are commonly referred to as the LPC coefficients. An alternative representation of the LPC parameters is the set of Line Spectral Frequencies (LSFs) or Line Spectral Pairs (LSPs) introduced by Itakura [6]. LSPs have convenient properties relating to well-behaved dynamic range and preservation of filter stability and can be used to encode LPC

0-7803-3088-9/96/$4.00 © 1996 IEEE

spectral information more efficiently than other representations. The development of LSP theory depends on a basic understanding of the lossless acoustic tube model of the human vocal tract for voiced speech. Under certain conditions, the transfer function of an $m$-segment vocal tube can be reduced to an all-pole model with characteristic equation as in (1) [8]. Invoking boundary conditions at the lips and glottis which correspond to the completely lossless case produces two polynomials $P_{m+1}(z)$ and $Q_{m+1}(z)$ that have three important properties [9-11]:


1. The roots of $P_{m+1}(z)$ and $Q_{m+1}(z)$ lie on the unit circle since they represent the transfer function of the vocal tube under completely lossless conditions.

2. The roots of $P_{m+1}(z)$ and $Q_{m+1}(z)$ alternate on $[0, \pi]$.

3. The characteristic polynomial, $A_m(z) = \frac{1}{2}\left[P_{m+1}(z) + Q_{m+1}(z)\right]$, is guaranteed to be minimum phase as long as the previous properties are maintained.

Figure 1: Estimated AR spectrum and LSPs (horizontal axis: frequency in kHz).
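The three properties can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the common construction $P_{m+1}(z) = A_m(z) + z^{-(m+1)}A_m(z^{-1})$ and $Q_{m+1}(z) = A_m(z) - z^{-(m+1)}A_m(z^{-1})$, an even model order, and a hypothetical 10th-order filter built from five pole pairs chosen only for illustration. The function name `lpc_to_lsf` is likewise an assumption.

```python
import numpy as np

def lpc_to_lsf(a):
    """Return (lsf_p, lsf_q, p, q) for A(z) = sum_k a[k] z^-k with a[0] = 1.

    lsf_p and lsf_q are the sorted root angles in (0, pi) of P and Q.
    Assumes an even model order m and a minimum-phase A(z).
    """
    a = np.asarray(a, dtype=float)
    m = len(a) - 1
    if m % 2 != 0:
        raise ValueError("this sketch assumes an even model order")
    ext = np.append(a, 0.0)   # A(z) padded to order m + 1
    rev = ext[::-1]           # coefficients of z^-(m+1) * A(1/z)
    p = ext + rev             # symmetric polynomial P
    q = ext - rev             # antisymmetric polynomial Q
    # For even m, P has a trivial root at z = -1 and Q at z = +1; divide them out.
    p_red, _ = np.polydiv(p, [1.0, 1.0])
    q_red, _ = np.polydiv(q, [1.0, -1.0])

    def upper_angles(poly):
        roots = np.roots(poly)
        return np.sort(np.angle(roots[roots.imag > 0]))

    return upper_angles(p_red), upper_angles(q_red), p, q

# Hypothetical 10th-order minimum-phase A(z) built from five pole pairs.
radii = np.array([0.95, 0.90, 0.85, 0.90, 0.80])
angles = np.array([0.3, 0.8, 1.4, 2.0, 2.6])
poles = radii * np.exp(1j * angles)
a = np.poly(np.concatenate([poles, poles.conj()])).real

lsf_p, lsf_q, p, q = lpc_to_lsf(a)
lsf = np.sort(np.concatenate([lsf_p, lsf_q]))   # the m line spectral frequencies

# Property 1: all roots of P and Q lie on the unit circle.
on_circle = (np.allclose(np.abs(np.roots(p)), 1.0, atol=1e-6)
             and np.allclose(np.abs(np.roots(q)), 1.0, atol=1e-6))
# Property 2: the roots of P and Q alternate along the frequency axis.
tagged = sorted([(w, "p") for w in lsf_p] + [(w, "q") for w in lsf_q])
interlaced = all(tagged[i][1] != tagged[i + 1][1] for i in range(len(tagged) - 1))
# Property 3: A(z) is recovered as the average of P and Q.
recovered = np.allclose((p + q) / 2.0, np.append(a, 0.0))
```

Dividing out the trivial roots before calling `np.roots` avoids having to filter angles near 0 and $\pi$ by tolerance.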

Distribution of Resonances

The roots of $P_{m+1}(z)$ and $Q_{m+1}(z)$ can be expressed in terms of normalized frequency, and so $\omega = (\omega_1, \ldots, \omega_m)$ is called the set of Line Spectral Frequencies (LSFs) or Line Spectral Pairs (LSPs). The LSPs can be interpreted as the resonant frequencies of the vocal tract under the extreme conditions at the glottis. In fact, LSP frequencies tend to cluster along the frequency axis near peaks in the spectral envelope, and so they can be interpreted as a representation of an all-pole filter by means of the location density of $m$ discrete frequencies [11].

Fig. 1 shows the LPC spectral envelope (autocorrelation method) for a 10th order model of a 20 msec segment of speech. Overlaid on the figure are lines representing LSP locations for this model, where dotted lines denote the root locations of $P(z)$ and dashed lines denote roots of $Q(z)$. The evenly spaced markers correspond to a uniform distribution of LSPs, which would indicate a flat (white noise) spectrum.

The interlacing property of the roots of $P(z)$ and $Q(z)$, coupled with the normalization of the sampling frequency, provides an interpretation of the resonances of the spectral envelope as a generalized probability distribution, and the set of reciprocal differences $1/\Delta\omega_j$ between successive LSP locations corresponds to a generalized probability mass function (gpmf). With this interpretation, application of the entropy functional [4]

$$H(p) = -\sum_{j} p_j \log p_j$$

to the gpmf of vocal tract resonances is possible and has interesting perceptual interpretations. Higher values of the “line-spectral entropy” indicate a flatter spectrum, and lower values indicate a textured spectrum. Fortunately, “flat” spectra are not interesting in speech processing, whereas “lumpy” spectra are of particular importance. The relative entropy between two pmfs is defined by

$$D(p \,\|\, q) = \sum_{j} p_j \log \frac{p_j}{q_j} \qquad (2)$$
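A minimal sketch of these two functionals applied to LSP-derived gpmfs follows. The normalization is an assumption, since the text does not spell it out: the reciprocal spacings are scaled to sum to one, and the lower band edge (0 for the fullband) is prepended so that $m$ LSPs give $m$ mass points. The function names and the two example LSP sets are hypothetical.

```python
import numpy as np

def lsf_to_gpmf(lsf, lower_edge=0.0):
    """Normalized reciprocal spacings 1/delta_omega_j (assumed normalization)."""
    lsf = np.asarray(lsf, dtype=float)
    spacing = np.diff(np.concatenate([[lower_edge], lsf]))
    recip = 1.0 / spacing
    return recip / recip.sum()

def entropy(p):
    """H(p) = -sum_j p_j log p_j, in nats."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

def relative_entropy(p, q):
    """D(p || q) = sum_j p_j log(p_j / q_j), in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

m = 10
# Evenly spaced LSPs (flat spectrum) versus clustered LSPs (textured spectrum).
flat = np.pi * np.arange(1, m + 1) / (m + 1)
peaky = np.array([0.25, 0.30, 0.60, 0.66, 1.20, 1.28, 1.90, 2.00, 2.60, 2.75])

p_flat = lsf_to_gpmf(flat)
p_peaky = lsf_to_gpmf(peaky)

h_flat = entropy(p_flat)                    # maximal, log(m), for uniform spacing
h_peaky = entropy(p_peaky)                  # lower for clustered LSPs
d_self = relative_entropy(p_peaky, p_peaky)       # zero for identical pmfs
d_white = relative_entropy(p_peaky, p_flat)       # distance from whiteness
```

When the second argument is uniform over $m$ points, $D(p\,\|\,q) = \log m - H(p)$, which the last line can be used to confirm numerically.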

Considering the pmfs in (2) to be derived from LSPs (as generalized pmfs), the relative entropy can provide some indication of the similarity between two spectral envelopes. This leads to some interesting interpretations for the selection of optimal paths to minimize distortion and detection of change-points in speech waveforms.

Figure 2: Relative Entropy (horizontal axis: samples, 0 to 14000).

Consider Fig. 2. This figure contains time-series plots of fullband/halfband distance measures. The upper plots (H) are the spectral entropies defined in [1,3]. The middle plots are the line-spectral entropies defined in [3]. These curves have been used to determine regions of halfband energy concentration and V/UV regions. The lower plots are first-order relative entropies for each subband as in (2) with gpmfs defined by framewise LSP sets. The values are computed from sets of LSPs for consecutive frames (denoted $\omega_i$), or $D(p(\omega_i)\,\|\,p(\omega_{i-1})) = D(p_i\,\|\,p_{i-1})$, for the fullband and two halfbands. In this case, relative entropy provides a measure of stationarity for the AR process estimates which have been derived from local segments of speech data. Since the relative entropy attains its minimum value for two pmfs which are identical, a value of $D(p_i\,\|\,p_{i-1})$ which is small indicates a slowly varying spectrum, whereas large values indicate a rapidly changing spectral envelope. Note the large region of low values of relative entropy in the lower plots of Fig. 2, around 10,000 to 13,000 samples. In this segment of speech (375 msec or 17-18 frame times at 20 msec per frame), the spectral envelope is roughly stationary as corroborated via spectrograms. Here, significant rate savings are available in a system architecture which exploits variable frame-rates for encoding spectral parameters [12].

As is shown in the figure, this relative entropy measure can be applied to any subset of elements of $\omega$ to determine the rate of evolution of that group of resonances. Also, if we assume for each $i$ that the spectrum of the current ($i$th) frame has evolved in one frame time from complete whiteness,

$$D(p_i \,\|\, p_{i-1}) = \log m - H(p_i) \qquad (3)$$

since $p_{i-1}$ is the uniform distribution of $m$ resonances. So, the line-spectral entropy (middle curves in Figure 2) can be seen as a particular interpretation of relative entropy which measures the spectral evolution with respect to whiteness at each frame time. This measure has immediate subjective interpretation, and can be applied easily in a system architecture which uses fixed-frame-rate coding methods with variable bit allocation per frame for spectral parameters. The following sections discuss a representative system of this type.

Quantization of LSPs

Many of the properties of LSPs, such as ordering, independence, and dynamic range, have been examined closely [9,11,13-15] in the quantization of LPC parameters. The LSP representation tends to be quite robust for various quantization schemes, and the ordering property provides a convenient stability check for the synthesis filter. Vector quantization is especially appealing for LSP sets due to the high correlation between neighboring spectral lines and the intuitive spectral interpretation. Unfortunately, quantization with a single (full-search) codebook requires some 20-40 bits. Some of the most interesting results which address the problem of implementing LSP VQs relate to quantization of independent subsets of LSP vectors which correspond


roughly to upper and lower halfbands [14]. This technique has been called “split-vector quantization” (sVQ) and can achieve transparent spectral quantization with a fixed 24 bits/frame. Here, “transparent” is defined by conditions on the average spectral distortion (SD) for a large set of LPC test vectors [14]. In two-band sVQ, the lower 4 LSPs and upper 6 LSPs are quantized with independent 12-bit codebooks using a heuristically weighted Euclidean distance measure (see [14] for details). This distance measure penalizes distortion surrounding spectral peaks more heavily than distortion in valleys (which is subjectively valid) and produces transparent quantization at a rate lower than that which can be achieved with the usual Euclidean distance. During training of the codebooks, the weighted distance measure is used for clustering similar LSP vectors in the training set, but the usual definition of the centroid (consistent with Euclidean distance) is used to compute the representative code vector for each cell [14].

Variable Rate Split VQ
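As background for the codebooks used in this section, the training procedure described at the end of the previous section (clustering with a weighted distance, centroids computed as ordinary means) can be sketched as a Lloyd-style iteration. This is an illustration under stated assumptions, not the procedure of [14]: the weights here are one fixed vector rather than the per-vector heuristic weights of [14], the training data are random sorted vectors standing in for LSP sets, and the names `nearest` and `train_codebook` are hypothetical. Setting the weights to zero outside the low 4 LSPs restricts the clustering to that subset while each centroid still averages all 10 LSPs.

```python
import numpy as np

def nearest(vectors, codebook, weights):
    """Index of the closest code vector under a weighted squared distance."""
    diff = vectors[:, None, :] - codebook[None, :, :]
    return np.argmin(np.sum(weights * diff ** 2, axis=2), axis=1)

def train_codebook(train, init, weights, n_iter=20):
    """Lloyd-style training: weighted-distance cells, full-vector mean centroids."""
    codebook = np.array(init, dtype=float)
    for _ in range(n_iter):
        labels = nearest(train, codebook, weights)
        for k in range(len(codebook)):
            members = train[labels == k]
            if len(members):                  # leave empty cells unchanged
                codebook[k] = members.mean(axis=0)
    # labels are the cell memberships used for the final centroid update
    return codebook, labels

rng = np.random.default_rng(0)
# Stand-in training set: 200 sorted 10-dimensional vectors in (0, pi).
train = np.sort(rng.uniform(0.0, np.pi, size=(200, 10)), axis=1)
# Distance restricted to the low 4 LSPs by zero weights elsewhere.
w = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

codebook, labels = train_codebook(train, train[:8], w)
```

Because each centroid is a mean of ordered vectors, the code vectors remain ordered, so the LSP stability check still applies to every entry.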

Since the LSPs conform to the definition of a distribution function, a split-band entropy criterion on the unquantized LSPs can be used to bypass the highband codebook search (upper 6 LSPs) for vectors with limited high frequency texture. This approach provides an interesting implementation which combines the robust quantization properties of LSPs, the flatness measures of information theory, and a subband decomposition of the spectral representation to produce a variable allocation, fixed-frame-rate spectral quantizer. In this technique, for a low frequency codebook containing fullband LSP information¹, the low order codebook can be searched with a distortion measure of higher dimension (wider bandwidth) to improve the high frequency spectral match without seriously compromising the fidelity of the low frequency representation.

¹ The weighted distance measure can be focused, or restricted to a particular subset of LSPs, by adopting zero-valued weights for LSPs outside the subset of interest. In this case, the clustering process involves distance computations which use a subset of each vector, whereas the centroid computation (including accumulation) still involves the entire vector of LSPs. This allows for entries in a sVQ codebook to carry average LSP information for the entire frequency range of the training vectors in the cluster, although the characteristics of the cluster are determined by the limited-dimension distance measure.

To illustrate the effect of distortion measure bandwidth in the codebook search, Fig. 3 shows quantized and unquantized representations of a 10th order LSP spectrum. Two of the quantized envelopes used a single 12-bit codebook which was clustered using a 4th-order distortion measure on the low 4 LSPs (narrowband), and searched using a distortion criterion classified as either narrowband (low 4 LSPs) or wideband (low 8 LSPs). Also shown in the figure is the envelope quantized with fullband 24-bit sVQ as in [14]. The high frequency spectral texture of Fig. 3 indicates a lack of information, so a separate 12-bit encoding of the upper 6 LSPs is not necessary. Instead, the residual high frequency information contained in a vector from a low frequency optimized codebook (which has been trained with fullband information) can be used for a slightly degraded representation of the fullband spectral envelope. This results in an instantaneous 12-bit savings in quantization. For example, note that the envelopes for the 12-bit narrowband and 12-bit wideband distortion measures are identical in Fig. 3, and that a 24-bit sVQ encoding reduces the low frequency SD by only 0.6 dB on [0, 1000] Hz. Also note that the minor high frequency distortion (around normalized frequency 0.25) incurred by either 12-bit encoding is more than 30 dB down from the level of the lowband spectral peak. Thus, a “flatness” criterion can be effectively used for bit allocation in a variable rate VQ scheme for the LSP representation.

The use of the entropy functional is convenient for describing the flatness of distributions since entropy is related to the discrimination of a target distribution with respect to the uniform distribution. The use of this functional on appropriately normalized spectral data is helpful in describing the flatness of the spectrum with respect to an average energy level. Thus, a split-band, entropy-based criterion is a natural choice for a procedure which selectively bypasses the high frequency sVQ codebook search to achieve a variable rate scheme for encoding LSPs. In sVQ, the high frequency codebook is designed to minimize distortion on the 6 high order LSPs. A measure of necessity for the bits required by the high frequency codebook indices can be obtained by computing the subband entropy from the 6 high order LSPs of the unquantized spectrum.
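The high-band measure of necessity just described can be sketched as follows. This is an illustrative reading, not the authors' code: the gpmf is again taken as normalized reciprocal spacings, the spacing of the first high-band LSP is measured from the last low-band LSP, the entropy is divided by the log of the number of mass points so that it lies in [0, 1], and the search is bypassed when that value exceeds a threshold. The function names, the threshold value and both example LSP sets are hypothetical.

```python
import numpy as np

def normalized_band_entropy(lsf, start, stop):
    """Entropy of the reciprocal-spacing gpmf of lsf[start:stop], scaled to [0, 1]."""
    lsf = np.asarray(lsf, dtype=float)
    lower = lsf[start - 1] if start > 0 else 0.0   # assumed lower band edge
    spacing = np.diff(np.concatenate([[lower], lsf[start:stop]]))
    p = 1.0 / spacing
    p /= p.sum()
    return float(-np.sum(p * np.log(p)) / np.log(len(p)))

def lsf_bits(lsf, threshold, n_low=4, low_bits=12, high_bits=12):
    """Bits spent on one LSP vector: the high-band codebook is skipped if flat."""
    h = normalized_band_entropy(lsf, n_low, len(lsf))
    return low_bits if h > threshold else low_bits + high_bits

# Same low band, evenly spaced high band (flat) versus clustered high band.
flat_high = np.array([0.25, 0.30, 0.60, 0.66, 1.06, 1.46, 1.86, 2.26, 2.66, 3.06])
textured_high = np.array([0.25, 0.30, 0.60, 0.66, 1.20, 1.23, 1.90, 1.93, 2.60, 2.63])

threshold = 0.85   # hypothetical operating point
h_flat = normalized_band_entropy(flat_high, 4, 10)
h_textured = normalized_band_entropy(textured_high, 4, 10)
bits_flat = lsf_bits(flat_high, threshold)          # high-band search bypassed
bits_textured = lsf_bits(textured_high, threshold)  # both codebooks searched
```

Raising the threshold makes the bypass rarer and pushes the average rate toward 24 bits/frame, while lowering it pushes the rate toward 12 bits/frame.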
Codebook information from the high frequency optimized codebook is unnecessary for spectra whose entropies are high since a high value of entropy indicates a flat spectrum. In this case, residual high frequency information contained in the low frequency optimized sVQ codebook may be adequate for fullband quantization. The average fullband, low frequency, and high frequency SD incurred in such a variable rate sVQ scheme are shown in Figures 4 and 5 for a range of entropy thresholds for the high frequency codebook. The entropy values were normalized according to the size of the high frequency LSP alphabet, and the coding distortion was computed as an ensemble average of SD for a large collection of LSP vectors. These vectors were encoded with our variable rate sVQ scheme at each of

Figure 3: 12-bit VQ of lowpass segment, lowpass distortion measure. (Curves: unquantized; 12-bit lowpass-optimized VQ, searched with 4th-order lowpass or wideband distortion; 24-bit split-VQ. Horizontal axis: normalized frequency.)

each entropy threshold, the SD per-vector was computed for the frequencies spanned by the low 4 and high 6 LSPs as well as for all 10 LSPs. Frequency-weighted distortion measures of dimension 5, 8, or 10 were used in searching the low frequency codebook² for vectors where the entropy threshold for the high 6 LSPs was exceeded (i.e., the high frequency spectral envelope was sufficiently “flat”). The separate curves in the figures demonstrate the behavior of SD versus entropy threshold for the various distortion measures. Note that the entropy threshold is equivalent to average coding rate since it represents a “flatness tolerance level” for requiring either a 12-bit or a 24-bit formant encoding.

Note from Fig. 5 that the 8-dimensional distortion criterion (labeled 8-dim) maintains low frequency performance very close to that of the 5-dimensional case (labeled 5-dim) for all entropy thresholds. The 10-dimensional criterion (labeled 10-dim), however, sacrifices low frequency performance to improve the fullband match. Further, note that the 8-dim approach produces average SD which is better than the 5-dim SD in the high frequency band. The enhanced performance of the 8-dim approach is due to the wider bandwidth of the distortion measure. Thus, the 8-dim “wideband” criterion achieves a low frequency distortion comparable to the 5-dim “narrowband” approach and much better than the 10-dim approach. The 8-dim criterion also greatly improves the high frequency SD over that achieved by the 5-dim approach.

² The codebook was clustered by minimizing frequency weighted distortion for the low 4 LSPs, but centroid computations (accumulation, averaging) used information from all 10 LSPs.

Figure 4: Fullband spectral distortion vs. entropy. (Horizontal axis: entropy.)

Figure 5: Splitband spectral distortion vs. entropy. (Horizontal axis: entropy; curves labeled by distortion measure dimension.)

Table 1: Codebook (CB) and distortion configurations for split-VQ.

System type   Total Rate   Low CB           High CB
                           bits   dist.     bits   dist.
Fixed         12-bit       12     10-D      N/A    N/A
Fixed         18-bit       9      4-D       9      6-D
Fixed         24-bit       12     4-D       12     6-D
Variable      fullband     12     4-D       12     6-D
Variable      halfband     12     8-D       N/A    N/A

EXPERIMENTAL RESULTS

We designed codebooks for the variable 12/24-bit, fixed 24-bit and fixed 18-bit sVQ of 10th order LSPs as shown in Table 1. These codebooks were used in a variable rate CELP coder to obtain the following objective and subjective results. During encoding, the CELP excitation gain and long-term spectral parameters were not quantized. Each LPC computation (20 msec intervals) had 2 lag updates and 4 pitch coefficient updates, as well as 4 searches through the excitation codebook. The excitation codebook contained 256 center-clipped Gaussian vectors. The entropy values used in the variable rate 12/24-bit scheme were normalized to [0,1], and high frequency codebook search was bypassed for spectra with entropies greater than the threshold. Thus, only 12 bits were needed to encode these spectra. Based on the observations pertaining to the lowband and fullband SD versus entropy (Fig. 4 and Fig. 5), we used a peak weighted distortion measure of dimension 8 for searching the low frequency codebook during frames where high frequency encoding was not required.

Simulation results for our CELP system for a male and female speaker are shown in Table 2 for a small range of high frequency sVQ entropy thresholds. Also shown are results for fixed-rate 24-bit, 18-bit, and 12-bit fullband sVQ encoding, and for unquantized LSPs. The results for segmental SNR and the female speaker indicate that the variable rate sVQ is objectively equivalent to the fixed 18-bit sVQ at an entropy threshold of around 0.85, which corresponds to an average rate of around 19 bits per frame. Equivalent SEGSNR for the male speaker occurs around an entropy threshold of 0.78, which corresponds to an average rate of about 16 bits per frame. For the usual SNR, 18-bit equivalence occurs close to 0.81 (16.9 bits/frame) for the female, and around 0.79 (16.2 bits/frame) for the male. The 18-bit equivalence for frequency-weighted SNR occurs around 0.8 (16 bits) for the female and around 0.8 (17 bits) for the male. These rows are indicated in the tables.

However, objective performance measures are often misleading in speech processing research, and in assessing the performance of complex and non-standard techniques, subjective measures are sometimes easier to interpret. To complete our analysis, we performed an extensive paired comparison test between the CELP-coded sequences with the fixed rate 18-bit sVQ and the variable rate sVQ for several high frequency entropy thresholds. Untrained participants chose which component of the pair “sounded better” and these results were compiled based on a percentage of responses favorable to variable rate sVQ. Results of these comparisons are shown in Table 3.
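The average rates quoted above follow directly from the fraction of frames whose high frequency search is bypassed, since each frame costs either 12 or 24 bits. The short sketch below makes that arithmetic explicit. The function names are hypothetical, and the bypass fractions it prints are implied by the quoted rates rather than reported in the paper.

```python
def average_rate(bypass_fraction, low_bits=12, high_bits=12):
    """Mean bits/frame when a given fraction of frames skips the high-band codebook."""
    return low_bits + high_bits * (1.0 - bypass_fraction)

def implied_bypass_fraction(avg_rate, low_bits=12, high_bits=12):
    """Fraction of 12-bit frames implied by a given average rate."""
    return 1.0 - (avg_rate - low_bits) / high_bits

# Average rates quoted in the text, in bits/frame.
quoted_rates = [19.0, 16.9, 16.2, 16.0]
implied = {r: implied_bypass_fraction(r) for r in quoted_rates}

for rate, frac in implied.items():
    print(f"{rate:4.1f} bits/frame -> {100 * frac:.1f}% of frames coded with 12 bits")

# Matching the fixed 18-bit scheme in average rate means bypassing half the frames.
rate_at_half = average_rate(0.5)
```

An average below 18 bits/frame therefore means that more than half of the frames were judged flat enough in the high band to be coded with the low frequency codebook alone.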
In most cases, the variable rate sVQ speech quality was at least comparable to the quality of the speech reconstructed using the 18-bit codebook since the variable rate scheme used 24 bits in wideband segments (segments with significant high frequency texture), whereas a maximum of 18 bits were available in the fixed-rate scheme. 50% equivalence between fixed-rate 18-bit sVQ and variable rate 12/24-bit sVQ seems to occur around an entropy threshold of 0.8 (16.4 bits) for the female and around 0.8 (17 bits) for the male speaker. These rows are highlighted in the table and the equivalent rates are similar to the objective results. Significant degradation occurs in variable rate sVQ when a very low entropy threshold is used since variable rate sVQ tends toward a 12-bit, fixed-rate, single codebook scheme where the codebook is optimized for the low 4 LSPs. Thus, the low frequency optimized codebook is used for a larger proportion of the frames, including those having significant high frequency texture.

Table 2: [caption and cell values not recoverable from the source] Columns: system (fixed rate, variable rate), entropy threshold, and b/fr, SNR, SNRSEG and SNRfw for the female and the male speaker.

Table 3: Var [caption truncated] (rows recoverable from the source)

Speaker   Entropy Thr   Avg. Rate   sVQ Preference                   Overall
                                    (sVQ, 18-bit)   (18-bit, sVQ)
Female    0.87          20.3        75%             90%              83%
Female    0.85          19.1        60%             80%              70%
Male      0.88          21.6        40%             85%              63%
Male      0.85          19.6        40%             85%              63%

CONCLUSIONS

The implementation complexity of split-vector LSP VQ is reduced substantially from the single codebook case due to the use of smaller, relatively independent codebooks. This makes VQ of LSP parameters feasible. CELP-based, variable rate sVQ based on a high frequency entropy consideration is shown to produce objective results equivalent to a fixed-rate 18-bit sVQ scheme, and the variable rate sVQ scheme has an average rate of 17-18 bits/frame (SEGSNR). However, results of subjective testing indicate that the variable rate scheme can provide equivalent subjective quality at a rate of 16-17 bits per frame for the same configuration.

REFERENCES

[1] J. D. Gibson, S. P. Stanners, and S. A. McClellan, “Spectral entropy and coefficient rate for speech coding,” in Conf. Rec. 27th Annual Asilomar Conf., (Pacific Grove, CA), pp. 925-929, November 1993.

[2] S. McClellan and J. Gibson, “Spectral entropy: An alternative indicator for rate allocation?,” in Proc. IEEE Int. Conf. on Acoust., Speech, Signal Processing, (Adelaide, Australia), pp. 1.201-1.204, April 1994.

[3] S. McClellan and J. Gibson, “Variable rate tree coding of speech,” in Proc. IEEE Wichita Conf. on Commun., Networking, and Signal Processing, (Wichita, KS), pp. 134-139, April 1994.

[4] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY: Wiley, 1991.

[5] E. Wesfreid and M. V. Wickerhauser, “Adapted local trigonometric transforms and speech processing,” IEEE Trans. Signal Processing, vol. 41, pp. 3596-3600, December 1993.

[6] F. Itakura, “Line spectrum representation of linear predictive coefficients of speech signals,” J. Acoust. Soc. Am., vol. 57, no. 535(A), 1975.

[7] R. Gray, A. Buzo, A. H. Gray, Jr., and Y. Matsuyama, “Distortion measures for speech processing,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, pp. 367-376, August 1980.

[8] L. Rabiner and R. Schafer, Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall, 1978.

[9] F. K. Soong and B. H. Juang, “Line spectrum pair (LSP) and speech data compression,” in Proc. IEEE Int. Conf. on Acoust., Speech, Signal Processing, (San Diego, CA), pp. 1.10.1-1.10.4, March 1984.

[10] G. Kang and L. Fransen, “Application of line-spectrum pairs to low-bit-rate speech encoders,” in Proc. IEEE Int. Conf. on Acoust., Speech, Signal Processing, (Tampa, FL), pp. 244-247, March 1985.

[11] N. Sugamura and N. Farvardin, “Quantizer design in LSP speech analysis-synthesis,” IEEE J. Sel. Areas in Commun., vol. 6, pp. 432-440, February 1988.

[12] J. Gibson, M. Moodie, and S. McClellan, “Variable rate techniques for CELP speech coding,” in Conf. Rec. 29th Annual Asilomar Conf., (Pacific Grove, CA), November 1995.

[13] N. Sugamura, S. Sagayama, and F. Itakura, “A study on speech quality of synthesized speech by LSP,” Trans. Committee on Speech Res., ASJ, vol. H80-29, pp. 229-233, July 1980. (in Japanese).

[14] K. K. Paliwal and B. S. Atal, “Efficient vector quantization of LPC parameters at 24 bits/frame,” IEEE Trans. Speech and Audio Processing, vol. 1, pp. 3-14, January 1993.

[15] F. K. Soong and B. H. Juang, “Optimal quantization of LSP parameters,” IEEE Trans. Speech and Audio Processing, vol. 1, pp. 15-24, January 1993.