IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 5, NO. 2, MARCH 1997
Variable-Rate CELP Based on Subband Flatness
Stan McClellan, Member, IEEE, and Jerry D. Gibson, Fellow, IEEE
Abstract: Code-excited linear prediction (CELP) is the predominant methodology for communications quality speech coding below 8 kbps, and several variable-rate CELP schemes have been discussed in the literature, including QCELP, the variable-rate wideband digital cellular mobile radio speech coding standard specified in IS-95. A key component of these speech coders is the detection and classification of speech activity, and several cues for rate variation have been studied, such as measuring short-term speech energy, deciding whether the speech is voiced or unvoiced, or making more sophisticated phonetic classifications. We present a new method for rate variation based on a measure of subband spectral flatness, called spectral entropy. Spectral entropy is a normalized indicator of the texture of the input spectrum and is thus less dependent on speech and background noise energy variations. We present some results on the use of spectral entropy for voice activity detection across subbands and then evaluate using spectral entropy for deriving mode and rate allocation cues for a variable-rate CELP coder operating at an average rate of 2 kbps. To achieve communications quality speech at this rate, we develop a new split-band vector quantization (VQ) technique for representing the line spectral pairs and a multiple codebook approach for efficiently quantizing the coefficients of a three-tap pitch predictor, called lag-indexed VQ.
Index Terms: Code-excited linear prediction, vector quantization, speech coding (low-rate), variable-rate coding, entropy (spectral)
I. INTRODUCTION
WITH THE deployment of digital multiple-access schemes for telephony and personal communications networks as well as the proliferation of digital speech storage applications, the development of variable-rate speech coding techniques has emerged as a significant research area [5]. Due to the excellent subjective performance achievable with code-excited linear prediction (CELP) coders, this architecture has taken a predominant role in medium-rate and low-rate speech coding as evidenced by the adoption of fixed-rate standards such as U.S. Federal Standard 1016 (FS-CELP) [6], IS-54 (VSELP) [7], and ITU-T G.728 (LD-CELP) [8]. A well-known variable-rate CELP coder is QCELP, standardized as IS-95 by the Cellular Telecommunication Industry Association (CTIA) for wideband digital cellular mobile radio in North America [1]. In addition to QCELP, several variable-rate CELP architectures have been discussed in the literature. The most notable of these coders are based on
Manuscript received June 16, 1995; revised May 19, 1996. This research was supported in part by the NSF under Grant NCR-9303805, by the Texas Advanced Technology Program under Project No. 999903–017, and by a UAB Faculty Research Grant. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. W. Bastiaan Kleijn. S. McClellan is with the Department of Electrical and Computer Engineering, University of Alabama, Birmingham, AL 35298-4461 USA. J. D. Gibson is with the Department of Electrical Engineering, Texas A&M University, College Station, TX 77843 USA. Publisher Item Identifier S 1063-6676(97)01898-1.
an open-loop analysis of the speech signal to determine coding mode or state, and thereby bit allocation. Some multimode CELP structures use sophisticated cues to perform “phonetic segmentation” of the speech waveform [3], [4]. Regardless of the philosophy used for segmentation or mode cues, these coders are composed of two distinct components: a voice activity detector (VAD) and an approach for coding active speech. Both components are necessary for sophisticated schemes. Here, we present a new method for a VAD and variable-rate techniques for coding active speech that are not based on traditional measures of speech activity. Instead, we employ the entropy functional [9] to measure the gross shape of the short-term speech spectrum and to derive mode cues for a finite-state CELP coder. Our interest in the spectral entropy was motivated by the work of Campbell and his interpretation of spectral entropy as coefficient rate [10] and, recently, others have investigated spectral entropy for activity classification in transform coding. We present some results showing the utility of spectral entropy as a speech classification tool and point out some of its robustness properties. However, the main emphasis of the paper is the incorporation of spectral entropy as a mode selection and rate allocation cue in a variable-rate CELP coder. To achieve communications quality speech at an average rate of 2 kbps, our coder also includes two other innovations. One is that we modify the usual approach to split-band vector quantization (VQ) design for line spectrum pairs (LSP’s) to better match our two band spectral content decisions. The second is that we present a new multiple codebook approach to quantizing the parameters of a three-tap pitch predictor, called lag-indexed VQ. Our entropy-based VAD is discussed in Section II, where we contrast usual techniques with our results based on subband spectral flatness. 
Spectral entropy is defined and discussed, and conclusions are drawn regarding applications with additive background noise. Application of the entropy functional to bit allocation in a split-band vector quantization scheme is proposed and simulated in Section III. In Section IV, we briefly discuss our efficient pitch filter encoding method [11], [12], which incorporates the favorable characteristics of current pitch coding schemes while maintaining low average rate and enhancing synthetic speech quality. Experimental results with an implementation of the variable-rate CELP algorithm that uses spectral entropy both for voice-activity detection and for modulating the coding rate of spectral parameters and excitation are presented in Section V, and we conclude with a summary in Section VI.
II. SPECTRAL ENTROPY AND SPEECH ACTIVITY
Voice-activity detection is an important part of a variable-rate speech coder. Significant gains in average coding rate
1063–6676/97$10.00 1997 IEEE
MCCLELLAN AND GIBSON: VARIABLE-RATE CELP
Fig. 1. Split-band spectral entropies and various decision rules.
are achievable via appropriate reductions in the bit rate during “silence.” An often-used approach is to assume that the speech to be encoded is taken from one side of a balanced two-way conversation in which each participant speaks for 50% of the duration of the coding interval. Thus, the remaining 50% of the coding interval is “silence” and can be encoded at the lowest rate for which the coder is designed. This assumption of a voice activity factor (VAF) of 0.5 is conservative [13], [14]. Once the admissibility of the VAF is defined, the VAD must be designed to be robust to background noise and not prone to produce erroneous speech/nonspeech indications [5]. VAD schemes are often used in an open-loop configuration and are based on classical measures of speech activity such as presence of pitch, zero crossings, reflection coefficients, forward/backward prediction gains, subband energies, and so on. Here, we employ a VAD where the decision rules are derived from subband measures of spectral flatness using the concept of spectral entropy.
A. Spectral Entropy
Campbell showed how the asymptotic equipartition property (AEP) from information theory could be used to derive a quantity depending on spectral entropy that he interpreted as a coefficient rate for the underlying random process. We present an outline of Campbell’s derivation to motivate our approach [10]. See also Abramson [15]. For sample functions $x_i(t)$, $i = 1, \ldots, n$, of a stationary random process with power spectral density $S(f)$, Campbell expands the product of these sample functions in a Karhunen–Loeve expansion over the cube $[0, T]^n$. Using an AEP approach [9], he shows that for large $n$, this expansion can be partitioned into two sets, one set with average power close to that in the original product (the “typical” set), and another set with low average power. The number of terms or coefficients in this “typical” set is about $\exp(nH)$, where
$$H = -\sum_{k} \lambda_k \log \lambda_k \tag{1}$$
$\lambda_k$ are the eigenvalues of the process [10], and $\sum_k \lambda_k = 1$ [15]. Using a result of Grenander and Szego [16], Campbell shows that
$$\lim_{T \to \infty} \frac{H}{T} = -\int_{-\infty}^{\infty} S(f) \log S(f) \, df$$
which is a per-component, per-unit-time version of $H$. He then argues that a natural measure of the coefficient rate of the random process is
$$Q = \exp\left( -\int_{-\infty}^{\infty} S(f) \log S(f) \, df \right) \tag{2}$$
Note that the quantity in the exponent of (2) is the differential entropy of the random process power spectral density. Also note that $Q$ is quite different from the entropy power of [17], which is an often-used rate indicator.
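To make the coefficient rate of (2) concrete, the following minimal sketch evaluates a discrete analogue on two toy spectra. It is an illustration only: the function names are hypothetical, the integral is replaced by a sum over a handful of spectral bins (an assumption, not the derivation in [10]), and the bin values are made up.

```python
import math

def spectral_entropy(weights):
    """Entropy in nats of nonnegative weights after normalizing them to sum to one."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0.0]
    return -sum(p * math.log(p) for p in probs)

def coefficient_rate(weights):
    """Discrete stand-in (assumption) for (2): the exponential of the spectral entropy."""
    return math.exp(spectral_entropy(weights))

flat = [1.0] * 8              # flat toy spectrum over 8 bins
peaked = [8.0] + [0.01] * 7   # nearly all power concentrated in one bin

print(coefficient_rate(flat))    # close to 8: every bin matters
print(coefficient_rate(peaked))  # slightly above 1: one bin dominates
```

A flat spectrum yields a coefficient rate near the number of bins, while a strongly peaked spectrum yields a value near one, which is the behavior the flatness interpretation relies on.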
Fig. 2. Activity detection for LPC-10 and spectral entropy.
Motivated by Campbell’s result, we use an approximation for $H$ to derive some useful indicators of speech activity in terms of coefficient rate. However, instead of using the eigenvalues of the process in (1), we can truncate the sum and substitute the normalized discrete cosine transform (DCT) coefficients for $\lambda_k$, so that
$$H \approx -\sum_{k=1}^{M} \hat{c}_k \log \hat{c}_k \tag{3}$$
where, in (4), the $\hat{c}_k$ are obtained by normalizing the $c_k$ so that they are nonnegative and sum to one, and the $c_k$, $k = 1, \ldots, M$, are the ac coefficients of the DCT of an $N$-block of speech samples. In this case, (1) corresponds to the spectral entropy described in [18].
B. Simulations
Smoothed plots of the spectral entropy of (3) are shown in Fig. 1 for a fullband speech signal and the upper and lower halfband components of the signal that were produced by finite impulse response (FIR) filtering. Entropy measurements were taken from the signals at 5 ms spacings using 10 ms DCT windows (50% overlap). Also shown in the figure is a representative curve showing signal energy for the case of
very low background noise. Notice that careful interpretation of the split-band spectral entropy curves with respect to the fullband curve provides information similar to that contained in the energy curve. The regions of “silence” in the speech signal are indicated by nulls in the energy curve, and by broad peaks in the entropy curves. The regions of heavy voicing activity show increased energy and decreased spectral entropy. However, the range of the spectral entropy is normalized and so it does not suffer from the same difficulties as energy-based measures due to background noise. Further, a comparison of the lower halfband entropy and the fullband entropy reveals a rough estimate of the time-varying bandwidth of the signal. High correlation between the fullband and one of the halfband entropy curves indicates compaction of signal energy into that halfband, and very low entropy values describe the presence of a textured, or predictable spectrum. Conversely, high values of entropy indicate a flat spectrum. The lower curves of Fig. 1 indicate representative regions of halfband/fullband concentration and voice activity detection for the utterance which were determined using relative levels of the split-band spectral entropies. The entropy-based decision rule for the declaration of halfband regions is conceptually very simple. Whenever the entropy of the upper halfband signal is higher than the entropies of both the fullband and lower halfband, then the high-frequency spectrum is significantly flatter than the low-frequency spectrum, and so if high correlation exists between the fullband and lowband entropy
signals, we declare the segment to be confined to the lower halfband, or “halfbanded.” This type of segmentation has been used effectively in a scalar analysis-by-synthesis coder (“tree coder”) to determine segments which are candidates for 2:1 subsampling of the residual [19]. The entropy-based decision rule for “silence” or inactive regions is simpler than that of the halfband determination. When the fullband and both halfband entropies are very close together and exceed a “whiteness” threshold, we use this as an indication of the absence of significant correlation in the signal, and declare the segment to be “silence.” Clearly, the definitions of “very close together” and the “whiteness threshold” are ambiguous and to some extent application-dependent. Here, we define these quantities in terms of the smoothed time series of entropies. To smooth the raw time series of entropy values, which are computed from a 10 ms, Hamming windowed, 50% overlapped normalized DCT formulation, we use a moving average of five adjacent entropy values. For an instantaneous indication of silence, the smoothed entropy values for the fullband, lowband, and highband must be within 0.0375 of each other and greater than 0.95 or all must be above 0.9625. The instantaneous silence indications are then further smoothed with a three-point median filter to remove spikes.
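The silence rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the three inputs are assumed to be already-computed normalized entropy series (fullband, lowband, highband) at 5-ms spacing, and the handling of the series edges is an assumption. The thresholds (0.0375, 0.95, 0.9625), the five-point moving average, and the three-point median filter are taken from the text.

```python
def moving_average(values, width=5):
    """Moving average over `width` adjacent values (window is truncated at the edges)."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def instantaneous_silence(full, low, high, spread=0.0375, white=0.95, all_high=0.9625):
    """Silence if the three entropies are close together and white, or all very high."""
    vals = (full, low, high)
    close_and_white = (max(vals) - min(vals) <= spread) and min(vals) > white
    return close_and_white or min(vals) > all_high

def median3(flags):
    """Three-point median filter to remove isolated spikes (end points are kept as-is)."""
    out = list(flags)
    for i in range(1, len(flags) - 1):
        out[i] = sorted(flags[i - 1: i + 2])[1]
    return out

def silence_decisions(full, low, high):
    """Smooth the raw entropy series, apply the rule per instant, then median filter."""
    f, l, h = moving_average(full), moving_average(low), moving_average(high)
    raw = [instantaneous_silence(a, b, c) for a, b, c in zip(f, l, h)]
    return median3(raw)
```

For example, three entropy series hovering around 0.96 to 0.98 are flagged as silence at every instant, while a segment with a lowband entropy near 0.5 is not.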
Fig. 3. Alignment of activity detectors (female speaker).
C. Robustness to Additive Noise
In Fig. 1, the spectral entropy analysis was performed only on “clean” speech having an extremely high signal-to-noise ratio (SNR). The use of CELP coders in many environments requires the study of coding algorithms in the presence of significant ambient noise. In the noisy case, the robustness of variable-rate coders that make special use of “silent” segments to lower the encoding rate hinges on the discrimination between voice activity and background sounds. In [4], the QCELP rate-variation cues are shown to deteriorate quickly in the presence of noise, leaving the coder to operate at a high average rate. Here, we consider the use of the spectral entropy of (3) as a VAD in the presence of spectrally flat background noise, namely additive white Gaussian noise (AWGN). In particular, we note that a characteristic of the entropy functional is that it approaches its maximum value for target distributions which are “flat.” Thus, values of spectral entropy that are relatively large indicate a flat spectrum. We conclude that simultaneous indications of flatness (high entropy) in both the upper and lower halfband signals and the fullband signal are a reasonable indicator of the absence of speech activity in the presence of white noise. To validate this conclusion, we compare the V/UV logic of Federal Standard 1015 (LPC-10e) with a simple decision rule derived using spectral entropy. We use both decision rules over a range of noise variances for AWGN from “clean” speech to below 1 dB peak SNR. The results of this comparison are shown in Fig. 2 for several SNR’s and a particular utterance (female speaker). In the figure, high values of the “H” method and “LPC-10e” decision curves indicate inactive regions. Note that the spectral entropy decision rule (“H” method) produces activity indicators that are quite similar to those produced by
Fig. 4. Alignment of activity detectors (male speaker).
the LPC-10 logic and the reference (labeled ref UV) that was gathered by hand. Regions of significant differences between the two indicators (for example around 15 000 samples in the 1-dB curves and 7000 samples in the 10-dB curves) indicate detection of voicing transitions by the entropy method. Closer inspection of certain regions of the signal reveals that the entropy method may provide a more sensitive indicator of activity than is available with the LPC-10 technique. A “clean” segment of the utterance used to produce Fig. 2 is shown in Fig. 3. Notice that both detectors produce an accurate indication of the start and duration of the activity, whereas the “H” method seems to have better detection of the endpoint of the active region, especially in the presence of significant noise. Fig. 4 is similar to Fig. 3 but is taken from a different utterance (male speaker). Notice that the LPC detector completely misses the activity in the 1-dB noise case,
and clips an active segment short even in very high SNR, whereas the “H” method produces a more accurate indication of spectral activity. Our experiments with decision rules using spectral entropy reveal that this measure may be useful in determining practical rate allocation schemes for variable-rate coders. In particular, our results show the utility of the split-band spectral entropies in the determination of halfband concentration and voice activity, both of which are useful for encoding at low rates. Further, the use of spectral entropy as a simple voice activity detector seems to have a more robust response in the presence of additive white background noise than the sophisticated classification technique used in LPC-10e.
III. FORMANT CHARACTERIZATION
Effective rate variation during active speech is another necessary component of variable-rate encoders. The most notable approaches to this problem switch between different CELP configurations based on some segmentation criterion. For example, the phonetically segmented vector excitation coder (PSVXC) [3] and variable-rate phonetic segmentation (VRPS) [4] coders use a finite state machine to switch between CELP configurations that are optimized for various “phonetic segments” (silence, unvoiced, onset, voiced, etc.). Our approach here is similar to PSVXC, VRPS and QCELP in that it is a variable-rate CELP coder with fixed frame rates and variable bit allocations per frame. However, we use spectral flatness measures to focus coding efforts on efficient variable rate techniques for spectral parameters and excitation as well as for the VAD. We use open-loop estimates of spectral shape to roughly determine bandwidth, and bit allocations for spectral parameters and excitation vectors are derived from these measures.
A. Line Spectral Pairs
Efficient representation of the short-term spectrum is an important part of most speech coding techniques.
This is also an area in which variable-rate speech coding techniques can be applied with good success. For a given model order ($p$), linear predictive analysis of a time series results in a model which can be described by an all-pole filter
$$H(z) = \frac{1}{A(z)}$$
where the coefficients $a_k$, $k = 1, \ldots, p$, of the polynomial $A(z)$ are commonly referred to as the linear prediction coefficients (LPC’s). An equivalent representation of the LPC parameters is the set of line spectral frequencies (LSF’s) or line spectral pairs (LSP’s) [20], [21]. LSP’s have convenient properties relating to well-behaved dynamic range and preservation of filter stability and can be used to encode LPC spectral information more efficiently than other representations [22]–[24]. Many of the properties of LSP’s have been examined in alternative quantization schemes for LPC parameters [22], [24]–[26]. Vector quantization is especially appealing for LSP sets due to the high correlation between neighboring spectral
Fig. 5. 12-b VQ of lowpass segment, lowpass distortion measure.
lines. Unfortunately, quantization with a single codebook requires some 20 to 40 b/frame or set of coefficients. Some of the most interesting results that address the problem of implementing LSP VQ’s relate to quantization of independent subsets of LSP vectors that correspond roughly to upper and lower halfbands. This technique has been called “split-vector quantization” (sVQ) and can achieve transparent spectral quantization¹ with a fixed 24 b/frame. In two-band sVQ, the lower four LSP’s and upper six LSP’s are quantized with independent 12-b codebooks using a weighted Euclidean distance measure [26].
B. Variable-Rate Split VQ
Since the LSP’s conform to the definition of a distribution function (see the Appendix), a split-band entropy criterion on the unquantized LSP’s can be used to bypass the highband codebook search (upper six LSP’s) for vectors with limited high-frequency texture. In this case, for a lowband codebook containing fullband LSP information,² the lowband codebook can be searched with a distortion measure of higher dimension (wider bandwidth) to improve the wideband spectral match. To illustrate the effect of distortion measure bandwidth in the codebook search, Fig. 5 shows quantized and unquantized spectral envelopes of a tenth-order LSP representation. Two of the quantized envelopes used a single 12-b codebook, which was clustered using a fourth-order, lowpass distortion measure (low four LSP’s), but searched using a distortion criterion classified as either narrowband (low four LSP’s) or wideband (low eight LSP’s). These spectra are collinear. Also shown in the figure is the envelope quantized with fullband 24-b sVQ as in [26]. The texture of the spectrum of Fig. 5 indicates a lack of significant correlation at high frequencies, so a separate, high-quality, 12-b encoding of the upper six LSP’s is not necessary. The residual high-frequency information contained in a vector from the lowband-optimized codebook which has been trained with fullband information can be used for a slightly degraded representation of the fullband spectral envelope. For example, note that the envelopes for the 12-b narrowband and 12-b wideband distortion measures are identical in the case of Fig. 5, and that a 24-b sVQ encoding reduces the lowband SD by only 0.6 dB on [0, 1000] Hz. Also note that the minor highband distortion incurred by either 12-b encoding is more than 30 dB down from the level of the lowband spectral peak. The use of the entropy functional is convenient for describing the flatness of spectra since entropy is related to the discrimination of a target distribution with respect to the uniform distribution. The use of this functional on appropriately normalized spectral data (as in the case of spectral entropy in Section II) is helpful in describing the flatness of the spectrum with respect to an average energy level. Thus, an entropy-based criterion is a natural choice for a procedure which selectively bypasses the upperband sVQ codebook search to achieve a variable-rate scheme for encoding formant LSP parameters. In sVQ, the upper-band codebook is designed to minimize distortion on the six high-order LSP’s.
¹ Here, “transparent” is defined by conditions on the average spectral distortion (SD) for a large set of LPC test vectors [26].
² The weighted distance measure can be focused, or restricted to a particular subset of LSP’s by adopting zero-valued weights for LSP’s outside the subset of interest. In this case, the clustering process involves distance computations which use a subset of each vector whereas the centroid computation (including accumulation) still involves the entire vector of LSP’s. This allows for entries in an sVQ codebook to carry average LSP information for the entire frequency range of the training vectors in the cluster, although the characteristics of the cluster are determined by the limited-dimension distance measure.
Fig. 6. Fullband spectral distortion versus entropy.
Fig. 7. Splitband spectral distortion versus entropy.
A measure of necessity for the bits required by the upperband codebook indices can be obtained by computing the subband entropy from the six high-order LSP’s of the unquantized spectrum. Codebook information is unnecessary for spectra whose entropies exceed a preselected threshold, since a high value of entropy indicates a flat spectrum. In this case, residual high frequency information
contained in the lowband sVQ codebook may be adequate for fullband quantization. The average fullband, lowband, and highband SD incurred in such a variable-rate sVQ scheme are shown in Figs. 6 and 7 for a range of upperband entropy thresholds. The entropy values were normalized according to the size of the upperband LSP alphabet, and the coding distortion was computed as an ensemble average of SD for a large collection of LSP vectors. These vectors were encoded with our variable-rate sVQ scheme at each of several upperband entropy thresholds. At each entropy threshold, the SD per-vector was computed for the frequencies spanned by the low four and high six LSP’s as well as for the fullband case. Frequency-weighted distortion measures of dimension 5, 8, or 10 were used in searching the lowband codebook³ for vectors where the entropy threshold for the high six LSP’s was exceeded (i.e., the upperband spectral envelope was sufficiently “flat”). The separate curves in the figures demonstrate the behavior of SD versus entropy threshold for the various distortion measures. Note that the entropy threshold is equivalent to average coding rate here, since it represents a tolerance level for requiring either a 12-b or a 24-b formant encoding. Note from Fig. 7 that the eight-dimensional distortion criterion (labeled “8-dim”) maintains lowband performance which is very close to that of the five-dimensional case for all entropy thresholds. This is not true of the ten-dimensional distortion, which sacrifices lowband performance to improve the fullband match. Further, note that the 8-dim. distortion measure produces average SD that is better than the 5-dim. SD in the high-frequency band. This is due to the wider bandwidth of the 8-dim. distortion measure. Thus, the 8-dim. distortion measure achieves a lowband distortion comparable to the 5-dim. distortion and much better than the 10-dim. distortion. The 8-dim. distortion also greatly improves the highband SD over that achieved by the 5-dim. distortion.
³ The codebook was clustered by minimizing frequency-weighted distortion for the low four LSP’s but centroid computations (accumulation, averaging) used information from all ten LSP’s.
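The mode decision and restricted-dimension codebook search described in this section can be sketched as follows. This is a toy illustration under stated assumptions rather than the coder's implementation: the function names are hypothetical, the upperband entropy is taken as a precomputed input (its computation from the LSP's is defined in the Appendix and is not reproduced here), simple 0/1 weights stand in for the frequency-weighted distortion measure of [26], and the codebooks in any real system would have 4096 entries each rather than the tiny lists used for testing.

```python
def weighted_distance(x, y, weights):
    """Weighted squared Euclidean distance; zero weights exclude LSP's from the measure."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights))

def search(codebook, target, weights):
    """Index of the code vector closest to the target under the weighted measure."""
    return min(range(len(codebook)),
               key=lambda i: weighted_distance(target, codebook[i], weights))

def encode_lsp(lsp, upper_entropy, threshold, low_cb, high_cb):
    """Return (indices, nominal bits) for one tenth-order LSP vector.

    low_cb entries are assumed to hold all ten LSP's (fullband centroids);
    high_cb entries hold only the upper six.
    """
    if upper_entropy > threshold:
        # Flat upper band: skip the upperband search and look up the lowband
        # codebook with a wider (eight-dimensional) distortion measure.
        wide = [1.0] * 8 + [0.0] * 2
        return (search(low_cb, lsp, wide),), 12
    narrow = [1.0] * 4 + [0.0] * 6
    low_idx = search(low_cb, lsp, narrow)
    high_idx = search(high_cb, lsp[4:], [1.0] * 6)
    return (low_idx, high_idx), 24

def average_rate(upper_entropies, threshold):
    """Average b/frame when each frame costs 12 b (flat upper band) or 24 b."""
    bits = [12 if h > threshold else 24 for h in upper_entropies]
    return sum(bits) / len(bits)
```

Raising the threshold sends more frames through the 24-b path, which is why the threshold acts as a proxy for the average coding rate in the figures.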
TABLE I CODEBOOK AND DISTORTION CONFIGURATIONS FOR SPLIT-VQ
TABLE II OBJECTIVE RESULTS FOR SPLIT-VQ ENCODING (FEMALE SPEAKER)
TABLE III OBJECTIVE RESULTS FOR SPLIT-VQ ENCODING (MALE SPEAKER)
C. Experimental Results
We designed codebooks for the variable 12/24-b, fixed 24-b, and fixed 18-b sVQ’s of tenth-order LSP’s as shown in Table I. These codebooks were used in a variable-rate CELP coder to obtain the following objective and subjective results. During encoding, the CELP excitation gain and long-term spectral parameters were not quantized. Each LPC computation (20 ms intervals) had two lag updates and four pitch coefficient updates, as well as four searches through the excitation codebook. The excitation codebook contained 256 center-clipped Gaussian vectors. The entropy values used in the variable-rate 12/24-b scheme were normalized to [0,1], and upperband codebook searches were bypassed for spectra with entropies greater than the threshold. Based on the observations pertaining to the lowband and fullband SD versus entropy (Figs. 6 and 7), we use Paliwal’s peak-weighted distortion measure of dimension eight for searching the sVQ lowband codebook during frames where upperband encoding is not required. Simulation results for our CELP system for a male and female speaker are shown in Tables II and III for a small range of sVQ entropy thresholds. Also shown are results for fixed-rate 24-b, 18-b, and 12-b fullband sVQ encoding, and for
unquantized LSP’s. The results for segmental SNR (SNRSEG) and the female speaker indicate that the variable-rate sVQ is objectively equivalent to the fixed 18-b sVQ at an entropy threshold of around 0.85, which corresponds to an average rate of around 19 b per frame. Equivalent SNRSEG for the male speaker occurs around an entropy threshold of 0.78 which corresponds to an average rate of about 16 b per frame. For the usual SNR, 18-b equivalence occurs close to 0.81 (16.9 b/frame) for the female, and around 0.79 (16.2 b/frame) for the male. The 18-b equivalence for frequency-weighted SNR occurs around 0.8 (16 b) for the female and around 0.8 (17 b) for the male. These rows are indicated in the tables. However, objective performance measures are often misleading in speech processing research, and in assessing the performance of complex and nonstandard techniques, subjective measures are sometimes easier to interpret. To complete our analysis, we performed a paired comparison test between the fixed rate 18-b sVQ CELP-coded sequences and the variable-rate sVQ sequences for several entropy thresholds. Untrained participants chose which component of the pair “sounded better,” and these results were compiled based on a percentage of responses favorable to variable-rate sVQ. Results of these comparisons are shown in Table IV. In the table, the column labeled “Entropy thr” contains the value of entropy which was used to determine the coding mode (12-b or 24-b) for each frame of LSP’s, based on the entropy of the unquantized, high-order 6 LSP’s for the frame. This threshold produces an average encoding rate for a sequence, which is shown in the column labeled “Avg. Rate” in b/frame. The columns labeled “(18-b,sVQ)” and “(sVQ,18-b)” denote the percentage of listeners who preferred the variable-rate sVQ scheme over the fixed-rate 18-b sVQ scheme in an (A,B) paired comparison test. 
The results are shown separately for both (A,B) and (B,A) presentations of the sequences in the test since the order of the pairings seemed to be important in the perception of quality, especially for the male speaker. The rows highlighted in the table denote the lowest entropy threshold tested below which the 18-b sVQ was preferred in both (A,B) presentations and overall. Above the highlighted rows, the variable-rate sVQ scheme was preferred either overall or consistently in a particular pair presentation. In the listening test, 30 untrained listeners chose between 18 different pairs of sentences in both (A,B) and (B,A) presentation for a total of 36 pairs. This experiment was repeated for several different Harvard sentences and the average results for male and female speakers are reported here. Of the 18 (A,B) pairs for each utterance, one pair contrasted the unquantized sequence with the 18-b sVQ. Fifteen (A,B) pairs contrasted the 18-b sVQ with the variable-rate sVQ at an entropy threshold between 0.74 and 0.88. Two additional pairs contrasted the unquantized sequence with the variable-rate sVQ at entropy thresholds of 0.70 and 0.90. A subset of these results for the 15 entropy thresholds of interest are listed in Table IV. The results comparing with the unquantized sequence were used for “sanity checks.” In most cases, the variable-rate sVQ speech quality was at least comparable to the quality of the speech reconstructed using the 18-b codebook, since the variable-rate scheme used 24 b in wideband segments, whereas a maximum of 18 b was available in the fixed-rate scheme. 50% equivalence between fixed-rate 18-b sVQ and variable-rate 12/24-b sVQ seems to occur around an entropy threshold of 0.8 (16.4 b) for the female and around 0.8 (17 b) for the male speaker. These rows are highlighted in the table and the equivalent rates are similar to the objective results. Significant degradation occurs in variable-rate sVQ when a very low entropy threshold is used since variable-rate sVQ tends toward a 12-b, fixed-rate, single codebook scheme where the codebook is optimized for the low four LSP’s.
TABLE IV VARIABLE-RATE SVQ PREFERENCE FROM PAIRED COMPARISON WITH 18-B FIXED-RATE SYSTEM
D. sVQ Conclusions
The computational complexity of split-vector LSP VQ is reduced from the single codebook case due to the use of smaller, relatively independent codebooks. This makes VQ of LSP parameters feasible. CELP-based, variable-rate sVQ based on an upper-band entropy consideration is shown to produce objective results equivalent to a fixed 18-b sVQ scheme with an average rate of 17–18 b/frame (SNRSEG). However, results of subjective testing indicate that the variable-rate scheme can provide equivalent CELP quality at a rate of 16–17 b/frame for the same CELP configuration.
IV. LAG-INDEXED VQ
Lag-indexed VQ (LIVQ) is an approach to the encoding of multitap pitch filter coefficients that incorporates the favorable characteristics of current single-tap pitch coding schemes while maintaining low average rate and enhancing synthetic speech quality [11], [12]. In LIVQ, we use the nonuniform distribution of lags in a large training set to guide the training and bit allocation of VQ codebooks instead of enumerating noninteger lags as has been discussed in the literature [27]. This improves the performance of the pitch predictor by making use of the prediction gain of higher order filters as well as the inherent noninteger lag resolution of the multitap configuration. In short, we train separate multiple tap codebooks that use training data for particular ranges of lags. This focuses the distortion measure of the training process on codebook elements which are representative of pitch filters for specific lags instead of for all possible lags. In this fashion, fewer vectors per codebook are sufficient to produce a high code vector density per lag, and this produces an acceptable level of distortion for the filter coefficients. Thus, we can design several independent codebooks each requiring fewer bits, and the number of bits and filter order can be tailored to the disjoint subsets of lags. This produces a variable-rate, variable-dimension scheme that is driven by the estimated pitch lag. The coefficient selection (codebook search) can be optimized in an open-loop or closed-loop fashion based on the computational requirements of the target system. The error minimization can also be performed jointly with the lag estimation, if desired. In related testing with other CELP coders, LIVQ has provided objective and subjective results equivalent to the results obtained using a single 7-b VQ, which is updated at the same frequency and which requires 50% more bits per frame.
To further examine this approach, we implemented an LIVQ scheme in the FS-1016 algorithm in place of the usual adaptive codebook search, and demonstrated that the subjective performance of the FS-1016 algorithm can be improved without an increase in encoding rate via the incorporation of an LIVQ pitch filter encoder. We have also shown that consistent subjective performance can be maintained at a lower encoding rate when LIVQ techniques are employed in FS-1016 [28].
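The lag-driven codebook selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the lag ranges, bit allocations, and filter orders below are hypothetical, the codebooks are filled with random vectors in place of trained ones, and the search is a plain open-loop nearest-neighbor match on the coefficient vector. The names `LagBand`, `make_band`, `select_band`, `encode`, and `decode` are introduced only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class LagBand:
    lo: int         # smallest pitch lag covered by this codebook (inclusive)
    hi: int         # largest pitch lag covered by this codebook (inclusive)
    codebook: list  # coefficient vectors (tuples), all of the same dimension


def make_band(lo, hi, bits, taps, rng):
    """Build a band with 2**bits vectors of dimension `taps`.

    Random vectors stand in for a codebook trained on pitch filters
    whose lags fall in [lo, hi]; real training is outside this sketch.
    """
    codebook = [tuple(rng.uniform(-1.0, 1.0) for _ in range(taps))
                for _ in range(2 ** bits)]
    return LagBand(lo, hi, codebook)


def select_band(bands, lag):
    """Return the index of the band whose lag range contains `lag`."""
    for idx, band in enumerate(bands):
        if band.lo <= lag <= band.hi:
            return idx
    raise ValueError(f"lag {lag} is not covered by any band")


def encode(bands, lag, target):
    """Open-loop search: nearest code vector within the lag's own codebook.

    Returns (band index, code vector index, bits spent on the index).
    """
    band_idx = select_band(bands, lag)
    codebook = bands[band_idx].codebook
    if len(target) != len(codebook[0]):
        raise ValueError("target dimension does not match this band's filter order")
    best = min(range(len(codebook)),
               key=lambda i: sum((c - t) ** 2 for c, t in zip(codebook[i], target)))
    bits = (len(codebook) - 1).bit_length()
    return band_idx, best, bits


def decode(bands, lag, index):
    """Recover the coefficient vector from the lag and the transmitted index."""
    return bands[select_band(bands, lag)].codebook[index]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical allocation: more bits and taps for short lags, fewer for long lags.
    bands = [make_band(20, 39, 5, 3, rng),
             make_band(40, 79, 4, 3, rng),
             make_band(80, 147, 3, 1, rng)]
    print(encode(bands, 45, (0.1, 0.7, 0.1)))
```

In this sketch the lag alone selects the codebook, so the decoder repeats `select_band` on the received lag; both the index length and the vector dimension then vary from subframe to subframe with the lag.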
TABLE V BIT ALLOCATIONS FOR THE VARIABLE-RATE CELP CODER. EACH NONSILENCE FRAME CONTAINS FOUR EXCITATION SUBFRAMES
TABLE VI SYSTEM CONFIGURATION FOR 28-MSEC FRAME LENGTH
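As a rough check on the 28-ms configuration, the bit budget implied by the 2-kbps average target and the 50% voice-activity assumption can be worked out in a few lines. This is only an arithmetic sketch: the function names are illustrative, `silence_bits` is a hypothetical parameter (the actual per-mode allocations are those of Table V and are not reproduced here), and `mean_lsp_bits` assumes the 12-b lowband and 12-b upperband LSP codebooks described in Section V.

```python
def bits_per_frame(rate_bps, frame_ms):
    """Average number of bits available per frame at a given average rate."""
    return rate_bps * frame_ms / 1000.0


def active_frame_budget(avg_bits, voice_activity, silence_bits):
    """Bits per active frame, given the average budget and the silence cost.

    Solves avg = vaf * active + (1 - vaf) * silence for `active`.
    `silence_bits` is a hypothetical value, not taken from Table V.
    """
    if not 0.0 < voice_activity <= 1.0:
        raise ValueError("voice_activity must lie in (0, 1]")
    return (avg_bits - (1.0 - voice_activity) * silence_bits) / voice_activity


def mean_lsp_bits(fullband_fraction, lowband_bits=12, upperband_bits=12):
    """Average LSP bits per frame when only some frames send the upperband index."""
    return lowband_bits + fullband_fraction * upperband_bits


if __name__ == "__main__":
    avg = bits_per_frame(2000, 28)             # 2 kbps, 28-ms frames
    print(avg)                                 # average bits per frame
    print(active_frame_budget(avg, 0.5, 0.0))  # upper bound if silence cost nothing
    print(mean_lsp_bits(0.5))                  # balanced fullband/halfband frames
```

With balanced fullband and halfband frames, `mean_lsp_bits(0.5)` reproduces the figure of about 18 b/frame for formant parameters quoted in Section V.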
V. VARIABLE-RATE CELP
The combination of our spectral entropy VAD and variable-rate sVQ, as well as the use of LIVQ for pitch parameters in a CELP architecture, results in a multimode coder that draws its primary coding mode cues from subband measures of spectral flatness. Particular information about the configuration of the coder for a 28-ms LPC frame rate is shown in Table VI, and Table V shows the bit allocations for each of the four coding modes. In CELP coders, the choice of frame length can be used to modulate the encoding rate. Unfortunately, long frames produce an objectionable muffling in the encoded speech, especially in segments that are evolving quickly. However, achieving the desired average rate with short frames requires a very low sVQ entropy threshold, so that the 12-b upperband codebook index is seldom transmitted. This severely limits the opportunities available to the coder to use more than the 12 b provided by the lowband sVQ codebook, and the encoded speech quality suffers. The purpose of the variable-rate sVQ is to allow the coder to use the full
24-b resolution of the split LSP codebooks when necessary to maintain optimal quality. Thus, for best quality and rate performance we compromise and use relatively long 28-ms frames to achieve the desired 2-kbps average encoding rate and maintain a high entropy threshold for variable-rate sVQ. Frames occurring during a transition from silence to a voiced coding mode can be processed in a special fashion. These frames are often called onsets, since they represent the onset of an active voicing segment. Onset coding has been studied carefully in [3] and [4], and has been shown to be important in low-rate speech coding. Usual approaches include special codebooks, faster spectral updates, and higher encoding rates to combat the nonstationary and quickly evolving nature of the onset waveform. The processing used here for onset frames includes fast, low-resolution, fullband updates of LPC parameters as well as a limited pitch filter representation and a large, sparse excitation codebook. The determination of onset frames is performed at 5-ms increments based on the output of the VAD, and so does not incur further framing delays. A transition from silence to voiced mode between 5-ms VAD subframe boundaries causes the entire frame to be immediately declared as an onset frame, and processing proceeds accordingly. Simulation results that illustrate the effectiveness of our variable-rate coder for a male and a female speaker are shown in Tables VII and VIII. Similar results were obtained for other Harvard sentences. The tables show SNR, segmental SNR, and frequency-weighted SNR for encoding sequences in a fully quantized CELP configuration. In the tables, the data is indexed by the sVQ entropy threshold, and two encoding rates are shown. The column labeled “Actual” rate is the encoded rate for the limited-duration Harvard sequence. 
The column labeled “VAF” is the encoded rate with the sequence artificially extended to correspond to the standard assumption of a 50% voice-activity factor. It is interesting to note from the tables the implicit relationship between the proportion of fullband/halfband frames and the encoding rate for all sentences. The target encoding rate for this research was an average of 2 kbps. This target rate was achieved around an entropy threshold of 0.80 for each of the Harvard sentences. Around this entropy threshold, the proportion of fullband and halfband frames for each utterance was approximately balanced, resulting in an average rate of about 18 b/frame for formant parameters. The ensemble results of variable-rate sVQ for lowband SD with 8-dim. distortion (Fig. 7) start to diverge from the 5-dim. result around this entropy threshold. This indicates an inability of the lowband LSP codebook to maintain the same coding fidelity for the wideband distortion measure as it can for the narrowband distortion (the distortion criterion for which it was optimized during design) as the entropy threshold decreases. Higher entropy thresholds utilize the lowband-optimized codebook for wideband coding a smaller proportion of the time, and only for coding spectral envelopes that do not have significant high-frequency texture. Thus, slight envelope errors are less audible and occur less frequently as the entropy threshold increases (the definition of highband “flatness” becomes stricter). Conversely, envelope errors tend to be more audible/objectionable and occur more frequently as the entropy threshold decreases (the definition of highband “flatness” loosens). In the case of a low entropy threshold, a proportion of the LSP vectors encoded with only the lowband codebook have significant high-frequency spectral texture. The wideband distortion criterion for the lowband codebook takes into account the high-frequency peaks, and this degrades the fidelity of the low-frequency envelope. These low-frequency errors tend to be audible as rumbles or large amplitude distortions. Fortunately, the lowpass nature of voiced speech tends to produce a significant proportion of LPC frames that have limited high-frequency texture. These frames can be coded efficiently with the lowband codebook using a wideband distortion. In this fashion, the use of an entropy criterion on the upperband codebook guarantees the availability of increased resolution for the spectral envelope when it is necessary and allows for low-rate coding otherwise.
TABLE VII OBJECTIVE RESULTS (FEMALE, 28-MSEC FRAMES)
TABLE VIII OBJECTIVE RESULTS (MALE, 28-MSEC FRAMES)
VI. CONCLUSIONS
We have discussed variable-rate speech coding in a CELP-style architecture that uses entropy-based measures of spectral flatness to derive primary coding mode cues. This coder produces communications quality encoded speech at an average rate of around 2 kbps with an appropriate sVQ entropy threshold and the assumption of a 50% voice-activity factor. In addition to proposing a novel and robust voice-activity detection scheme derived from DCT-based spectral entropy, we have explored the use of the entropy functional on other spectral representations, including the set of LSP’s. We have demonstrated a variable-rate technique that minimizes subjective coding distortion and rate for sVQ of LSP parameters during LPC frames that have narrowband spectral texture. Our focus on coding methods for active speech also includes the development of a variable-rate, variable-dimension, lag-driven scheme for quantizing pitch filter coefficient vectors [11], [12]. The technique, which we call lag-indexed VQ, uses the distribution of lags for a large collection of utterances to influence the bit allocation and training strategy for several independent codebooks of pitch filter coefficients. This approach has been shown to produce subjective improvements in CELP-style coders [12], [28] without a corresponding increase in bit rate.
APPENDIX
The interlacing property of the roots of and coupled with the normalization of the sampling frequency provides an interpretation of the resonances of the spectral envelope as a probability distribution, and so the set of differences between successive LSP locations corresponds to a probability mass function (pmf). With this interpretation, various information-theoretic tools such as entropy, mutual information, and relative entropy [9] are applicable to the pmf of vocal tract resonances. Using the notation , it is clear that and . Due to the interlacing property, , and so for
we have that for . Since is (strictly) increasing and takes minimum value 0 and maximum value 1, can be interpreted as a discrete distribution function that represents the locations of the variable roots of . The heights of the discontinuities in are probabilities regarding the containment of spectral resonances (poles) in the angular region between consecutive LSF’s. So, for
Pr the th resonance of the vocal tract
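The pmf interpretation above can be made concrete with a short numerical sketch. The notation here is assumed, since the symbols of the Appendix are not reproduced in this text: the LSFs are taken as angular frequencies in (0, π), normalized to (0, 1), and augmented with the endpoints 0 and 1 so that the successive differences sum to one. Dividing the entropy by the logarithm of the number of intervals is one possible normalization, not necessarily the one used in the paper.

```python
import math


def lsp_difference_pmf(lsf_radians):
    """Differences between successive normalized LSFs, including the endpoints.

    Assumes angular LSFs strictly increasing inside (0, pi); the interlacing
    of the LSP roots is what guarantees this ordering.
    """
    w = [x / math.pi for x in lsf_radians]
    edges = [0.0] + w + [1.0]
    if any(b <= a for a, b in zip(edges, edges[1:])):
        raise ValueError("LSFs must be strictly increasing inside (0, pi)")
    return [b - a for a, b in zip(edges, edges[1:])]


def normalized_entropy(pmf):
    """Shannon entropy of the pmf divided by its maximum, log(len(pmf))."""
    h = -sum(p * math.log(p) for p in pmf if p > 0.0)
    return h / math.log(len(pmf))


if __name__ == "__main__":
    # Evenly spaced LSFs: every interval is equally likely (flat envelope).
    flat = [k * math.pi / 11 for k in range(1, 11)]
    # Closely spaced LSF pairs: a few narrow intervals (peaky envelope).
    peaky = [0.30, 0.32, 1.00, 1.02, 2.00, 2.02]
    print(normalized_entropy(lsp_difference_pmf(flat)))
    print(normalized_entropy(lsp_difference_pmf(peaky)))
```

Evenly spaced LSFs give the largest normalized entropy, while closely spaced pairs, which accompany spectral peaks, pull it down.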
For a more intuitive interpretation of the pmf described by the differences in LSP locations, , it is convenient to work with . This assigns higher probability to small LSP differences which correspond to spectral peaks.
REFERENCES
[1] W. Gardner, P. Jacobs, and C. Lee, “QCELP: A variable rate speech coder for CDMA digital cellular,” in Speech and Audio Coding for Wireless Networks, B. S. Atal, V. Cuperman, and A. Gersho, Eds. Boston, MA: Kluwer, 1993, pp. 85–92.
[2] S. Vaseghi, “Finite state CELP for variable rate speech coding,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Albuquerque, NM, Apr. 1990, pp. 37–40.
[3] S. Wang and A. Gersho, “Phonetically-based vector excitation coding of speech at 3.6 kbps,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Glasgow, UK, May 1989, pp. 49–52.
[4] E. Paksoy, K. Srinivasan, and A. Gersho, “Variable rate speech coding with phonetic segmentation,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Minneapolis, MN, Apr. 1993, pp. II.155–II.158.
[5] A. Das, E. Paksoy, and A. Gersho, “Multimode and variable-rate coding of speech,” in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds. Amsterdam, The Netherlands: Elsevier, 1995, pp. 257–288.
[6] J. Campbell, V. Welch, and T. Tremain, “The new 4800 bps voice coding standard,” in Proc. Military and Government Speech Technology, Nov. 1989, pp. 735–737.
[7] I. Gerson and M. Jasiuk, “Vector sum excited linear prediction (VSELP) speech coding at 8 kb/s,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Albuquerque, NM, Apr. 1990, pp. 461–464.
[8] J.-H. Chen, “High-quality 16 kb/s speech coding with a one-way delay less than 2 ms,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Albuquerque, NM, Apr. 1990, pp. 453–456.
[9] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[10] L. L. Campbell, “Minimum coefficient rate for stationary random processes,” Inform. Contr., vol. 3, pp. 360–371, 1960.
[11] S. McClellan and J. Gibson, “Variable rate CELP based on subband flatness,” in Proc. IEEE Int. Conf. Commun., Seattle, WA, June 1995, pp. 1409–1413.
[12] S. McClellan and J. Gibson, “Lag-indexed VQ for pitch filter coding,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Atlanta, GA, May 1996.
[13] P. Brady, “A technique for investigating on-off patterns of speech,” Bell Syst. Tech. J., vol. 44, pp. 1–22, 1965.
[14] Y. Yatsuzuka, “A highly sensitive speech detector and high-speed voiceband data discriminator in DSI-ADPCM systems,” IEEE Trans. Commun., vol. 30, pp. 739–750, Apr. 1982.
[15] N. Abramson, “Information theory and information storage,” in Proc. Symp. System Theory, Brooklyn, NY, Polytech. Inst. Brooklyn, Apr. 1965, vol. XV, pp. 207–213.
[16] U. Grenander and G. Szego, Toeplitz Forms and Their Applications. Berkeley, CA: Univ. Calif. Press, 1958.
[17] J. D. Gibson, S. P. Stanners, and S. A. McClellan, “Spectral entropy and coefficient rate for speech coding,” in Rec. 27th Ann. Asilomar Conf., Pacific Grove, CA, Nov. 1993, pp. 925–929.
[18] R. Mester and U. Franke, “Spectral entropy-activity classification in adaptive transform coding,” IEEE J. Select. Areas Commun., vol. 10, pp. 913–917, June 1992.
[19] S. McClellan and J. Gibson, “Variable rate tree coding of speech,” in Proc. IEEE Wichita Conf. Commun., Networking, and Signal Processing, Wichita, KS, Apr. 1994, pp. 134–139.
[20] F. Itakura, “Line spectrum representation of linear predictive coefficients of speech signals,” J. Acoust. Soc. Amer., vol. 57, no. 1975.
[21] H. Wakita, “Linear prediction voice synthesizer: Line-spectrum pair (LSP) is the newest of several techniques,” Speech Technol., vol. 1, pp. 17–22, Fall 1981.
[22] F. K. Soong and B. H. Juang, “Line spectrum pair (LSP) and speech data compression,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, San Diego, CA, Mar. 1984, pp. 1.10.1–1.10.4.
[23] G. Kang and L. Fransen, “Application of line-spectrum pairs to low-bitrate speech encoders,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Tampa, FL, Mar. 1985, pp. 244–247.
[24] N. Sugamura and N. Farvardin, “Quantizer design in LSP speech analysis-synthesis,” IEEE J. Select. Areas Commun., vol. 6, pp. 432–440, Feb. 1988.
[25] F. K. Soong and B. H. Juang, “Optimal quantization of LSP parameters,” IEEE Trans. Speech Audio Processing, vol. 1, pp. 15–24, Jan. 1993.
[26] K. K. Paliwal and B. S. Atal, “Efficient vector quantization of LPC parameters at 24 b/frame,” IEEE Trans. Speech Audio Processing, vol. 1, pp. 3–14, Jan. 1993.
[27] P. Kroon and B. S. Atal, “On improving the performance of pitch predictors in speech coding systems,” in Advances in Speech Coding, B. S. Atal, V. Cuperman, and A. Gersho, Eds. Boston, MA: Kluwer, 1991, pp. 321–327.
[28] K. Rutherford, S. McClellan, and R. Adhami, “Improving the performance of Federal Standard 1016 (CELP),” in Proc. IEEE Southeast Conf., Tampa, FL, Apr. 1996.
Stan McClellan (SM’90–M’95) received the B.S. degree in electrical engineering from Texas A&M University, College Station, in 1986. Following a career in the aerospace and defense industry with LTV Missiles & Electronics, Grand Prairie, TX, and General Dynamics, Ft. Worth, TX, he returned to Texas A&M under a graduate fellowship where he received the M.S. and Ph.D. degrees in 1991 and 1995, respectively, in electrical engineering. At Texas A&M, he also served as a Research Assistant in the Telecommunications, Control, and Signal Processing Research Center, and as an Assistant Lecturer for the Department of Electrical Engineering. He has been with the University of Alabama, Birmingham (UAB), since July, 1995, where he is an Assistant Professor in the Department of Electrical and Computer Engineering, and a Research Engineer in the UAB Center for Telecommunications Education and Research. His current research interests include digital signal processing, data compression, time series analysis, information theory, and high-speed computer networks. Dr. McClellan is a member of the IEEE Societies of Signal Processing, Communications, and Information Theory.
Jerry D. Gibson (F’92) received the B.S. degree from the University of Texas, Austin, in 1969, and the M.S. and Ph.D. degrees from Southern Methodist University, Dallas, TX, in 1971 and 1973, respectively, all in electrical engineering. He was with General Dynamics, Fort Worth, TX, from 1969 to 1972, The University of Notre Dame, South Bend, IN, from 1973 to 1974, and the University of Nebraska, Lincoln, from 1974 to 1976. He currently holds the J. W. Runyon, Jr., Professorship in the Department of Electrical Engineering at Texas A&M University. His research interests include data, speech, image, and video compression; multimedia over networks; wireless communications; information theory; and digital signal processing. He is coauthor of Introduction to Nonparametric Detection with Applications (New York: Academic, 1975; and IEEE Press, 1995). He is author of Principles of Digital and Analog Communications (Englewood Cliffs, NJ: Prentice-Hall, 1993). He was Associate Editor for speech processing for IEEE TRANSACTIONS ON COMMUNICATIONS from 1981 to 1985 and Associate Editor for communications for IEEE TRANSACTIONS ON INFORMATION THEORY from 1988 to 1991. He is Editor-in-Chief of The Mobile Communications Handbook (CRC, 1995), Editor-in-Chief of The Communications Handbook (CRC Press, 1996), and Editor of the IEEE Press Series on Signal Processing. Dr. Gibson has served as a member of the Speech Technical Committee of the IEEE Signal Processing Society (1992–1995) and is currently a member of the IEEE Information Theory Society Board of Governors (1990–1996). He is a member of the Editorial Board for PROCEEDINGS OF THE IEEE. He currently serves as president of the IEEE Information Theory Society. In 1990, he received the Frederick Emmons Terman Award from the American Society for Engineering Education. In 1992, he was elected Fellow of the IEEE. He was co-recipient of the 1993 IEEE Signal Processing Society Senior Paper Award for speech processing.