IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 7, NO. 1, JANUARY 1999
Efficient Pitch Filter Encoding for Variable Rate Speech Processing
Stan McClellan, Member, IEEE, Jerry D. Gibson, Fellow, IEEE, and B. Keith Rutherford
Abstract—Analysis-by-synthesis techniques are used in a wide variety of speech coding standards and applications for rates below 16 kbps. The presence of a long-term predictor, commonly known as the adaptive codebook, is critical to coder performance at the lower rates. Unfortunately, the encoding rate and computational requirements for high-quality encoding of pitch filter parameters can be excessive. Several popular approaches explore the trade-off between predictor order, allocated bit rate, and computational requirements for long-term predictor optimization. Here, we investigate the relative performance of several long-term predictor structures and present a new approach to vector quantization of pitch filter coefficients having subjective quality equivalent to other schemes, but at a lower coding rate and requiring significantly less closed-loop computation. Performance is evaluated in a variable-rate CELP coder at an average rate of 2 kbps and in Federal Standard 1016 CELP.
Index Terms—Pitch filter, quantization, variable rate, vector quantization.
I. INTRODUCTION
ANALYSIS-BY-SYNTHESIS techniques represent a significant advance in medium- and low-rate speech coding, and they are used in a wide variety of standards and applications for rates below 16 kbps [1]. The presence of a long-term or pitch predictor, sometimes called the adaptive codebook, is critical to coder performance at the lower rates. Unfortunately, the rate required for high-quality encoding of pitch filter parameters is often a large fraction of the total available coder bandwidth, especially for very low rate coders [2], [3]. The application of analysis-by-synthesis techniques to pitch filter optimization can also produce high computational requirements [4]. Here, we investigate the performance of a new approach to vector quantization (VQ) of pitch filter parameters. This new approach has subjective quality equivalent to other VQ coding schemes at a lower rate, and requires significantly less computation in an analysis-by-synthesis architecture. The performance of the method is evaluated in the context of a variable-rate CELP coder operating at an overall average rate of 2 kbps and in the context of the fixed-rate Federal Standard 1016 CELP coder.
The paper is divided into eight sections that discuss various aspects of pitch filter coding methods. In Section II, we briefly discuss well-known pitch parameter estimation techniques for single-tap and multiple-tap configurations. Section III discusses the relative merits and shortcomings of the single-tap and multiple-tap schemes, and presents other coding techniques that have been proposed in the literature. In Section IV, we describe an approach to vector quantization of multiple-tap pitch filter coefficients that is driven by the estimate of the fundamental pitch period, or lag. We call this approach lag-indexed vector quantization (LIVQ) of the pitch parameters. Section V discusses the effectiveness of our VQ technique in the context of a generic CELP architecture in terms of both objective and subjective performance measures. Simulation results are presented for various configurations of LIVQ, and results of paired comparison tests are cited which support the usefulness of the approach. We describe the effect of incorporating LIVQ techniques in the well-known Federal Standard CELP coder (FS-1016) in Section VI, and include results of paired comparison tests. In Section VII, some subtle implementation details of pitch filter coding techniques are discussed along with improvements in post filtering which are available as a result of the multitap configuration. Finally, Section VIII summarizes our results and contrasts these results with recent conclusions regarding efficient pitch filter encoding.
Manuscript received October 6, 1996; revised October 5, 1998. This research was supported in part by the National Science Foundation under Grant NCR-9303805, by the Texas Advanced Technology Program under Project 999903-017, and by a UAB Faculty Research Grant. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Huseyin Abut.
S. McClellan is with the Department of Electrical and Computer Engineering, The University of Alabama at Birmingham, Birmingham, AL 35294-4461 USA (e-mail: [email protected]).
J. D. Gibson is with the Department of Electrical Engineering, Southern Methodist University, Dallas, TX 75275-0338 USA.
B. K. Rutherford is with Southern Computer Systems, Birmingham, AL 35233 USA.
Publisher Item Identifier S 1063-6676(99)00181-9.
II. BACKGROUND
Pitch redundancy, or long-term redundancy, is a characteristic of voiced speech and is related to the excitation of the vocal tract by short bursts of energy from the glottis.
The bursts of energy are due to quasiperiodic vibration of the vocal cords, which results in corresponding changes in air flow and volume velocity in the vocal tract. This energy source, which is modeled as a purely periodic excitation in linear predictive coding (LPC) analysis, contributes a harmonically related “fine structure” to the power spectrum of voiced speech. The parameter set for a pitch predictor is typically composed of the pitch period or fundamental frequency (lag) $M$ and a set of $p$ predictor coefficients, $\{\beta_k\}$, centered around this lag. Parameter estimation techniques for pitch prediction models can be divided into two broad categories: open-loop and closed-loop. The open-loop category pertains to the classical method of analysis which optimizes predictor performance by
1063–6676/99$10.00 © 1999 IEEE
MCCLELLAN et al.: EFFICIENT PITCH FILTER ENCODING
minimizing mean-squared prediction error for a delay which maximizes a correlation-based cost functional. The closed-loop category is a result of research in analysis-by-synthesis speech coding techniques wherein the minimization procedure is concerned with the frequency-weighted difference between the original speech signal and a signal reconstructed from a discrete set of parameters. Often, closed-loop pitch predictors use single-tap filters ($p = 1$) and open-loop predictors use three-tap filters ($p = 3$).
A. Single-Tap Predictors
The classical method for open-loop pitch period estimation [5], [6] is based on minimizing the residual variance for a single-tap prediction filter. This technique is also applicable to closed-loop schemes if suitable modifications are included for unknown residual subsequences.
For a first-order pitch predictor, the input signal $x(n)$ is approximated by the predicted value $\hat{x}(n) = \beta\, x(n-M)$, which is a function of the lag $M$ and the filter coefficient $\beta$. The squared prediction error for a frame of $N$ samples is
$$\epsilon = \sum_{n=0}^{N-1} \left[ x(n) - \beta\, x(n-M) \right]^2. \tag{1}$$
For a given lag $M$, the optimal value of $\beta$, which produces the minimum $\epsilon$, is found from
$$\beta = \frac{\phi(0,M)}{\phi(M,M)} \tag{2}$$
where
$$\phi(i,j) = \sum_{n=0}^{N-1} x(n-i)\, x(n-j)$$
is the autocovariance. Using (2) in (1) leads to $\epsilon = \phi(0,0) - E(M)$, where $E(M)$ is the error function [6] defined by
$$E(M) = \frac{\phi^2(0,M)}{\phi(M,M)}. \tag{3}$$
Maximizing $E(M)$ over a range of lags produces the optimal open-loop lag, which is an estimate of the fundamental glottal frequency. The error function has local maxima at lags corresponding to time-domain harmonics of the fundamental pitch period if the signal is periodic. For nonperiodic signals, $E(M)$ has no harmonically related local maxima [6]. The maxima of $E(M)$ correspond exactly to the maxima of the normalized correlation coefficient [5] defined by
$$\rho(M) = \frac{\phi(0,M)}{\sqrt{\phi(0,0)\,\phi(M,M)}} \tag{4}$$
since $E(M) = \phi(0,0)\,\rho^2(M)$.
Several alternative methods have been proposed for lag estimation which offer decreased computational requirements in exchange for less consistent open-loop results [3], [6], [7].
B. Multiple-Tap Predictors
The general multiple-tap pitch predictor (with $p$ odd) has transfer function
$$P(z) = 1 - \sum_{k=-(p-1)/2}^{(p-1)/2} \beta_k\, z^{-(M+k)}. \tag{5}$$
Since the pitch lag found from (4) is equivalent to the optimal lag for the single-tap case in (3), the minimization can be performed in either order. For the lag $M$ that maximizes the correlation function $\rho(M)$, the coefficients of $P(z)$ are determined by minimizing the residual variance over a frame of speech. The result of this procedure is a covariance-type formulation as in the matrix form $\Phi \boldsymbol{\beta} = \boldsymbol{\varphi}$, or specifically for a three-tap case with $p = 3$ as noted earlier:
$$\begin{bmatrix} \phi(M-1,M-1) & \phi(M-1,M) & \phi(M-1,M+1) \\ \phi(M,M-1) & \phi(M,M) & \phi(M,M+1) \\ \phi(M+1,M-1) & \phi(M+1,M) & \phi(M+1,M+1) \end{bmatrix} \begin{bmatrix} \beta_{-1} \\ \beta_{0} \\ \beta_{1} \end{bmatrix} = \begin{bmatrix} \phi(0,M-1) \\ \phi(0,M) \\ \phi(0,M+1) \end{bmatrix}. \tag{6}$$
For any vector of coefficients $\boldsymbol{\beta}$, the residual variance is $\sigma^2 = \phi(0,0) - 2\boldsymbol{\beta}^T \boldsymbol{\varphi} + \boldsymbol{\beta}^T \Phi \boldsymbol{\beta}$, which has a minimum value given by $\sigma^2_{\min} = \phi(0,0) - \boldsymbol{\varphi}^T \Phi^{-1} \boldsymbol{\varphi}$ when the coefficients are optimal, or $\boldsymbol{\beta} = \Phi^{-1} \boldsymbol{\varphi}$.
III. TRADEOFFS BETWEEN CONFIGURATIONS
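Before the two configurations are compared, a small numerical sketch of the open-loop estimators of Section II may be useful. The Python fragment below is an illustration only, not the authors' implementation. It assumes the covariance definition phi(i, j) = sum over one frame of x(n-i) x(n-j), searches for the lag that maximizes the error function E(M) = phi(0,M)^2 / phi(M,M), and then computes the single-tap coefficient and the three-tap coefficient vector for that lag. The function names, the synthetic periodic test signal, and the use of Cramer's rule for the 3x3 system are all assumptions made for compactness.

```python
import math

def phi(x, i, j, start, n):
    """Covariance term phi(i, j): sum over the frame of x[t-i] * x[t-j]."""
    return sum(x[t - i] * x[t - j] for t in range(start, start + n))

def open_loop_lag(x, start, n, lag_min, lag_max):
    """Lag M in [lag_min, lag_max] maximizing E(M) = phi(0,M)^2 / phi(M,M)."""
    best_lag, best_e = None, -1.0
    for lag in range(lag_min, lag_max + 1):
        energy = phi(x, lag, lag, start, n)
        if energy <= 0.0:
            continue  # E(M) is undefined when the delayed frame is silent
        e = phi(x, 0, lag, start, n) ** 2 / energy
        if e > best_e:
            best_lag, best_e = lag, e
    return best_lag

def single_tap(x, start, n, lag):
    """Optimal single-tap coefficient: phi(0,M) / phi(M,M)."""
    return phi(x, 0, lag, start, n) / phi(x, lag, lag, start, n)

def det3(a):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

def three_tap(x, start, n, lag):
    """Solve the 3x3 covariance equations for taps at M-1, M, M+1 (Cramer's rule)."""
    delays = (lag - 1, lag, lag + 1)
    big_phi = [[phi(x, r, c, start, n) for c in delays] for r in delays]
    rhs = [phi(x, 0, r, start, n) for r in delays]
    d = det3(big_phi)
    betas = []
    for k in range(3):
        m = [row[:] for row in big_phi]
        for r in range(3):
            m[r][k] = rhs[r]      # replace column k by the right-hand side
        betas.append(det3(m) / d)
    return betas

# Hypothetical test signal: two harmonics with a 40-sample period.
PERIOD = 40
signal = [math.sin(2 * math.pi * t / PERIOD) + 0.5 * math.sin(4 * math.pi * t / PERIOD)
          for t in range(400)]
lag = open_loop_lag(signal, start=200, n=160, lag_min=20, lag_max=70)
beta = single_tap(signal, 200, 160, lag)
betas = three_tap(signal, 200, 160, lag)
print(lag, beta, betas)
```

For this exactly periodic input the search range is deliberately limited to lags below twice the period, because lags at integer multiples of the period would tie with the fundamental. Real speech is only quasiperiodic, so the three coefficients generally share the prediction rather than collapsing onto the center tap.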
Pitch lags are often represented by integer values at the sampling rate of the speech signal. This produces an estimate of the pitch frequency which has variable precision in the frequency domain as a function of lag magnitude. The relative importance of the accuracy of the lag estimate versus the fidelity of the filter coefficient representation is dependent on the filter order. In single-tap configurations, a highly accurate (possibly subinteger) lag is crucial to acceptable operation of the pitch filter. In multiple-tap filters, however, a high-quality characterization of the predictor coefficients may be more important than an accurate lag estimate since the filter structure provides a form of lag interpolation.
A. Single-Tap Schemes
In general, closed-loop predictor optimization requires significantly more computation than open-loop optimization, since the reconstructed signal must be completely synthesized using each possible parameter value. For an unconstrained 7-b lag range with 3-b quantization of a single-tap coefficient, the computational cost can be prohibitive, but closed-loop techniques produce significantly better subjective performance. Many CELP speech coders which produce good-quality synthetic speech use single-tap pitch filters where the lag, $M$, and coefficient, $\beta$, are jointly optimized in a closed-loop fashion [6], [8]–[10]. In single-tap configurations, subinteger delay estimates have been found to improve CELP performance even more noticeably than larger stochastic excitation codebooks [6]. However, to achieve good-quality encoding with a single-tap pitch predictor, an accurate
representation of the pitch lag is crucial. This requires rapid updates for the lag estimate, which increases the encoding rate and computational complexity [3], [6]. Two recent contributions to improving lag accuracy and minimizing the required transmission rate for a single-tap filter are fractional lag resolutions [6], [7] and restricted pitch deviation coding [3]. A variation of this strategy is used in FS-1016 with fractional lags. The use of fractional lags in [6] amounts to time-domain interpolation to increase the sampling rate of the input sequence during determination of the optimal lag and (jointly in the closed-loop case) the filter coefficient. With a fractional lag, the interpolation procedure is required during pitch filter optimization and during pitch filter memory subtraction (before searching the stochastic codebook), as well as in the decoder [4], which is computationally costly. This approach to increasing lag accuracy also requires a corresponding increase in bit rate since the upsampling factor and the fractional lag must be transmitted to the receiver. For improved performance, the frequency of the lag updates should also correspond to the update rate for the filter gain to preclude a mismatch between filter and data [3]. This may require additional rate. Restricted pitch deviation coding (RPDC) [3] is an approach to minimizing the number of bits required to represent a subsequence of pitch lags. By taking advantage of the fact that the pitch period varies smoothly and slowly within voiced segments, a form of differential pulse code modulation (DPCM) can be used on the trajectory of pitch lags to increase coding efficiency. RPDC is equally applicable to both single-tap and multiple-tap pitch filter configurations. Since lag accuracy is crucial to good performance in the single-tap case, updating the lag value each time the coefficient is updated is generally preferable [3]. 
With a typical allocation of 7 b per lag update, this can consume a significant portion of the available rate for low-rate speech coders. In RPDC, the average lag for a frame (which can contain many coding subframes) is estimated in an open-loop fashion. Subsequent updates of the single-tap coefficient include a refinement for the lag estimate which is restricted to a range of integer or fractional deviations around the average. In [11], for example, the framewise average lag is encoded with 7 b, and 4 b/subframe are used to indicate one of 16 half-sample deviations around the average. Each subframe has a single-tap pitch coefficient which is encoded with a 3-b Lloyd–Max scalar quantizer. A similar scheme is used in [12]. RPDC is a necessary result of the use of a single-tap pitch filter, since the update rate for the lag must be comparable to the update rate for the coefficient. Multiple-tap pitch filters do not require frequent lag updates. In fact, the conclusion in [3] suggests that there is no significant advantage in updating the lag in a three-tap configuration more often than once per frame.
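As a rough illustration of the RPDC allocation just described (a 7-b frame-average lag, 4 b/subframe selecting one of 16 half-sample deviations, and a 3-b coefficient), the following Python sketch encodes subframe lags as clamped deviation indices. It is not taken from [3] or [11]. The function names are hypothetical, the 16 deviations are assumed to run from -4.0 to +3.5 samples around the average, and four subframes per frame are assumed.

```python
def rpdc_encode(avg_lag, subframe_lags, step=0.5, bits=4):
    """Map each subframe lag to one of 2**bits deviations around avg_lag."""
    levels = 2 ** bits
    half = levels // 2            # assumed layout: indices 0..15 <-> -8..+7 steps
    indices = []
    for lag in subframe_lags:
        idx = round((lag - avg_lag) / step) + half
        indices.append(min(max(idx, 0), levels - 1))  # clamp out-of-range deviations
    return indices

def rpdc_decode(avg_lag, indices, step=0.5, bits=4):
    """Rebuild the subframe lags from the average lag and the deviation indices."""
    half = 2 ** bits // 2
    return [avg_lag + (i - half) * step for i in indices]

# Hypothetical frame: average lag 52 samples, four subframe lag estimates.
indices = rpdc_encode(52, [51.5, 52.0, 53.5, 60.0])
decoded = rpdc_decode(52, indices)

# Bits per frame for the allocation quoted above (four subframes assumed).
rpdc_bits = 7 + 4 * (4 + 3)
print(indices, decoded, rpdc_bits)
```

The last subframe lag (60.0) lies outside the restricted range and is clamped to the largest deviation, which is the price paid for the reduced lag rate when the pitch changes quickly within a frame.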
B. Multiple-Tap Schemes
Multiple-tap pitch filters have some theoretical advantages over their single-tap counterparts. The prediction gain of a whitening filter is defined to be the ratio of the input signal energy to the energy of the prediction residual [13]. Since residual error variance is a monotone decreasing function of predictor order, a pitch filter’s prediction gain increases monotonically with the filter order. The main drawback of multiple-tap filters is the apparent need to encode the parameters with 2–3 b/coefficient [3], [6], [14], whereas the single-tap, closed-loop case requires only 3–4 b to encode the gain [4]. Vector quantization is a logical choice for minimizing coding noise and rate in the multiple-tap case since the coefficients are highly correlated.
In [4], it is suggested that no further gain is available when more than 3–4 b are allocated to coding the coefficient of a one-tap filter. However, using more than 4 b to encode the coefficients of a three-tap filter provides significant gains in segmental signal-to-noise ratio (SNR). In [4], Veeneman and Mazor have encoded the coefficient vector with VQ using more than 6 b/vector. This rate was shown to exhibit greater segmental SNR than the equivalent scalar encoding in the one-tap case, and a three-tap predictor was preferred in listening tests. Similar rates (2–3 b/coefficient) have been referenced in the literature [3], [6], and in [3] the authors experimented with joint VQ of consecutive vectors to reduce the encoding rate. So at moderate rates, a multiple-tap pitch filter with efficiently encoded coefficients has a definite performance gain over its closed-loop, single-tap counterpart, and is preferred in listening tests. Unfortunately, [4] also concludes that using fewer than 4 b to represent the multiple-tap coefficients produces results that are inferior to the single-tap case. Since the multiple-tap representation produces preferable synthesized speech at higher rates and the single-tap configuration requires increased side information to represent the necessary subinteger lags, efficient encoding of the coefficients for the multiple-tap case bears further scrutiny.
In designing quantization schemes for these cases, a critical component of pitch-related information has usually been overlooked. Careful examination will show that some valuable information is contained in the pitch lag which pertains to the characteristics of the pitch filter. Fig. 1 shows normalized histograms of pitch lags for a large collection of utterances. The histograms exhibit clearly nonuniform distributions of lags, which can be effectively exploited in pitch filter encoding schemes. The general nonuniformity of lag distributions has been used in [6] to enumerate noninteger lags for a single-tap predictor. By using the distribution of lags to influence the training and bit allocation for several multiple-tap codebooks as opposed to specifically placing fixed noninteger lags, we can achieve both the additional prediction gain available with a higher-order filter and the inherent noninteger lag resolution of the multiple-tap configuration. Further, the requirement for a single codebook to contain enough entries to reproduce filter parameters for widely different pitch lags is not reasonable. This imposes a constraint on the design process that increases the size of the codebook in order to maintain a suitable density of representative codebook vectors for each lag (or neighborhood of lags).
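The density argument can be made concrete with a little arithmetic. The sketch below counts the average number of code vectors available per integer lag, first for one codebook shared by all lags and then for several lag-specific codebooks that are each indexed with only 3 b. The helper name `vectors_per_lag` is hypothetical, and the figure of 128 integer lags is an assumption that corresponds to a 7-b lag range. The calculation only counts code vectors and says nothing about the resulting distortion.

```python
def vectors_per_lag(num_codebooks, bits_per_vector, num_lags=128):
    """Average number of code vectors per integer lag.

    num_lags=128 is an assumption (a 7-b integer lag range).
    """
    return num_codebooks * (2 ** bits_per_vector) / num_lags

# One codebook shared by every lag.
single_4b = vectors_per_lag(1, 4)   # 16 vectors spread over 128 lags
single_7b = vectors_per_lag(1, 7)   # 128 vectors spread over 128 lags

# Several lag-specific codebooks; the transmitted index stays at 3 b because
# the lag, which is sent anyway, identifies the codebook.
split_3b = [vectors_per_lag(k, 3) for k in (1, 2, 4, 8, 16)]
print(single_4b, single_7b, split_3b)
```

Under these assumptions, sixteen 3-b codebooks reach the same average density as a single 7-b codebook while each coefficient vector still costs 3 b.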
IV. LAG-INDEXED VECTOR QUANTIZATION
Motivated by the observations of Section III, we have designed independent multiple-tap codebooks which use training
Fig. 1. Normalized histogram of pitch lags for a large training set.
data derived only from voiced speech having particular ranges of lags [15], [16]. This minimizes the distortion for pitch filters having specific lags instead of for all possible lags. In this fashion, fewer representations in each VQ codebook are sufficient to produce an acceptable code vector density per lag and an acceptable level of average distortion. Additionally, the filter order and bit allocation can be tailored to each subset of lags, producing a variable-rate, variable-dimension scheme that is driven by the estimated pitch lag. The procedure described here is LIVQ of the pitch filter coefficients. The coefficient selection (codebook search) can be optimized in an open-loop or closed-loop fashion, and the error minimization can be performed jointly with the lag estimation.
V. EXPERIMENTAL RESULTS
To illustrate the effectiveness of the approach, we designed several sets of LIVQ codebooks, with each set having a different number of independent codebooks and all codebooks in a set covering equal, nonoverlapping lag ranges. We denote each of these configurations “uniform” LIVQ, since it conforms with the assumption of a uniform distribution for lags. These LIVQ codebooks were trained using a large training set derived from clean speech recorded by several speakers. Each pitch filter vector in the training set was computed in an open-loop fashion from the LPC residual for the corresponding segment of speech. The pitch lag which maximized the short-term correlation [as in (4)] was used for computing each optimal coefficient vector in the training set, and this lag was associated with the vector during training.
Using these several sets of uniform LIVQ codebooks, we computed the mean square error (MSE) for a large number of pitch vectors outside the training set. The average MSE and spectral envelope distortion (SD) for the several LIVQ sets are shown graphically in Fig. 2. The 3-b codebooks used for this encoding were trained based on the uniform lag distributions shown in Table I using three-tap pitch filter vectors. It is worth noting that the horizontal axis of Fig. 2 has units of codebook vectors per lag since the number of LIVQ codebooks can be expressed equivalently as a density. The effect of the multiple codebook configuration of LIVQ is to increase the number of representative code vectors available for encoding pitch coefficients for each lag without increasing the rate.
As shown in Fig. 2, both the MSE and the SD generally decrease as the average density of codebook vectors around each lag increases. It is observed that the distortion rises slightly as the code vector density per lag approaches unity. This effect is a result of the training set having a fixed size. Increasing the number of codebooks to increase the code vector density per lag results in fewer training vectors per codebook and poorer quality codebooks, which is known as undertraining in the VQ literature. Note that the encoding rate remains constant as the code vector density increases due to the increasing number of independent, nonoverlapping LIVQ codebooks.
Based on the results shown in Fig. 2 and the explicitly nonuniform lag characteristics of Fig. 1, we have designed another set of four nonuniform-lag-range VQ codebooks for each filter order. The codebooks were designed for the lag ranges and bit allocations given in Table II to exploit the information contained in the distribution of lags. We denote this scheme LIVQ-VR, since it also has a variable encoding rate. Note that the LIVQ rate has a maximum value of 4 b/vector, the threshold below which the results obtained in [4] were found to be inferior to the single-tap configuration.
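The lag-indexed search described above can be sketched in a few lines of Python. This is a minimal illustration of the "uniform" variant with an open-loop squared-error search, not the authors' code. The lag range 17–145 is taken from the full-range codebooks mentioned in the objective results, while the function names and the tiny 1-b codebooks are made up for the example (trained codebooks would hold 8 or 16 vectors each).

```python
def codebook_index(lag, num_codebooks, lag_min=17, lag_max=145):
    """Pick the codebook whose lag range contains `lag` (equal-width ranges)."""
    if not lag_min <= lag <= lag_max:
        raise ValueError("lag outside the coded range")
    span = lag_max - lag_min + 1
    width = -(-span // num_codebooks)      # ceiling division
    return (lag - lag_min) // width

def livq_encode(lag, coeffs, codebooks):
    """Open-loop search: index of the nearest code vector in the lag's codebook."""
    book = codebooks[codebook_index(lag, len(codebooks))]

    def sq_err(vec):
        return sum((c - v) ** 2 for c, v in zip(coeffs, vec))

    return min(range(len(book)), key=lambda i: sq_err(book[i]))

def livq_decode(lag, index, codebooks):
    """The decoder re-derives the codebook from the lag it has already received."""
    return codebooks[codebook_index(lag, len(codebooks))][index]

# Hypothetical three-tap codebooks (values invented for illustration).
TOY_CODEBOOKS = [
    [(0.0, 0.9, 0.0), (0.3, 0.4, 0.3)],      # used for lags 17..81
    [(0.1, 0.7, 0.1), (0.25, 0.45, 0.25)],   # used for lags 82..145
]
idx_short = livq_encode(40, (0.1, 0.8, 0.1), TOY_CODEBOOKS)
idx_long = livq_encode(120, (0.3, 0.5, 0.2), TOY_CODEBOOKS)
print(idx_short, idx_long, livq_decode(120, idx_long, TOY_CODEBOOKS))
```

Only the index within the selected codebook is transmitted. The choice of codebook costs no extra bits because it is implied by the lag, which is why the code vector density per lag can grow while the coefficient rate stays fixed. A nonuniform scheme in the style of LIVQ-VR would replace `codebook_index` with a lookup into unequal lag ranges.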
Fig. 2. Distortion as a function of the number of codebook entries per lag (3-b, uniform-lag LIVQ codebooks).
TABLE I CODEBOOKS FOR 3-b UNIFORM-LAG CONFIGURATIONS
TABLE II LAG RANGES AND CODEBOOK SIZES FOR NONUNIFORM LIVQ-VR
TABLE III UPDATE RATES AND ALLOCATIONS (20 MS FRAME, FOUR EXCITATION SUBFRAMES)
A. CELP Configuration
The performance of these nonuniform-lag codebooks versus a single-tap, closed-loop RPDC scheme was compared in the context of a CELP coder with an 8-b stochastic codebook. In the comparison, the excitation gain and short-term spectral parameters were not quantized, and the search through the pitch filter VQ codebooks was conducted in either an open-loop or closed-loop fashion. For each type of codebook search, the initial open-loop lag estimate was taken from either the original speech or the LPC residual.
In the spirit of the conclusion in [3], we update the multiple-tap filter coefficients twice as often as the lag (every 5 ms for the coefficients versus every 10 ms for the lag). So, for each LPC computation of 20 ms/frame, we have two lag updates and four pitch coefficient updates, as well as four searches through the excitation codebook. The rate (maximum 30 b/frame) required by the pitch filter updates for this LIVQ configuration is slightly less than the 35 b/frame required by the RPDC scheme, which uses 4 b/lag refinement and a single-tap, closed-loop filter quantized at 4 b/update [12], [17].
B. Objective Results
Experimental results of three-tap, five-tap, and seven-tap LIVQ for a female speaker are shown in Tables IV and V. The objective measures are SNR, segmental SNR (SNRSEG), and frequency-weighted SNR (SNRfw). In addition to LIVQ performance, results for 4- and 7-b codebooks, denoted as VQ4 and VQ7, which span the entire range of lags (17–145 samples), are included. The VQ4 and VQ7 configurations could be considered as a special case of LIVQ having a single subcodebook. Updates using the 7-b codebook were performed simultaneously with the lag updates (every 5 ms, as in the 4-b VQ and LIVQ cases), and so this procedure requires 42
TABLE IV RESULTS WITH M0 ESTIMATED FROM INPUT SPEECH (FEMALE SPEAKER)
TABLE V RESULTS WITH M0 ESTIMATED FROM LPC RESIDUAL (FEMALE SPEAKER)
b/frame. Results for a version of VQ7 which has a coding rate similar to LIVQ and VQ4 are tabulated under VQ7s. In VQ7s, lag updates and coefficient updates were performed synchronously every 10 ms. Rates and allocations for each system are shown in Table III. Note from Table IV that the three-tap, open-loop LIVQ-VR has objective performance equivalent to VQ7 but requires only half of the VQ7 rate for the coefficients. Also note from the open-loop configuration that the segmental SNR of a three-tap LIVQ-VR is the same as that of VQ7s, and that it requires approximately the same number of b/frame at a much lower computational cost. In particular, the search covers only 8–16 vectors per LIVQ codebook as opposed to 128 for VQ7s. In the seven-tap, open-loop case, LIVQ-VR objective performance is slightly better than all other schemes considered here. In the closed-loop case, three-tap and five-tap LIVQ-VR are uniformly better than both VQ4 and VQ7s, which are the only multitap schemes with similar rate performance. The objective
results and conclusions for the case of lag estimation from the LPC residual (Table V) are very similar to those obtained when the input speech sequence is used to estimate the open-loop lag.
C. Subjective Results
Objective performance measures are often misleading in speech processing research. In assessing the performance of complex and nonstandard speech coders, subjective measures, such as paired comparison tests, are sometimes easier to interpret than objective measures. In the case of variable-rate systems, subjective tests tend to be quite important because the distortion produced by the coder is often quite low in perceptually important segments, but extremely high in perceptually insignificant segments.
Informal listening tests of speech processed using the various methods of LIVQ indicated a general preference for the three-tap, open-loop configuration in the case where the lag was estimated from the input speech, and the five-tap, closed-loop configuration in the case where the lag was estimated from the LPC residual. Overall, the three-tap open-loop LIVQ configuration with lag estimated from the input speech seemed to be clearer than the other configurations. We also found a clear objective and subjective preference toward a 10 ms analysis window for coefficient estimation, versus greater than 20 ms for lag estimation.
TABLE VI LIVQ-VR PREFERENCES FROM PAIRED COMPARISONS IN CELP
Specific results of paired comparisons between LIVQ, RPDC, and full-lag-range VQ for 4- and 7-b codebooks are shown in Table VI. In compiling the results for Table VI, frequency-weighted distortion was used during closed-loop optimization of the 4- and 7-b codebooks, whereas the LIVQ-VR coefficient selection was performed in an open-loop fashion.¹ In most cases, the LIVQ-VR speech quality was at least comparable to the quality of the speech reconstructed using a 7-b codebook with 5 ms updates, and slightly better than that of 4- or 7-b full-range codebooks with 5 and 10 ms updates, respectively. In comparison, the average rate for variable-rate LIVQ-VR encoding is 3.6 b/vector for the female speaker, and 3.06 b/vector for the male speaker.² For 20 ms frames with four coefficient updates per frame and lag updates twice per frame, this translates to 1420 bps (female) or 1312 bps (male).
Note that the bit allocation for the LIVQ-VR codebooks was designed specifically to address the nonuniform coverage of filter taps in the frequency domain, which is known to be problematic for female speakers. High-pitched voices correspond to short lag ranges where the integer sampling rate of the time-domain sequence results in poor frequency precision. The subjective results summarized in Table VI support this objective since the LIVQ-VR performance at an average rate below 4 b/update is slightly better than VQ7 for the female speaker, and significantly better than the other schemes. The performance of LIVQ-VR for male speakers was designed to be roughly equivalent to VQ7.
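The quoted pitch-parameter rates follow directly from the update schedule. The short Python check below reproduces them under one assumption that the text implies but does not state at this point: each lag update costs 7 b. The function names are hypothetical.

```python
def pitch_bits_per_frame(bits_per_vector, lag_bits=7, lag_updates=2, coeff_updates=4):
    """Bits spent on pitch parameters in one frame (7-b lags are an assumption)."""
    return lag_updates * lag_bits + coeff_updates * bits_per_vector

def pitch_rate_bps(bits_per_vector, frame_ms=20):
    """Pitch-parameter rate in bits per second for the given frame length."""
    return pitch_bits_per_frame(bits_per_vector) * 1000 / frame_ms

max_bits = pitch_bits_per_frame(4)   # every vector coded at the 4-b maximum
female_bps = pitch_rate_bps(3.6)     # average of 3.6 b/vector
male_bps = pitch_rate_bps(3.06)      # average of 3.06 b/vector
print(max_bits, round(female_bps), round(male_bps))
```

With these inputs the sketch gives the 30 b/frame maximum and the 1420 and 1312 bps averages quoted for LIVQ-VR, which supports the assumed 7-b lag allocation.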
The subjective results of Table VI support this objective for the male speaker since the preference is clearly dependent on the position of the pairs under comparison. Listeners in this test were not allowed to declare a tie between methods and so they tended to select the most recently heard utterance of the pair.
¹ Note that in all subjective testing described here, coded utterances were presented to more than 20 untrained volunteer listeners in random order according to “informal” procedures described in the literature [18], [19].
² Rate computations for utterances encoded with LIVQ-VR are source-dependent since the various LIVQ codebooks in the set contain different numbers of codevectors. The rates quoted here depend on the number of times each codebook was used during encoding, and are averaged over the duration of the utterance.
TABLE VII LAG RANGES AND CODEBOOK SIZES FOR LIVQ-FR
D. Improved LIVQ with Fixed Rate
The performance of LIVQ in a fully-coded CELP configuration can be improved by the use of more codebooks. This technique can also be used to lower the average encoding rate. The lags and allocations for an improved, fixed-rate LIVQ configuration (LIVQ-FR) are shown in Table VII. In addition to increasing the number of distinct codebooks over LIVQ-VR, LIVQ-FR focuses on improving the representation of pitch filters corresponding to large lags as are often found in male speakers. Training data for the coefficient vectors was taken from overlapping subsets of lags to reduce any boundary mismatch between the disjoint lag ranges used in the coding procedure. In addition to attempting to improve fidelity, LIVQ-FR was designed to a fixed goal of 1 b/coefficient, i.e., 3 b/update with a three-tap filter.
Evidence of the improved allocation of pitch filter coefficient rate can be seen from the histograms of LIVQ codebook hits for coded speech in Figs. 3 and 4. In these figures, speech data for the five “Harvard” sentences listed in Table VIII was processed in a CELP configuration using the codebooks described in Table VII. The histogram for each of the several codebooks is shown covering the approximate range of lags for which it was used in encoding, where the height of each bar is the number of times the LIVQ codebook was used in encoding the utterance. The improved resolution of additional LIVQ codebooks is clearly seen for the female speakers around lags of 40 samples. The allocation for male speakers shows good distribution between lags of 50–100 samples, but could possibly be improved, especially for deep voices, by designing extra codebooks for
Fig. 3. Distribution of LIVQ-FR codebook hits (female).
Fig. 4. Distribution of LIVQ-FR codebook hits (male).
TABLE VIII HARVARD SENTENCES
lags around 100. Since the histogram bins at large lags contain many “hits,” the coding efficiency could be improved by using more codebooks there, which results in a finer distribution of
lag ranges and narrower bins. The perceptual improvement with LIVQ-FR versus LIVQ-VR was subtle for the two female speakers, but significant for the male speakers. Particular improvement was noted for sentence 3 spoken by a male speaker with a very deep voice, which seemed to have a much smoother perceptual quality with the LIVQ-FR configuration as a result of the increased density of code vectors per lag for large lags. Interestingly, even when the distribution of lags as shown in Fig. 1 is uniform, the LIVQ approach provides some improvement over single codebook VQ techniques because of the greatly increased density of codebook vectors per lag (see Fig. 2). This localization of the distortion measure during the
TABLE IX SUBJECTIVE PREFERENCE OF FS-1016 WITH LIVQ-FR VERSUS “NATIVE” FS-1016
design and coding process of LIVQ is primarily responsible for the increase in qualitative performance without increased rate.
VI. PERFORMANCE OF FS-1016 WITH LIVQ
To further examine the performance of the LIVQ architecture, we implemented an LIVQ-based codebook search in the FS-1016 CELP coder [16] and compared the subjective performance of FS-1016 versus this modified FS-1016 architecture (FS-LIVQ). For the study, we implemented LIVQ-FR codebooks using lag ranges as in Table VII with 3 b/vector as well as 5 b/vector, corresponding to overall coding rates of 4000 and 4800 b/s, respectively. The results of these subjective comparisons on Harvard sentences are shown in Table IX, where the percentage of listeners preferring FS-LIVQ over the standard FS-1016 is given. It is worth noting that at the 4800 b/s encoding rate, and with all else being equal, the FS-LIVQ coder is overwhelmingly preferable. Additionally, with 3 b per LIVQ code vector for an encoding rate of 4000 b/s, the subjective quality of FS-LIVQ is roughly equivalent to “native” FS-1016 at 4800 b/s. This comparison is significant in demonstrating that a straightforward enhancement to the FS-1016 pitch subsystem can produce significant quality improvements at the same rate, or equivalent quality at a rate roughly 17% lower than the “native” rate (4000 versus 4800 b/s). This comparison is also significant in that the FS-1016 architecture utilizes noninteger pitch lags to optimize the performance of its closed-loop, single-tap pitch predictor and a form of DPCM on the lags to optimize the rate required by synchronous updates of the lag and coefficient. Operating at the same (or lower) coding rate, the FS-LIVQ coder uses open-loop lag estimation and coefficient searches, and so has reduced computational requirements. Also, FS-LIVQ does not require smoothing or DPCM of the lag trajectory, nor does FS-LIVQ require “upsampling” to achieve subinteger lag resolution.
VII. IMPLEMENTATION DETAILS
The open-loop pitch filter is straightforward and needs no further description. The closed-loop optimization, however, can be implemented in several different ways to conserve computation and encoding rate.
In theory, the unconstrained closed loop optimization procedure uses a finite set of lag candidates and seeks the filter parameter(s) which minimize the reconstruction error for each candidate lag. RPDC can be viewed as a special case here, in that the set of lag candidates for RPDC is dependent on an external parameter, the open loop lag estimate. The use of subinteger lags is also a special case wherein the candidate set contains an integer quantity of real numbers. In the usual case of integer lags, the size of the candidate set is such that 7 bits are required to specify a particular lag.

For each candidate lag, the aim is to minimize the resulting reconstruction error. So, the synthetic speech corresponding to each possible quantization of the optimal filter coefficient(s) must be generated. The open loop pitch residual for the unquantized coefficient vector can serve as an “estimate” for the residual of the quantized coefficient vector. The unquantized residuals are, in general, different for each pitch lag since the unquantized coefficient vector is a function of the lag [as in (2)]. Given the “estimated” residual derived using the unquantized coefficients, the index of the quantized coefficient vector which minimizes the reconstruction error for a particular lag is retained. The index and lag that minimize the error over all candidate lags are used in subsequent coding and are transmitted to the receiver.

A. Residual Estimation

In closed-loop optimization of CELP coders, better results are obtained when the pitch lag is greater than the dimension of the excitation codebook (subframe). In this case, the pitch filter reconstruction from previous subframes is available for use in the minimization procedure and no residual estimation is required. For lags that are less than the frame size, estimation of the pitch filter output is required. In some schemes, the previous LPC residual (past subframe output) is replicated in place of the unknown residual for the current subframe [8], [20]. This precludes the use of efficient recursive solutions for the filter coefficients in these subframes.
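The joint search over candidate lags and quantized coefficient vectors, together with the replication of past samples for short lags, can be sketched as follows. This is a minimal illustration under stated assumptions, not the coder's implementation: the names `delayed_segment` and `closed_loop_search` are hypothetical, a three-tap predictor with offsets (-1, 0, 1) is assumed, and the error is a plain squared error with no synthesis or perceptual weighting filter.

```python
import numpy as np

def delayed_segment(history, lag, n):
    """Return n samples of `history` delayed by `lag`.

    When lag < n, the most recent `lag` samples are repeated, which is the
    replication strategy for lags shorter than the subframe.
    """
    start = len(history) - lag
    return history[start + (np.arange(n) % lag)]

def closed_loop_search(target, history, lag_candidates, codebook, taps=(-1, 0, 1)):
    """Exhaustive search over candidate lags and quantized coefficient vectors.

    Assumption: the error is the squared difference between `target` and the
    multitap prediction; no synthesis or weighting filter is applied.
    Returns (best_lag, best_index, best_error).
    """
    n = len(target)
    best = (None, None, np.inf)
    for lag in lag_candidates:
        # Columns hold the history delayed by lag + k for each tap offset k.
        basis = np.stack([delayed_segment(history, lag + k, n) for k in taps], axis=1)
        predictions = codebook @ basis.T            # one row per code vector
        errors = np.sum((target - predictions) ** 2, axis=1)
        index = int(np.argmin(errors))              # best index for this lag
        if errors[index] < best[2]:
            best = (lag, index, float(errors[index]))
    return best

# Hypothetical data: a signal with period 40 and a 60-sample subframe.
rng = np.random.default_rng(0)
period = rng.standard_normal(40)
history = np.tile(period, 4)
target = 0.5 * np.tile(period, 2)[:60]
codebook = np.array([[0.0, 0.5, 0.0], [0.0, 0.9, 0.0], [0.2, 0.2, 0.2]])
print(closed_loop_search(target, history, range(38, 43), codebook))
```

In a lag-indexed scheme, `codebook` would be chosen according to the lag range being searched, so each inner search runs over a small subcodebook rather than a single large one.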
Other schemes achieve better computational efficiency by using the open-loop LPC residual as an estimate of the pitch filter output [4], [8]. This entails mistracking between the analysis and synthesis phases of encoding. Several alternative techniques have been used in the literature. Notably, the procedure used in QCELP [8] assumes that the corresponding pitch residual has zero amplitude, and hence the choice of optimal closed-loop coefficient is based on correlations between the zero-input response of the reconstruction filters (pitch and formant) and the frequency-weighted input speech. Specifically, QCELP synthesizes using a zero-amplitude codebook vector, a unity-gain single-tap pitch filter, and the closed-loop pitch filter outputs as the filter history. This zero-input response is then used to generate correlations for the joint lag-coefficient optimization process previously discussed. This procedure is an approximation based on the assumption that there is no correlation between the current pitch residual and the pitch filter history. This amounts to assuming either that the subset AR modeling is very accurate, so that all redundancy can be removed by the quantized filter and the resulting pitch residual is white, or that the nature of the pitch periodicity has changed significantly.
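A zero-input response of this kind can be sketched as below. The sketch only follows the verbal description above and is not the QCELP implementation: the function names are hypothetical, frequency weighting is omitted, the formant filter is assumed to be all-pole with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, and the gain is taken as the ordinary least-squares ratio of correlations.

```python
import numpy as np

def zero_input_response(lpc, pitch_history, formant_history, lag, n):
    """Zero-input response of a unity-gain single-tap pitch filter followed
    by an all-pole formant filter 1/A(z).

    lpc             : a_1..a_p, with A(z) = 1 + sum_k a_k z^-k (assumed convention)
    pitch_history   : past pitch filter outputs, at least `lag` samples
    formant_history : past formant filter outputs, at least p samples
    """
    exc = list(pitch_history)
    out = list(formant_history)
    p = len(lpc)
    for _ in range(n):
        e = exc[-lag]                       # zero codebook input: e[n] = e[n - lag]
        exc.append(e)
        y = e - sum(lpc[k] * out[-1 - k] for k in range(p))
        out.append(y)
    return np.array(out[-n:])

def gain_from_correlation(target, zir):
    """Least-squares gain matching `zir` to `target` (assumed criterion)."""
    energy = float(zir @ zir)
    return float(target @ zir) / energy if energy > 0.0 else 0.0

# Hypothetical example: one-pole formant filter, lag of 4 samples.
zir = zero_input_response([-0.9], [1.0, 0.0, 0.0, 0.0], [0.0], lag=4, n=8)
print(zir)
print(gain_from_correlation(2.0 * zir, zir))
```

Repeating the computation for each candidate lag yields the correlations used in the joint lag-coefficient search.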
In the CELP configuration used here, we adopt an approach which differs from that of QCELP. We rely on the open loop lag and coefficient estimates to produce an unquantized residual which is approximately valid for lags confined to a small range. This produces a slight suboptimality since the unquantized coefficients and the corresponding residual are not recomputed separately for each lag. This procedure is compatible with many schemes, including RPDC.

B. Pitch Harmonics

Another implementation detail used here in the closed-loop optimizations is the additional checking of integer multiples or submultiples of the open-loop lag. For lags which are smaller than half the range, the first harmonic lag and several adjacent lags are included in the minimization procedure. For lags which are greater than half the range, the first subharmonic lag neighborhood is included. Thus, for RPDC lag updates, 1 b specifies the use of the open loop lag or its harmonic, and 3 b specify the lag offset.

C. Filter Stability

Filter stability is an issue that can produce significant problems in multiple-tap pitch filters. Due to the covariance-type formulation of the multiple-tap minimization procedure [5], the stability of the optimal filter is not guaranteed. Ramachandran and Kabal [21] solved this problem elegantly for the three-tap case by proposing a simplified stability test and accompanying stabilization procedure which maintains the spectral characteristics and minimizes the loss of prediction gain in the stabilized filter. Unfortunately, although stability tests exist for arbitrary filter orders, the stabilization procedure in [21] cannot be directly generalized for higher-order pitch filters. To address this problem in the case of LIVQ, the filter stability criterion can be enforced during the gathering of the training set and during the training of each multitap codebook.
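A stability criterion of this kind can be checked numerically. The sketch below does not reproduce the simplified test of [21]; it assumes a generic root-based test on the synthesis filter 1 / (1 - sum_k b_k z^-(L+k)), where the sign convention, the tap offsets (-1, 0, 1), and the names `is_stable` and `keep_stable` are assumptions made for illustration.

```python
import numpy as np

def is_stable(coeffs, lag, taps=(-1, 0, 1)):
    """True if 1 / (1 - sum_k b_k z^-(lag+k)) has all poles inside the unit circle."""
    order = lag + max(taps)
    poly = np.zeros(order + 1)
    poly[0] = 1.0                        # coefficient of z^order
    for b, k in zip(coeffs, taps):
        poly[lag + k] -= b               # coefficient of z^(order - lag - k)
    return bool(np.all(np.abs(np.roots(poly)) < 1.0))

def keep_stable(vectors, lags):
    """Retain only vectors that give a stable filter for every lag in `lags`."""
    lags = list(lags)
    return [v for v in vectors if all(is_stable(v, lag) for lag in lags)]

# Hypothetical candidate vectors checked over a small lag range.
candidates = [[0.0, 0.5, 0.0], [0.0, 1.2, 0.0], [0.4, 0.4, 0.4], [0.3, 0.3, 0.3]]
print(keep_stable(candidates, range(38, 43)))
```

Applying the same filter to training vectors and to trained centroids means the check is paid once, offline, rather than for every transmitted coefficient set.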
The training process based on the generalized Lloyd algorithm [22] may still produce unstable filters, since the training vectors are averaged together in each Voronoi cell to produce the centroid. To guarantee that the codebook contains only stable pitch filter vectors, we train a codebook that is larger than the target bit rate requires and retain only a subset of stable vectors. Since filter stability is guaranteed, there is no requirement for computationally expensive stability tests for each transmitted set of filter coefficients.

D. Postfiltering

In forward-adaptive coders such as CELP, noticeable perceptual improvement is available through the use of a postfilter. Adaptive postfiltering is used in many national and international speech coding standards, including FS-1016 (CELP), IS-54 (VSELP), IS-95 (QCELP), and CCITT G.728 (LD-CELP) [1]. Postfilters can be divided into two sections: a long-term postfilter that emphasizes pitch harmonic peaks and attenuates spectral nulls between harmonics, and a short-term postfilter which emphasizes formant peaks and attenuates spectral valleys between formants [23].
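For orientation, a postfilter of this structure is commonly written as a cascade. The symbols below are assumptions chosen for illustration and are not recovered from the paper; the single-tap long-term section follows the general pole-zero form used in [23], and the three-tap all-pole variant is one plausible way of replacing the single lag term with a multitap predictor.

```latex
% Assumed notation:
%   H_l(z): long-term (pitch) section,  H_s(z): short-term (formant) section,
%   H_t(z): first-order spectral tilt filter,  G: overall scaling factor.
H(z) = G \, H_l(z) \, H_s(z) \, H_t(z)

% Single-tap long-term section with pitch lag p, weights \gamma and \lambda,
% and a gain control factor C_g that is separate from G:
H_l(z) = C_g \, \frac{1 + \gamma z^{-p}}{1 - \lambda z^{-p}}

% All-pole portion only, with the single lag term replaced by a three-tap
% predictor b_{-1}, b_0, b_1 (a hypothetical reading, not the paper's equation):
H_l(z) = \frac{C_g}{1 - \lambda \sum_{k=-1}^{1} b_k \, z^{-(p+k)}}
```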
The most general overall postfilter is a cascade: its combined transfer function is the product of the transfer functions for the long-term (pitch) and short-term (formant) postfilter sections, a first-order spectral tilt (brightening) filter, and a scaling factor that compensates for any gain due to postfiltering. The transfer function for a single-tap long-term postfilter, (7), contains a gain control factor that is separate from this overall scaling factor.

The long-term postfilter in the CELP coder discussed herein can use a single tap as in (7). However, this configuration does not make use of the additional information available due to LIVQ. Specifically, the multitap LIVQ configuration contains a small amount of spectral envelope information in addition to the comb filter spacings which are established by the pitch lag. Fig. 5 shows the spectrum of single-tap and multiple-tap pitch filters, where the gains have been offset for clarity. The single-tap filter has a uniform spectral envelope since the pitch lag is the only information used in constructing this filter. The multitap pitch filter contains some subtle spectral envelope information not completely characterized by the formant filter, which has a relatively small, fixed model order (usually ten). By using LIVQ, this additional spectral shaping can be transmitted at the same rate as the single-tap pitch filter parameters, and then used in the transfer function of (7) by replacing the single lag/gain term with a pitch predictor of appropriate order. The resulting transfer function, (8), is written for only the all-pole portion of the filter and a third-order pitch filter encoding for simplicity. The adaptive scaling factor can be computed according to the procedure outlined in [23] for the single-tap case.

Subjective results gathered after processing speech with our CELP/LIVQ coder reveal a significant preference for the postfiltered speech. The formant postfilter removes a “graininess” in the encoded speech which is subjectively annoying, and the single-tap pitch postfilter makes an audible improvement in the “roughness” of the speech. The use of the LIVQ multiple tap information in the pitch postfilter serves to reinforce the subjective preference for the postfiltered speech, making the coded speech somewhat “smoother” and without the background roughness present in the single-tap case.

VIII. CONCLUSIONS

Our results here support the conclusion reached in [4] that multiple-tap pitch filters are subjectively preferable. In addition to improved subjective quality, our multitap LIVQ scheme achieves a better, and possibly variable, encoding rate and better objective results than a corresponding RPDC scheme.
Fig. 5. Spectra of single-tap and multiple-tap pitch filters.
The LIVQ scheme provides objective and subjective results equivalent to a single 7-b, full-lag-range VQ searched at the same rate and requiring 50% more b/frame. Additionally, the computational complexity of LIVQ is reduced from the 7-b case by a factor of at least eight due to the small sizes of the lag-indexed subcodebooks. The tailoring of the LIVQ coefficients to particular ranges of lags also seems to improve the effectiveness of the long-term postfilter. Further, we demonstrate that the incorporation of LIVQ techniques into the existing FS-1016 architecture (single-tap, noninteger lags) can significantly improve the subjective quality of this well-known, standardized coder at the same rate, or produce an equivalent-quality coder at a 20% lower rate.
REFERENCES
[1] R. Cox, “Speech coding standards,” in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds. Amsterdam, The Netherlands: Elsevier, 1995, pp. 49–78.
[2] A. Das, E. Paksoy, and A. Gersho, “Multimode and variable-rate coding of speech,” in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds. Amsterdam, The Netherlands: Elsevier, 1995, pp. 257–288.
[3] M. Yong and A. Gersho, “Efficient encoding of the long-term predictor in vector excitation coders,” in Advances in Speech Coding, B. S. Atal, V. Cuperman, and A. Gersho, Eds. Boston, MA: Kluwer, 1991, pp. 329–338.
[4] D. Veeneman and B. Mazor, “Efficient multitap pitch prediction for stochastic coding,” in Speech and Audio Coding for Wireless Networks, B. S. Atal, V. Cuperman, and A. Gersho, Eds. Boston, MA: Kluwer, 1993, pp. 225–229.
[5] B. S. Atal and M. R. Schroeder, “Adaptive predictive coding of speech,” Bell Syst. Tech. J., vol. 49, pp. 229–242, Oct. 1970.
[6] P. Kroon and B. S. Atal, “On improving the performance of pitch predictors in speech coding systems,” in Advances in Speech Coding, B. S. Atal, V. Cuperman, and A. Gersho, Eds. Boston, MA: Kluwer, 1991, pp. 321–327.
[7] J. Marques, I. Trancoso, J. Tribolet, and L. Almeida, “Improved pitch prediction with fractional delays in CELP coding,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Albuquerque, NM, Apr. 1990, pp. 665–668.
[8] W. Gardner, P. Jacobs, and C. Lee, “QCELP: A variable rate speech coder for CDMA digital cellular,” in Speech and Audio Coding for Wireless Networks, B. S. Atal, V. Cuperman, and A. Gersho, Eds. Boston, MA: Kluwer, 1993, pp. 85–92.
[9] National Commun. Syst., Off. Technol. Stds., “Telecommunications: Analog to digital conversion of radio voice by 4800 bit/s code excited linear prediction (CELP),” Fed. Std. 1016, 1991.
[10] I. Gerson and M. Jasiuk, “Vector sum excited linear prediction (VSELP) speech coding at 8 kb/s,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Albuquerque, NM, Apr. 1990, pp. 461–464.
[11] S. Wang and A. Gersho, “Improved phonetically-segmented vector excitation coding at 3.4 kbps,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, San Francisco, CA, Mar. 1992, pp. I-349–I-352.
[12] E. Paksoy, K. Srinivasan, and A. Gersho, “Variable rate speech coding with phonetic segmentation,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Minneapolis, MN, Apr. 1993, pp. II-155–II-158.
[13] N. S. Jayant and P. Noll, Digital Coding of Waveforms. Englewood Cliffs, NJ: Prentice-Hall, 1984.
[14] B. S. Atal, “Predictive coding of speech at low bit rates,” IEEE Trans. Commun., vol. COMM-30, pp. 600–614, Apr. 1982.
[15] S. McClellan and J. Gibson, “Lag-indexed VQ for pitch filter coding,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Atlanta, GA, May 1996.
[16] K. Rutherford, S. McClellan, and R. Adhami, “Improving the performance of Federal Standard 1016 (CELP),” in Proc. IEEE Southeast Conf., Tampa Bay, FL, Apr. 1996, pp. 216–219.
[17] S. Wang and A. Gersho, “Phonetically-based vector excitation coding of speech at 3.6 kbps,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Glasgow, U.K., May 1989, pp. 49–52.
[18] S. Dimolitsas, “Subjective assessment methods for the measurement of digital speech coder quality,” in Speech and Audio Coding for Wireless Networks, B. S. Atal, V. Cuperman, and A. Gersho, Eds. Boston, MA: Kluwer, 1993, pp. 43–53.
[19] N. Kitawaki and H. Nagabuchi, “Quality assessment of speech coding and speech synthesis systems,” IEEE Commun. Mag., pp. 36–44, Oct. 1988.
[20] J. Campbell, V. Welch, and T. Tremain, “The new 4800 bps voice coding standard,” in Proc. Military and Government Speech Technology, Nov. 1989, pp. 735–737.
[21] R. P. Ramachandran and P. Kabal, “Stability and performance analysis of pitch filters in speech coders,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, pp. 937–946, July 1987.
[22] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Boston, MA: Kluwer, 1992.
[23] J.-H. Chen and A. Gersho, “Adaptive post filtering for quality enhancement of coded speech,” IEEE Trans. Speech Audio Processing, vol. 3, pp. 59–71, Jan. 1995.
Stan McClellan (SM’90–M’95) received the B.S., M.S., and Ph.D. degrees, all in electrical engineering, from Texas A&M University, College Station, TX, in 1986, 1991, and 1995, respectively. He was with LTV Missiles and Electronics, Grand Prairie, TX, and General Dynamics, Ft. Worth, TX, and, at Texas A&M, he was a Research Assistant in the Telecommunications, Control, and Signal Processing Research Center, and an Assistant Lecturer for the Department of Electrical Engineering. He has been with the University of Alabama at Birmingham (UAB) since July 1995, where he is an Assistant Professor in the Department of Electrical and Computer Engineering and a Research Engineer in the UAB Center for Telecommunications Education and Research. His current research interests include digital signal processing, data compression, high-speed computer networking, and development of real-time medical applications for the Linux operating system. Dr. McClellan is a member of the IEEE Signal Processing, Communications, Information Theory, and Biomedical Engineering Societies.
Jerry D. Gibson (F’92) currently serves as Chairman of the Department of Electrical Engineering at Southern Methodist University, Dallas, TX. He has held positions at General Dynamics, Fort Worth, TX (1969–1972), the University of Notre Dame, Notre Dame, IN (1973–1974), the University of Nebraska, Lincoln (1974–1976), and during the fall of 1991, he was on sabbatical with the Information Systems Laboratory and the Telecommunications Program, Department of Electrical Engineering, Stanford University, Stanford, CA. From 1987 to 1997, he held the J. W. Runyon, Jr. Professorship in the Department of Electrical Engineering at Texas A&M University, College Station, TX. His research interests include data, speech, image, and video compression, multimedia over networks, wireless communications, information theory, and digital signal processing. He is coauthor of Introduction to Nonparametric Detection with Applications (Piscataway, NJ: IEEE Press, 1995), the author of the textbook Principles of Digital and Analog Communications (Englewood Cliffs, NJ: Prentice-Hall, 2nd ed., 1993), and coauthor of the book Digital Compression for Multimedia (San Mateo, CA: Morgan Kaufmann, 1998). He is Editor-in-Chief of The Mobile Communications Handbook (Boca Raton, FL: CRC, 1995) and Editor-in-Chief of The Communications Handbook (Boca Raton, FL: CRC, 1996). Dr. Gibson was Associate Editor of Speech Processing for the IEEE TRANSACTIONS ON COMMUNICATIONS from 1981 to 1985 and Associate Editor for Communications for the IEEE TRANSACTIONS ON INFORMATION THEORY from 1988 to 1991. He has served as a member of the Speech Technical Committee of the IEEE Signal Processing Society (1992–1995) and on the Editorial Board for the Proceedings of the IEEE (1991–1997). He is currently a member of the IEEE Information Theory Society Board of Governors (1990–1998). He served as President of the IEEE Information Theory Society in 1996.
In 1990, he received The Frederick Emmons Terman Award from the American Society for Engineering Education, and in 1992, was elected Fellow of the IEEE “for contributions to the theory and practice of adaptive prediction and speech waveform coding.” He was co-recipient of the 1993 IEEE Signal Processing Society Senior Paper Award for the speech processing area.
B. Keith Rutherford received the B.S. degree in electrical engineering from Auburn University in 1991, the M.S.E.E. degree from the University of Alabama, Huntsville, in 1998, and the M.B.A. degree from Auburn University, Auburn, AL, in 1998. He joined Motorola, Huntsville, in 1991. His primary assignment involved the hardware and software design of modem products, mainly for the European Community. In 1995, he joined Southern Research Technologies, Birmingham, AL, working on the design of video imaging products for use in military tracking systems. Currently, he is a Project Manager at Southern Computer Systems, Birmingham.