VARIABLE DIMENSIONAL ALGEBRAIC CELP CODING OF PROTOTYPE WAVEFORMS
Jongseo Sohn and Wonyong Sung
School of Electrical Engineering, Seoul National University, Seoul, Korea. e-mail: sohn,
[email protected].

ABSTRACT
We propose a variable dimensional algebraic codebook structure in order to quantize the prototype waveforms efficiently. The proposed algorithm adjusts the interval of candidate pulse positions in a codevector according to the pitch period. The analysis-by-synthesis search procedure is computationally efficient due to the characteristics of the algebraic codebook structure. We also develop a method to perceptually improve the basis pulse of codevectors, which enhances the reconstructed speech quality with little increase in computational complexity. The improved prototype waveform interpolation coder adopting the proposed methods achieves a high quality of speech at 4 kbps.
1. INTRODUCTION

Most current low bit rate vocoders, such as waveform interpolation (WI) coders, sinusoidal transform coders, multi-band excitation (MBE) coders, and mixed excitation linear prediction (MELP) coders, produce good speech quality but fail to achieve toll quality. Previous work suggests that this failure arises mainly because they do not transmit phase information, which is perceptually important, especially in voiced speech [1][2]. Thus, an effective coding method for the phase spectrum of the pitch cycle waveform is much needed. The phase spectrum can be quantized separately from the magnitude part using phase codebooks [1][3], but it is difficult to apply predictive coding schemes because of the modulo-$2\pi$ attribute of the phase signal.

Another method is the direct quantization of the pitch cycle waveforms, where the analysis-by-synthesis procedure of conventional CELP coders can be employed in addition to predictive coding. This approach was reported in the original work on the WI coder [4], where the prototype waveform is predicted from the previous prototype waveform and the prediction residual is vector quantized by the analysis-by-synthesis procedure. The dimension of the codebook is determined by the maximum pitch period allowed, and the codebook entries are stored in Fourier series coefficient form. For a prototype waveform whose dimension is shorter than the maximum, a truncated version of the stored codebook entry is used as the codevector. This truncated codevector, however, fails to retain the pulse-like nature of pitch cycle waveforms. The problem was solved in [4] by employing a two-stage codebook, where all the codevectors in the first-stage codebook describe only one pulse.

In this paper, we develop a variable dimensional algebraic codebook structure and its analysis-by-synthesis procedure to improve the prototype waveform interpolation (PWI) coder. Since the proposed codebook assumes a few pulses in each codevector, it represents well the pulse-like nature of pitch cycle waveforms. In addition, the merits of the time-domain algebraic CELP (ACELP) [5][6], such as computational and storage efficiency, are carried over into the proposed variable dimensional ACELP (VD-ACELP). We further improve the VD-ACELP by adapting the basis pulse to the perceptually weighted speech spectrum, which enhances the reconstructed speech quality while requiring little computational overhead. A complete VD-ACELP algorithm and performance improvement methods are presented in the following sections.
2. VARIABLE DIMENSIONAL ACELP CODING OF PROTOTYPE WAVEFORMS

2.1. Variable dimensional ACELP principles

In the analysis-by-synthesis procedure of conventional CELP coders, the weighted synthesis filtering is implemented as follows. First, the contribution of the filter memory is removed from the target signal; then a linear convolution between the filter impulse response and the codevector is performed. For the encoding of prototype waveforms, however, this process is replaced by a circular convolution to take into account the periodic nature of prototype waveforms [4]. Let us denote the z-transform of the LPC synthesis filter as $1/A(z)$, and that of the error weighting filter as
$$W(z) = \frac{A(z/\gamma_2)}{A(z/\gamma_1)} \tag{1}$$

with $0 < \gamma_1 < \gamma_2 < 1$. The codebook search minimizes the following weighted mean square error (MSE):

$$E_i = \| H t - g H c_i \|^2, \tag{2}$$

where $g$ is the codebook gain, $t$ and $c_i$ are the Fourier series coefficient vectors of the target signal and the codevector, respectively, and $H$ is a diagonal matrix whose diagonal is the periodic impulse response of the weighted synthesis filter, i.e., the $k$-th element of its diagonal is given by

$$H_k = \left. \frac{W(z)}{A(z)} \right|_{z = e^{j 2\pi k / L}}. \tag{3}$$

It can be shown that the optimal codeword maximizes

$$\frac{|\Gamma_i|^2}{\Phi_i} \triangleq \frac{\left| \sum_{k=0}^{L-1} |H_k|^2 \left( T_k^* C_{i,k} + T_k C_{i,k}^* \right) \right|^2}{\sum_{k=0}^{L-1} |H_k|^2 |C_{i,k}|^2}, \tag{4}$$

where $T_k$ and $C_{i,k}$ denote the $k$-th components of $t$ and $c_i$, respectively, and $^*$ represents the complex conjugate operator.

For the design of a codebook for (4), we have to consider some requirements on the codebook structure. The codebook should have a computationally efficient structure to be feasible. In addition, it should be able to cope with the dimension variation due to the pitch change. To meet these requirements we propose a variable dimensional algebraic codebook structure, where the time resolution of the allowed pulse positions is dynamically scaled according to the dimension variation. In other words, there is a constant number, $M$, of candidate pulse positions regardless of the pitch period $L$, and the candidate positions can be any set of $M$ equally spaced (in the $L$-periodic sense) grid points, e.g.,

$$\left\{ \tau,\ \tau + \frac{L}{M},\ \tau + \frac{2L}{M},\ \ldots,\ \tau + \frac{(M-1)L}{M} \right\}, \tag{5}$$

where $0 \le \tau < L/M$. We take $\tau = 0$ since it is only a nuisance parameter that has no effect on the following derivation or on the coder performance. Let us denote the number of time samples from the zeroth position grid to the $l$-th position grid as $g(l) = lL/M$, which is not necessarily an integer, and denote the Fourier series coefficient vector of an impulse delayed by $g(l)$ samples as $p_l = (P_{l,0}, \ldots, P_{l,L-1})$. Its $k$-th element is then given by $P_{l,k} = e^{-j 2\pi k g(l)/L} = e^{-j 2\pi k l / M}$. We allow only $N$ pulses in each codevector and compose the codevectors with $N$ arbitrary pulses as

$$c_i = \sum_{n=0}^{N-1} \sigma_{l_n} p_{l_n}, \tag{6}$$

where $\sigma_{l_n}$ is the sign of $p_{l_n}$. The codebook index $i$ is determined by the grid indices $l_n$ and the signs $\sigma_{l_n}$ of the selected pulses. Substituting (6) into the numerator of (4) and exchanging the order of summations, we obtain $\Gamma_i = \sum_{n=0}^{N-1} d(l_n)$, where

$$d(l) \triangleq \sigma_l \sum_{k=0}^{L-1} |H_k|^2 \left( T_k^* P_{l,k} + T_k P_{l,k}^* \right). \tag{7}$$

Prior to the codebook search, the $d(l)$ values for $l = 0, \ldots, M-1$ are pre-computed to minimize the search complexity, like the backward filtering in the time-domain ACELP [5]. (In (7), the number of terms in the summation over $k$ can be reduced to $L/2$ by exploiting the complex conjugate symmetry of the Fourier series coefficients of a real signal.) Similarly, the denominator of (4) can be re-arranged as

$$\Phi_i = \sum_{n,k} |H_k|^2 |P_{l_n,k}|^2 + \sum_{m<n} \sigma_{l_m} \sigma_{l_n} \sum_{k} |H_k|^2 \left( P_{l_m,k} P_{l_n,k}^* + P_{l_m,k}^* P_{l_n,k} \right) \tag{8}$$

$$\phantom{\Phi_i} = N \phi(0) + 2 \sum_{m<n} \sigma_{l_m} \sigma_{l_n}\, \phi(l_m - l_n), \tag{9}$$

where

$$\phi(l) = \sum_{k} |H_k|^2 \, \mathrm{Re}\{ P_{l,k} \}. \tag{10}$$
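As a rough illustration of how the quantities in (3) and (10) might be evaluated, the following Python sketch computes $|H_k|^2$ from a set of LPC coefficients and then tabulates the autocorrelation-like function of (10). The coefficient convention $A(z) = 1 + \sum_i a_i z^{-i}$, the filter coefficients, the weighting factors, and the values of $L$ and $M$ are all assumptions chosen for the example, and the helper names are hypothetical.

```python
import cmath
import math

def lpc_poly(a, gamma, k, L):
    """A(z/gamma) at z = exp(j*2*pi*k/L), assuming A(z) = 1 + sum_i a[i-1] * z**(-i)."""
    w = 2.0 * math.pi * k / L
    return 1.0 + sum(c * gamma ** i * cmath.exp(-1j * w * i)
                     for i, c in enumerate(a, start=1))

def weighted_gains(a, gamma1, gamma2, L):
    """|H_k|^2 for k = 0..L-1, with H_k = W(z)/A(z) and W(z) = A(z/gamma2)/A(z/gamma1)."""
    gains = []
    for k in range(L):
        h = lpc_poly(a, gamma2, k, L) / (lpc_poly(a, gamma1, k, L) * lpc_poly(a, 1.0, k, L))
        gains.append(abs(h) ** 2)
    return gains

def phi(l, gains, M):
    """phi(l) = sum_k |H_k|^2 * Re{P_{l,k}}, where P_{l,k} = exp(-j*2*pi*k*l/M)."""
    return sum(g * math.cos(2.0 * math.pi * k * l / M) for k, g in enumerate(gains))

# Hypothetical example values: 2nd-order LPC filter, pitch period L = 40, M = 48 grids.
a = [-1.2, 0.6]
L, M = 40, 48
gains = weighted_gains(a, gamma1=0.6, gamma2=0.9, L=L)
phi_table = [phi(l, gains, M) for l in range(M)]
```

In this sketch the table is symmetric in the grid index modulo $M$ and peaks at index 0, which is what one would expect from an autocorrelation.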
Among the $\phi(l)$ values for $l = 0, \ldots, M-1$, only the elements actually needed, which depend on the constraint on the pulse positions, are computed before the search loop begins. The function $\phi(l)$ can be interpreted as the autocorrelation of the periodic impulse response of the weighted synthesis filter. A beneficial property of this periodic analysis-by-synthesis procedure is the stationarity of $\phi(l)$, i.e., it depends only on the position grid difference $l_m - l_n$, or equivalently on the time difference $g(l_m - l_n) = g(l_m) - g(l_n)$ between the two pulse positions $g(l_m)$ and $g(l_n)$; this property does not hold in the time-domain ACELP. Thus, the computational load of the periodic ACELP is much lower than that of the time-domain ACELP. For example, in the worst case, where there is no constraint on the pulse positions, the periodic ACELP requires only $M$ autocorrelations, while the time-domain ACELP requires $M(M-1)$ cross-correlations for all the possible combinations of $l_m$ and $l_n$.

Like the time-domain ACELP codebook, the periodic ACELP requires no storage for codebook entries. The actual value of the pulse vector $p_l$ is used only to compute $d(l)$ and $\phi(l)$ for each $l = 0, \ldots, M-1$. When calculating $d(l)$ and $\phi(l)$, we can generate $p_l$ by simply reading the stored $M$ cosine values at $0, 2\pi/M, \ldots, 2\pi(M-1)/M$ in an $l$-interleaved and modulo-$M$ way.
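The search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the coder's actual implementation: the search is exhaustive, the pulses are restricted to interleaved tracks, the gains standing in for $|H_k|^2$ and the target spectrum are synthetic, and all function names are hypothetical. The point of the sketch is that the inner loop touches only the pre-computed tables, with the cross terms indexed by the grid difference modulo $M$.

```python
import cmath
import math
from itertools import product

def precompute(gains, T, M):
    """Return the unsigned d(l) of (7) and phi(l) of (10) for l = 0..M-1."""
    d, phi = [], []
    for l in range(M):
        P = [cmath.exp(-2j * math.pi * k * l / M) for k in range(len(gains))]
        d.append(sum(2.0 * g * (t.conjugate() * p).real for g, t, p in zip(gains, T, P)))
        phi.append(sum(g * p.real for g, p in zip(gains, P)))
    return d, phi

def fast_search(d, phi, M, N):
    """Maximize Gamma^2 / Phi using only table look-ups inside the loop."""
    best_score, best_code = -1.0, None
    tracks = [range(n, M, N) for n in range(N)]  # interleaved position tracks
    for pos in product(*tracks):
        for signs in product((1, -1), repeat=N):
            gamma = sum(s * d[l] for s, l in zip(signs, pos))
            big_phi = N * phi[0] + 2.0 * sum(
                signs[m] * signs[n] * phi[(pos[m] - pos[n]) % M]
                for m in range(N) for n in range(m + 1, N))
            score = gamma * gamma / big_phi
            if score > best_score:
                best_score, best_code = score, (pos, signs)
    return best_score, best_code

def direct_score(gains, T, M, pos, signs):
    """Criterion (4) evaluated directly from the codevector spectrum C_{i,k}."""
    num = den = 0.0
    for k, (g, t) in enumerate(zip(gains, T)):
        c = sum(s * cmath.exp(-2j * math.pi * k * l / M) for s, l in zip(signs, pos))
        num += 2.0 * g * (t.conjugate() * c).real
        den += g * abs(c) ** 2
    return num * num / den

# Hypothetical toy problem: L = 10 harmonics, M = 12 grids, N = 2 pulses.
L, M, N = 10, 12, 2
gains = [1.0 + 0.5 * math.cos(2.0 * math.pi * k / L) for k in range(L)]  # stand-in for |H_k|^2
T = [cmath.exp(-2j * math.pi * k * 5 / M) - 0.7 * cmath.exp(-2j * math.pi * k * 2 / M)
     for k in range(L)]  # stand-in target spectrum
d, phi = precompute(gains, T, M)
best_score, (best_pos, best_signs) = fast_search(d, phi, M, N)
```

A codevector and its negation give the same score because the numerator is squared, so only the score, not the sign pattern, is unique here. Practical ACELP searches normally replace the exhaustive loops with a pruned or focused search.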
2.2. Perceptually improved basis pulse
Although the analysis-by-synthesis procedure of the VD-ACELP selects the codevector that minimizes the perceptually weighted MSE, it does not optimize the codevector itself. Each codevector consists of pulses that are time-shifted versions of one basis pulse having a flat magnitude spectrum. Thus, we develop a method to improve the basis pulse, in which the spectrum of the basis pulse is adapted to the perceptually weighted LPC spectrum. The weighted LPC filter, which is commonly used in CELP coders [7], is given by

$$E(z) = \frac{A(z/\lambda_1)}{A(z/\lambda_2)}, \tag{11}$$

with $0 < \lambda_1 < \lambda_2 < 1$. The weighting factors $\lambda_1$ and $\lambda_2$ may be, but need not be, the same as the corresponding factors in (1). The new basis vector $\tilde{p}_0$ is obtained by $\tilde{p}_0 = E p_0$, where $E$ is also a diagonal matrix with its $k$-th diagonal element given by

$$E_k = \left. E(z) \right|_{z = e^{j 2\pi k / L}}. \tag{12}$$

Consequently, we have

$$\tilde{p}_{l_n} = E p_{l_n}. \tag{13}$$

Substituting (13) into (2) through (6), we obtain the new weighted MSE equation,

$$\tilde{E} = \| H t - g H E c \|^2, \tag{14}$$

which can be rewritten as

$$\tilde{E} = \| \tilde{H} \tilde{t} - g \tilde{H} c \|^2, \tag{15}$$

where

$$\tilde{H} = H E, \qquad \tilde{t} = E^{-1} t. \tag{16}$$

Note that $\tilde{H}$ is also diagonal since both $H$ and $E$ are diagonal matrices. Equation (15) has exactly the same form as (2), except that $H$ and $t$ are replaced with $\tilde{H}$ and $\tilde{t}$, respectively. The codebook search procedure with the new basis pulse is now straightforward. First, modify $H$ and $t$ according to (16). This does not need much computation since $E$ is a diagonal matrix. Then, apply the fast ACELP codebook search procedure with the spectrally flat basis pulse. Finally, obtain the final codevector $\tilde{c}_{opt}$ by applying the perceptual weighting filter to the selected codevector $c_{opt}$, i.e., $\tilde{c}_{opt} = E c_{opt}$.
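The equivalence between (14) and (15) can be checked numerically with a short Python sketch. Everything concrete here is an assumption made for illustration: the LPC coefficients, the weighting factors, the stand-in diagonal $H$, the synthetic target, and the helper names. Diagonal matrices are stored as lists of their diagonal elements.

```python
import cmath
import math

def lpc_poly(a, gamma, k, L):
    """A(z/gamma) at z = exp(j*2*pi*k/L), assuming A(z) = 1 + sum_i a[i-1] * z**(-i)."""
    w = 2.0 * math.pi * k / L
    return 1.0 + sum(c * gamma ** i * cmath.exp(-1j * w * i)
                     for i, c in enumerate(a, start=1))

def weighted_error(H, t, c, g):
    """||H t - g H c||^2 for a diagonal H stored as a list of its diagonal elements."""
    return sum(abs(h * (tk - g * ck)) ** 2 for h, tk, ck in zip(H, t, c))

# Hypothetical example values.
a, L, M = [-1.2, 0.6], 16, 24
lam1, lam2 = 0.5, 0.8
E = [lpc_poly(a, lam1, k, L) / lpc_poly(a, lam2, k, L) for k in range(L)]
H = [1.0 / lpc_poly(a, 0.9, k, L) for k in range(L)]  # any diagonal H works for this check
t = [cmath.exp(-2j * math.pi * k * 7 / M) * (1.0 + 0.1 * k) for k in range(L)]
c = [cmath.exp(-2j * math.pi * k * 3 / M) for k in range(L)]  # one flat-spectrum pulse
g = 0.8

# Step 1: fold E into H and t, as in (16).
H_tilde = [h * e for h, e in zip(H, E)]
t_tilde = [tk / e for tk, e in zip(t, E)]

# Step 2: the ordinary search would run on (H_tilde, t_tilde) with flat pulses;
# here we only evaluate the error of (15) for one candidate codevector.
err_15 = weighted_error(H_tilde, t_tilde, c, g)

# Step 3: shape the selected codevector with E and evaluate the error of (14).
c_shaped = [e * ck for e, ck in zip(E, c)]
err_14 = weighted_error(H, t, c_shaped, g)
```

The two error values agree up to rounding, so the search code itself does not need to change; only its inputs do.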
3. DESIGN OF A VARIABLE DIMENSIONAL ACELP PWI CODER AT 4 KBPS

A 4 kbps PWI coder employing the proposed VD-ACELP algorithm was developed. The coder operates on 20 ms frames and requires a 7.5 ms look-ahead for spectral and pitch analysis. The PWI coding method is applied only to voiced speech, and a conventional CELP coder is used for the unvoiced part of the signal. This CELP coder uses only the fixed codebook; the pitch predictor is not used. The fixed codebook is a sparse overlapped codebook. The codebook and gain indices are updated every 5 ms.

The PWI coder developed in [4] updates the prototype waveforms at 33-50 Hz, which is a typical frame rate of speech coders. Such a slow update rate causes occasional buzziness, which was alleviated in [4] by adjusting the signal-to-change ratio (SCR) between consecutive quantized prototype waveforms. In our implementation, however, the prototype waveforms are updated more frequently, at 100 Hz, for two reasons. First, when we synthesized speech using unquantized prototype waveforms at this or a faster update rate, we did not find any tonal artifact in the reconstructed speech as long as the coding mode was properly chosen. From this experiment, we are convinced that the SCR adjustment and the waveform decomposition [8] are largely unnecessary when the prototype waveforms are quantized faithfully at a sufficiently fast update rate.

The second reason is quantization efficiency. Within a bit budget of 2.4 kbps for prototype waveform quantization, we quantized prototype waveforms using VD-ACELP at two different update rates. At a 100 Hz update rate, 24 bits are assigned to each prototype waveform, while at 50 Hz, 48 bits are assigned. Since it is difficult to design a single 48-bit codebook, we employed a two-stage codebook for the 50 Hz case. In this experiment, the synthesized speech quality at 100 Hz was much better than at 50 Hz regardless of the pitch period, so we chose the 100 Hz update rate. This result might be due to the suboptimality of the two-stage codebook; a more carefully designed codebook could change the result, at least for male speech.

In voiced frames, the pitch period is determined and quantized using 7 bits. The number of pitch cycle waveforms to be extracted is calculated according to the pitch period. Every 10 ms, i.e., twice per frame, pitch cycle waveforms are extracted, aligned, and gain-normalized. The shape of the prototype waveform is obtained by averaging the normalized pitch cycle waveforms with proper weighting. The gain of the prototype waveform is estimated in an analysis-by-synthesis manner, and the prototype waveform is multiplied by the gain. This gain-scaled prototype waveform is predicted from the previous quantized prototype waveform, and the prediction residual is quantized with the proposed VD-ACELP. Codebooks of different structure are employed according to the dimension of the prototype waveform, as shown in Tables 1, 2, and 3. When the dimension is larger than 50, an 18-bit codebook of 96 grids and a 12-bit codebook of 60 grids are applied to the quantization of the first and the second prototype waveforms, respectively. Otherwise, a 15-bit codebook of 48 grids is used to quantize both prototype waveforms.
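To make the codebook selection rule concrete, the following Python sketch maps a prototype waveform dimension to the grid counts stated above and computes the resulting spacing $L/M$ between adjacent candidate pulse positions. The function names and the sample pitch periods are illustrative assumptions; the thresholds and grid counts are the ones quoted in the text.

```python
def grid_counts(pitch_period):
    """Grid counts M for the (first, second) prototype waveform of a frame."""
    return (96, 60) if pitch_period > 50 else (48, 48)

def grid_spacing(pitch_period, M):
    """Distance between adjacent candidate pulse positions, in samples."""
    return pitch_period / M

# The pitch periods below are illustrative only.
for L in (30, 50, 51, 80, 120):
    first, second = grid_counts(L)
    print(L, round(grid_spacing(L, first), 3), round(grid_spacing(L, second), 3))
```

For a fixed codebook the spacing grows linearly with the pitch period, so long pitch periods are covered with a coarser time resolution.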
Table 1: Structure of the 12-bit algebraic codebook, M = 60

Pulse   Sign   Positions           Bits
0       ±1     0, 2, 4, ..., 62    1+5
1       ±1     1, 3, 5, ..., 63    1+5

Table 2: Structure of the 15-bit algebraic codebook, M = 48

Pulse   Sign   Positions           Bits
0       ±1     0, 3, 6, ..., 45    1+4
1       ±1     1, 4, 7, ..., 46    1+4
2       ±1     2, 5, 8, ..., 47    1+4

Table 3: Structure of the 18-bit algebraic codebook, M = 96

Pulse   Sign   Positions           Bits
0       ±1     0, 3, 6, ..., 93    1+5
1       ±1     1, 4, 7, ..., 94    1+5
2       ±1     2, 5, 8, ..., 95    1+5

Although the PWI coder transmits the information on the phase spectrum relevant to the shape of the pitch cycle waveform, it does not transmit the linear phase information. Therefore, the reconstructed signal is usually asynchronous with the original one in voiced frames. This may cause a signal discontinuity at the frame boundary when the coder switches between the PWI and CELP modes. Similar problems are addressed in coders that use both frequency- and time-domain coding methods for different portions of the signal [9][10]. To reduce the artifacts at these frame boundaries, we try to keep the synthesized and original signals synchronized. In the first PWI frame after a switch from CELP, the first prototype waveform is not aligned to the previous one, so that synchrony is maintained. At each frame, the time difference between the synthesized and original signals is measured at the encoder, and the pitch in the next frame is slightly modified to reduce this time difference [11]. This modified pitch is quantized and transmitted.

The bit allocation among the model parameters of the developed PWI coder is summarized in Table 4. The LPC coefficients are transformed to line spectral frequencies (LSFs), which are quantized using a 24-bit split vector quantizer [12]. The prediction gain and the fixed codebook gain are vector quantized using 9 bits.

Table 4: Bit allocation of the proposed 4 kbps coder

Parameter          PWI                   CELP
Mode               1 bit                 1 bit
LSFs               24 bits               24 bits
Pitch              7 bits                -
Gain               9+9 bits              5+5+5+5 bits
Codebook index     15+15 (18+12) bits    8+9+9+9 bits
Total per frame    80 bits               80 bits
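The bit budget can be cross-checked with a few lines of Python. The dictionaries simply restate the entries of Tables 1 to 4; the names are hypothetical and chosen for this sketch only.

```python
FRAME_MS = 20

PWI_BITS = {"mode": 1, "lsf": 24, "pitch": 7, "gain": 9 + 9, "codebook": 15 + 15}
PWI_LONG_PITCH_BITS = dict(PWI_BITS, codebook=18 + 12)  # dimension larger than 50
CELP_BITS = {"mode": 1, "lsf": 24, "gain": 5 + 5 + 5 + 5, "codebook": 8 + 9 + 9 + 9}

def bits_per_frame(allocation):
    """Total number of bits spent in one frame."""
    return sum(allocation.values())

def bit_rate_bps(bits, frame_ms=FRAME_MS):
    """Bit rate in bits per second for a given number of bits per frame."""
    return bits * 1000 // frame_ms

def codebook_bits(pulses, position_bits):
    """Each pulse costs one sign bit plus its position bits."""
    return pulses * (1 + position_bits)

for name, alloc in (("PWI", PWI_BITS),
                    ("PWI, dimension > 50", PWI_LONG_PITCH_BITS),
                    ("CELP", CELP_BITS)):
    total = bits_per_frame(alloc)
    print(name, total, bit_rate_bps(total))
```

Every mode adds up to 80 bits per 20 ms frame, i.e., 4 kbps. The gain and codebook index bits of the PWI mode together amount to 48 bits per frame, which appears consistent with the 2.4 kbps budget for prototype waveform quantization mentioned in Section 3.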
4. SUBJECTIVE TEST RESULTS AND CONCLUSIONS

We conducted forced-choice A-B comparison tests using 10 sentence pairs. Each sentence was uttered by 10 different speakers, 5 female and 5 male. The reference coder was the ITU-T G.723.1 CELP coder in its 6.3 kbps mode. The pairs were presented to 8 subjects in random order. The preference test results are given in Table 5, where the proposed 4 kbps PWI coder shows slightly better performance than the toll-quality 6.3 kbps G.723.1 coder. In these tests, the proposed PWI coder performed better for female speech than for male speech. This is probably because the time resolution of the pulse positions in the VD-ACELP codebook decreases as the pitch period increases. Therefore, for low-pitched speech, it seems more important to describe the details of the entire prototype waveform than to update the waveforms more frequently. Our future research will focus on this topic.

Table 5: Preference test results

           6.3 kbps G.723.1    Proposed 4 kbps PWI
Female     40.00 %             60.00 %
Male       57.50 %             42.50 %
Average    48.75 %             51.25 %
REFERENCES

[1] O. Gottesman, "Dispersion phase vector quantization for enhancement of waveform interpolative coder," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 1999.
[2] H. Pobloth and W. B. Kleijn, "On phase perception in speech," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 1999.
[3] Y. Jiang and V. Cuperman, "Encoding prototype waveforms using a phase codebook," in Proc. IEEE Workshop on Speech Coding for Telecomm., 1995, pp. 21-25.
[4] W. B. Kleijn, "Encoding speech using prototype waveforms," IEEE Trans. Speech Audio Processing, vol. 1, no. 4, pp. 386-399, 1993.
[5] J-P. Adoul, P. Mabilleau, M. Delprat, and S. Morissette, "Fast CELP coding based on algebraic codes," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 1987, pp. 1953-1956.
[6] C. Laflamme, J-P. Adoul, S. Morissette, R. Salami, and P. Mabilleau, "16 kbps wideband speech coding technique based on algebraic CELP," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 1991, pp. 13-17.
[7] J-H. Chen and A. Gersho, "Adaptive postfiltering for quality enhancement of coded speech," IEEE Trans. Speech Audio Processing, vol. 3, no. 1, pp. 59-71, 1995.
[8] W. B. Kleijn and J. Haagen, "A speech coder based on decomposition of characteristic waveforms," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 1995, pp. 508-511.
[9] E. Shlomot, V. Cuperman, and A. Gersho, "Combined harmonic and waveform coding of speech at low bit rates," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 1998, pp. 585-588.
[10] J. Sohn and W. Sung, "A low resolution pulse position coding method for improved excitation modeling of speech transition," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 1999.
[11] W. B. Kleijn, P. Kroon, L. Cellario, and D. Sereno, "A 5.85 kbps CELP algorithm for cellular applications," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 1993, pp. 596-599.
[12] K. K. Paliwal and B. S. Atal, "Efficient vector quantization of LPC parameters at 24 bits/frame," IEEE Trans. Speech Audio Processing, vol. 1, no. 1, pp. 3-14, Jan. 1993.