Application of the Discrete Laguerre Transform to Speech Coding

Giridhar Mandyam, Nasir Ahmed and Neeraj Magotra
Dept. of Electrical and Computer Engineering
The University of New Mexico
Albuquerque, New Mexico 87131

Supported by NASA Grant NAGW 3293, obtained through the Microelectronics Research Center, The University of New Mexico.
Abstract
The discrete Laguerre transform (DLT) displays characteristics which make it amenable to the coding of speech. In this paper, the DLT is used in a classic transform-coding method for speech and compared to the discrete cosine transform (DCT), which has been widely applied to speech coding. The results show promise, as the DLT outperforms the DCT at very low bitrates.
1 Introduction
Recently, there has been a growing interest in transform-based techniques for speech coding (also known as frequency-domain coding). Transform-based techniques achieve decorrelation without the need for determining correlation statistics for each data frame, and also do not usually require pitch detection. However, transform-based techniques suffer from decreased performance at low bitrates [1]-[3]. The discrete cosine transform (DCT) has been widely implemented for frequency-domain speech coding, due in part to its simplicity and efficiency in computation. Our objective here is to compare the performance of the discrete Laguerre transform (DLT) with that of the DCT in the interest of providing better performance at low bitrates. This is accomplished by comparing the two transforms using the transform-coding method of Zelinski and Noll [4].
2 The Discrete Laguerre Transform
The DLT is a discrete transform derived from the set of functions, orthonormal over (0, \infty), known as the Laguerre functions; the n'th Laguerre function is given by

    l_n(p; x) = (-1)^n \sqrt{2p}\, \psi_n(2px)                                   (1)

where \psi_n(x) = e^{-x/2} L_n(x), L_n(x) = \frac{e^x}{n!} \frac{d^n}{dx^n}\left(x^n e^{-x}\right), and p is a nonzero constant. If we wish to derive the N by N unitary DLT matrix, we must first find the N distinct roots of the polynomial L_N(x), \{x_i\}, 0 \le i < N. This can be done by sampling the Laguerre functions from the 0'th function to the (N-1)'th function at the points \{x_i\} and forming the matrix L'_N, whose entries are

    L'_N(i, j) = l_i(p; x_j), \qquad 0 \le i, j < N                              (2)

This matrix has orthogonal rows; however, one can find a set of constants \{A_i\} such that if these constants are multiplied by the columns of L'_N, the resulting matrix L_N is unitary [5]:

    L_N(0{:}(N-1), i) = A_i\, L'_N(0{:}(N-1), i), \qquad 0 \le i < N             (3)

Given a data sequence, if we divide it into N-point segments and form each segment as an N by 1 vector x, the N-point DLT of each segment l_N is found as

    l_N = L_N x                                                                  (4)

The inverse transform can be accomplished by the following equation:

    x = L_N^T l_N                                                                (5)

where L_N^T is the transpose of L_N. It has been shown that the DLT is invariant with respect to nonzero values of p [5]. There are algorithms one may use to decrease the computation involved in this transform; these are discussed more fully in [5].
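For concreteness, the following Python sketch builds the unitary DLT matrix described above and applies (4) and (5) to one segment. It relies on NumPy's Laguerre-polynomial routines; the function name dlt_matrix and the use of column normalization in place of explicitly computed constants A_i are illustrative assumptions rather than the authors' implementation. The common factors \sqrt{2p} and e^{-x/2}, which are constant within each column and cancel under that normalization (consistent with the p-invariance of the DLT), are omitted.

```python
import numpy as np
from numpy.polynomial import laguerre

def dlt_matrix(N):
    """Build an N x N orthogonal (real unitary) DLT matrix following Section 2."""
    # N distinct roots of the Laguerre polynomial L_N(x).
    roots = laguerre.lagroots([0] * N + [1])
    # Sample the Laguerre functions of orders 0..N-1 at the roots.  The
    # per-column factors sqrt(2p) and exp(-x/2) cancel under the column
    # normalization below, which stands in for the constants A_i of (3).
    L = np.zeros((N, N))
    for n in range(N):
        basis = [0] * n + [1]                        # selects L_n in the Laguerre basis
        L[n, :] = ((-1) ** n) * laguerre.lagval(roots, basis)
    L /= np.linalg.norm(L, axis=0, keepdims=True)    # normalize columns -> unitary
    return L

# Forward and inverse transform of one N-point segment, per (4) and (5).
N = 8
L_N = dlt_matrix(N)
x = np.random.randn(N)
l_N = L_N @ x            # eq. (4)
x_hat = L_N.T @ l_N      # eq. (5); recovers x up to floating-point error
```

The round trip at the end recovers the input segment to within floating-point error, reflecting the unitarity of L_N.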
3 The Speech Coding Algorithm
The transform-coding algorithm given in [4] is briefly described here. The fundamental goal is to form an optimal bitmap for each block of data. Huang and Schultheiss [6] formed an optimal bit assignment rule for blocks of Gaussian random variables. If one assumes that the result of the transform operation is roughly Gaussian in nature, then this same rule can be applied. If the variance of the k-th element of each transformed block is denoted by \sigma_k^2, then the optimal bit assignment rule for the k-th coefficient is [6]

    b_k = b_{avg} + \frac{1}{2} \log_2 \frac{\sigma_k^2}{\left( \prod_{j=0}^{K-1} \sigma_j^2 \right)^{1/K}} \quad \text{bits/sample}          (6)

where b_{avg} is the desired bitrate per coefficient. Once the desired number of bits is assigned, one can use an optimal quantizer to code the transformed values, such as a Lloyd-Max quantizer [7]. This bit assignment rule is not guaranteed to result in integer values; rounding values to the nearest integer or setting negative bit assignments to zero could compromise the desired bitrate. Moreover, in handling these problems to achieve the desired bitrate, several re-optimization schemes can be used [8], each yielding different results; choosing one particular method requires effort. Therefore, in [4], the "water-filling procedure" [8] was proposed to handle this. This involves disregarding transformed coefficients whose variance, as used in (6), is smaller than the average distortion. It has been shown that an effective approximation for the average distortion D is [6]

    D \approx Q \, 2^{-2 b_{avg}} \left( \prod_{i=0}^{K-1} \sigma_i^2 \right)^{1/K}          (7)

where Q is a constant whose value depends on b_i; its values are given in the tables provided in [7] for optimal Gaussian quantizers. From this definition, it is clear that the average distortion measure can change with each coefficient; however, the constant Q varies slowly with b_i, and only over a relatively small range.
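As an illustration of (6), a minimal Python sketch (with an assumed function name) that returns the real-valued bit assignments for one block of coefficient variances:

```python
import numpy as np

def optimal_bit_assignment(variances, b_avg):
    """Real-valued Huang-Schultheiss bit assignments of (6) for one block."""
    log_var = np.log2(np.asarray(variances, dtype=float))
    # (prod sigma_j^2)^(1/K) computed as a geometric mean in the log domain.
    return b_avg + 0.5 * (log_var - log_var.mean())
```

The rounding of these values to nonnegative integers is deferred to the re-optimization step of Section 3.2.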
3.1 Spectrum Determination
Estimation of the variance values \sigma_i^2 seems to be a formidable task. However, speech is highly nonstationary, meaning that maintaining long-term statistics for speech data is useless, as the statistical behaviour of speech is highly localized. Zelinski and Noll [4] proposed using the squared magnitude of each transform value as the local estimate of the coefficient variance. However, both the transmitter and receiver require this information in order to determine the number of bits used for coding each coefficient. Therefore, what was proposed was to downsample the spectrum and then upsample it by linear interpolation. Recall that the spectrum is only used to determine the number of bits assigned to each transformed coefficient, i.e. the number of quantization levels assigned to each coefficient, and is not used to determine the actual value of the coefficient. Therefore, the effect of introducing error into the formulation of the spectrum, which is a necessary side-effect of downsampling and interpolation, is alleviated. What was proposed is to average every N/M consecutive values into one value, where N is the transform size and M is the number of side-information values per block. Moreover, Zelinski and Noll proposed that interpolation be performed in the logarithmic (base-2) domain, so that a smoother spectrum is realized. The values that represent the samples of the original spectrum, through which the estimated spectrum is created, are referred to as side information.
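A minimal sketch of this spectrum estimate, assuming a transform block coeffs of length N and M side-information values per block; the helper name, the mid-bin support positions, and the small constant guarding log2(0) are assumptions:

```python
import numpy as np

def estimate_spectrum(coeffs, M):
    """Side information and interpolated variance spectrum per Section 3.1."""
    spectrum = np.asarray(coeffs, dtype=float) ** 2   # local variance estimates
    N = spectrum.size
    # Downsample: average every N/M consecutive values into one value.
    side_info = spectrum.reshape(M, N // M).mean(axis=1)
    # Upsample by linear interpolation in the base-2 logarithmic domain.
    log_side = np.log2(side_info + 1e-12)
    support = (np.arange(M) + 0.5) * (N // M)         # assumed support positions
    log_estimate = np.interp(np.arange(N), support, log_side)
    return side_info, 2.0 ** log_estimate
```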
3.2 Re-optimization of the Bit Assignments
The bit assignment rule in (6) suffers from the problem of re-optimization, due mainly to the constraint of nonnegative integer values for bit assignments. Rounding to nonnegative integer values can cause the average bit rate to increase. As a result, several re-optimization schemes have been developed. One of the re-optimization schemes involves discarding coefficients whose variance is smaller than the average distortion. Sometimes this still yields an excess of bits assigned to high-variance coefficients.
This problem can be alleviated by capping the number of bits assigned to any particular coefficient [8]. However, even this fix cannot guarantee that an excess of bits will not be assigned to the transformed values, thus compromising the desired bitrate. What is proposed in these cases is to subtract bits from transform coefficients which have been assigned the maximum number of bits allowed, until the specified bitrate per block is achieved. Progressing from the lowest to the highest frequency coefficients, exactly one bit is subtracted from each "maximally-quantized" coefficient until the desired bitrate is achieved.
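A hedged sketch of this re-optimization follows. The per-coefficient cap b_max is an assumed parameter, and the sketch removes one bit at a time from the coefficients currently holding the largest assignment (a slight generalization of removing only from maximally-quantized coefficients, assumed here so the loop always terminates):

```python
import numpy as np

def reoptimize_bits(raw_bits, b_avg, b_max=5):
    """Turn the real-valued assignments of (6) into integers meeting the block budget."""
    budget = int(round(b_avg * len(raw_bits)))             # total bits for the block
    # Round to nonnegative integers and cap the per-coefficient assignment.
    bits = np.clip(np.round(raw_bits).astype(int), 0, b_max)
    # While the block exceeds its budget, walk from the lowest to the highest
    # frequency coefficient, removing one bit from each coefficient at the
    # current maximum assignment.
    while bits.sum() > budget:
        current_max = bits.max()
        removed = False
        for k in range(len(bits)):
            if bits[k] == current_max and bits[k] > 0:
                bits[k] -= 1
                removed = True
                if bits.sum() <= budget:
                    break
        if not removed:                                     # nothing left to remove
            break
    return bits
```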
3.3 Coding of Side Information
The coding of the side information influences the compression performance of transform-based methods. One needs to find a coding method which not only provides significant compression, but also does not increase coding losses beyond acceptable levels. Another aspect of side-information coding is that the distribution of the side information changes depending on the type of transform used. To illustrate, the average log-spectrum for a 128-point DCT of a 4.13 second speech signal (8 kHz, 8 bit quantization) is plotted in Figure 1, and for a 128-point DLT in Figure 2. The distributions are very different: the DLT's log-spectrum appears linear, while the log-spectrum of the DCT looks closer to a Laplacian or Gaussian distribution. One method that has been suggested for coding the side information associated with a DCT operation is to code on a logarithmic scale [1]. This would mean having discrete quantization levels on an exponential scale; for instance, a possible two-bit quantization scheme could use the levels {0.001, 0.01, 0.1, 1.0}. Two problems with this type of quantization scheme are: (1) the quantization levels must be experimentally determined, and (2) as seen by observing the distribution of the DLT log-spectrum in the previous example, the exponential distribution is not necessarily applicable to the log-spectra of other transforms. One could also use a codebook to determine the quantization of the side information, but this method requires training samples, and the performance of such a scheme worsens when speech samples outside the training set are coded. It is preferable to once again use a Lloyd-Max quantizer optimized to a Gaussian distribution. The motivation for this is that the Lloyd-Max quantizer does not require extensive training, unlike the previously mentioned methods. If the coding losses do not increase by a significant amount using Lloyd-Max quantizers, this approach can be deemed acceptable.
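The Lloyd-Max levels for a unit-variance Gaussian can be approximated without the tables of [7] by running Lloyd's iteration on a large Gaussian sample; the sketch below does this and then quantizes a normalized log-domain side-information vector. The normalization step, the sample-based iteration, and all names here are illustrative assumptions rather than the coding scheme used in the paper.

```python
import numpy as np

def lloyd_max_gaussian(n_bits, n_iter=100, n_samples=200_000, seed=0):
    """Approximate Lloyd-Max output levels for a unit-variance Gaussian
    by running Lloyd's iteration on a large Gaussian sample."""
    rng = np.random.default_rng(seed)
    samples = np.sort(rng.standard_normal(n_samples))
    L = 2 ** n_bits
    # Start from evenly spaced sample quantiles.
    levels = np.quantile(samples, (np.arange(L) + 0.5) / L)
    for _ in range(n_iter):
        thresholds = 0.5 * (levels[:-1] + levels[1:])       # nearest-neighbour rule
        cells = np.digitize(samples, thresholds)
        levels = np.array([samples[cells == j].mean()       # centroid rule
                           if np.any(cells == j) else levels[j]
                           for j in range(L)])
    return levels

# Example: two-bit quantization of (normalized) log-domain side information.
levels = lloyd_max_gaussian(n_bits=2)
log_side = np.log2(np.array([0.8, 3.5, 12.0, 40.0]))        # illustrative values
z = (log_side - log_side.mean()) / log_side.std()           # assume roughly Gaussian
quantized = levels[np.argmin(np.abs(z[:, None] - levels[None, :]), axis=1)]
```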
4 Simulations
21.43 seconds of speech from three male and three female speakers was coded using the optimal bit assignment on the transform coefficients and a Lloyd-Max quantizer on the side information (at two bits per value). The transform size used was 128 points, with sixteen support values per block (as suggested in [1]). The speech was coded at different bitrates from 2000 bits/second to 16000 bits/second. The criterion for performance is the signal-to-noise ratio (SNR). For a discrete signal x[n] of size N and a corresponding quantization error signal e[n], the SNR is defined as [10]

    \text{SNR} = \frac{\sum_{n=0}^{N-1} x^2[n]}{\sum_{n=0}^{N-1} e^2[n]}          (8)

The SNR results (in decibels) for each transform are given in Table 1, from which it is apparent that the DLT outperformed the DCT at almost all bitrates. However, when three bits per support value were used, the results given in Table 2 show the DLT outperforming the DCT only at the lowest bitrate, 2000 bits/second. Assuming no side information coding at all (i.e., assuming the side information is transmitted through a channel with more than sufficient capacity to not require coding), the results in Table 3 show the DCT outperforming the DLT at every rate but the 2000 bits/second rate. This demonstrates that the DLT consistently outperforms the DCT at 2000 bits/second for this example. Moreover, the DLT requires two bits per support value to achieve good coding results, while the DCT requires three or more bits; this is significant since the side information can occupy significant channel capacity at low bitrates. One must note that the method used to code the side information (Lloyd-Max) is more amenable to DLT side information than DCT side information. We note that vector quantization schemes can alleviate the problem of side information coding [2]; however, any VQ-based scheme suffers from extensive training, computationally intensive codebook searches, and a lack of robustness when coding data from outside the training set.

Next, the DLT and the DCT were compared with respect to coding 30 seconds of a conversation between two male speakers (8 kHz, 8 bit quantization), with the results including side-information coding at 2 bits per value given in Table 4, at 3 bits per value in Table 5, and with no side information coding in Table 6. The DLT performed significantly better than the DCT in this case for 2 bits per support value. However, the DCT outperformed the DLT with 3 bits per support value and with no side-information coding, except for the 2000 bits/second case. It is also of interest to note that there is considerable background noise in this speech sample.
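For reference, a minimal sketch of the SNR measure of (8), expressed in decibels as reported in the tables (the function name is assumed):

```python
import numpy as np

def snr_db(x, x_hat):
    """SNR of (8) in decibels for an original signal x and its coded version x_hat."""
    x = np.asarray(x, dtype=float)
    e = x - np.asarray(x_hat, dtype=float)
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(e ** 2))
```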
5 Conclusions
In this work, the DLT was compared to the DCT with respect to a set of speech samples under a classic transform-coding scheme for speech. The DLT performed comparably well at low bitrates using this type of coding. Much work remains to be done before anything conclusive can be stated. First, the DLT and DCT should be compared with respect to the coding of phonemes. Second, the DLT and DCT should be compared with respect to much larger databases (which is accomplished in [5]). Finally, vector quantization schemes should be explored with respect to the DLT.
References
[1] Rainer Zelinski and Peter Noll. "Approaches to Adaptive Transform Speech Coding at Low Bit Rates." IEEE Transactions on Acoustics, Speech and Signal Processing. Vol. 27. No. 1. February, 1979. pp. 89-95.
[2] Takehiro Moriya and Masaaki Honda. "Transform Coding of Speech using a Weighted Vector Quantizer." IEEE Journal on Selected Areas in Communications. Vol. 6. No. 2. February, 1988. pp. 425-431.
[3] Jose M. Tribolet and Ronald E. Crochiere. "Frequency Domain Coding of Speech." IEEE Transactions on Acoustics, Speech and Signal Processing. Vol. 27. No. 5. October, 1979. pp. 512-530.
[4] Rainer Zelinski and Peter Noll. "Adaptive Transform Coding of Speech Signals." IEEE Transactions on Acoustics, Speech and Signal Processing. Vol. 25. No. 4. August, 1977. pp. 299-309.
[5] Giridhar Mandyam. The Discrete Laguerre Transform: Derivation and Applications. Ph.D. Thesis. The University of New Mexico. 1995.
[6] Y. Huang and P.M. Schultheiss. "Block Quantization of Gaussian Random Variables." IEEE Transactions on Communications Systems. Vol. 11. 1963. pp. 289-296.
[7] Joel Max. "Quantizing for Minimum Distortion." IRE Transactions on Information Theory. Vol. 6. March, 1960. pp. 7-12.
[8] N.S. Jayant and Peter Noll. Digital Coding of Waveforms: Principles and Applications to Speech and Video. Englewood Cliffs, NJ: Prentice-Hall, Inc., 1984.
[9] Yunus Hussain and Nariman Farvardin. "Adaptive Block Transform Coding of Speech Based on LPC Vector Quantization." IEEE Transactions on Signal Processing. Vol. 39. No. 12. December, 1991. pp. 2611-2620.
[10] L.R. Rabiner and R.W. Schafer. Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall, Inc., 1978.
Bitrate (bits/s)   DCT (dB)   DLT (dB)
2000               1.11       2.32
4000               4.20       4.44
8000               8.08       7.73
12000              9.67       10.79
16000              10.92      12.76

Table 1: Performance of First Example at Two Bits per Support Value

Bitrate (bits/s)   DCT (dB)   DLT (dB)
2000               1.12       2.35
4000               5.38       4.37
8000               10.54      8.16
12000              13.64      11.06
16000              15.65      13.45

Table 2: Performance of First Example at Three Bits per Support Value

Bitrate (bits/s)   DCT (dB)   DLT (dB)
2000               1.13       2.32
4000               5.34       4.33
8000               11.64      8.10
12000              15.83      11.08
16000              18.85      13.57

Table 3: Performance of First Example With no Coding of Side Information

Bitrate (bits/s)   DCT (dB)   DLT (dB)
2000               0.61       2.36
4000               4.24       4.29
8000               6.08       7.48
12000              6.54       10.19
16000              7.12       12.20

Table 4: Performance of Second Example at Two Bits per Support Value

Bitrate (bits/s)   DCT (dB)   DLT (dB)
2000               0.46       2.40
4000               6.39       4.30
8000               9.62       7.79
12000              10.79      10.43
16000              11.73      12.76

Table 5: Performance of Second Example at Three Bits per Support Value

Bitrate (bits/s)   DCT (dB)   DLT (dB)
2000               0.42       2.42
4000               7.26       4.32
8000               12.03      7.73
12000              13.95      10.48
16000              16.19      12.89

Table 6: Performance of Second Example With no Coding of Side Information

Figure 1: DCT Spectrum (average log amplitude versus coefficient index for the 128-point DCT)

Figure 2: DLT Spectrum (average log amplitude versus coefficient index for the 128-point DLT)