Spectral Entropy and Coefficient Rate for Speech Coding

Jerry D. Gibson, Steven P. Stanners, and Stan A. McClellan
Department of Electrical Engineering, Texas A&M University
Abstract

Variable rate speech coding is well-suited for network and wireless communications and is necessary to maintain good speech quality and intelligibility at ever-lower rates. In 1960 Campbell [1] used a version of the asymptotic equipartition property (AEP) to derive a relationship between the entropy of the source power spectral density and the minimum coefficient rate required to encode the source. We analyze Campbell's coefficient rate expression and investigate its properties for autoregressive (AR) processes and for speech. We compare the coefficient rate to the familiar entropy rate power, and to the AIC model order criterion [10], and consider these quantities as rate indicators for dynamically varying the rate of speech coders.
1: Introduction

Variable rate speech coding is becoming the focus of numerous research efforts. One motivation for this interest in variable rate coding is the success of QUALCOMM's QCELP speech coder [2], which greatly enhances the performance of CDMA in cellular mobile radio applications. A second motivation is the realization that the evolution to ever-lower rate speech coders requires variable rate structures to sustain the necessary speech quality and intelligibility. A third motivation is the desirability of adaptively reallocating bits between source and channel coding for operation over noisy and fading channels. There are other important reasons as well, but once variable rate coding becomes a goal, the question is, what are some useful indicators or cues that tell how the rate should be varied? One important approach to developing rate indicators is to use speech classification or phonetic segmentation, where speech frames are classified as voiced, unvoiced, transitional, or silence [11]. Here we study alternative, more signal processing based rate indicators that are less speech-specific, such as the entropy rate power, the AIC model order criterion, and a new indicator, the spectral entropy. We see these measures as complementary to the speech classification methods, and both approaches will likely be combined in a final design. In Section 2 we define the entropy rate and the AIC, and in Section 3 we provide a brief development of the spectral entropy. Section 4 contains results comparing the entropy rate and the spectral entropy for autoregressive (AR) processes, and comparisons of the entropy rate, AIC, and the spectral entropy for speech are given in Section 5.
2: Entropy Rate and the AIC

To develop these rate indicators, one possible approach is to vary the rate with respect to some function of the source power spectral density (psd). One such function is the entropy power [3,4] or entropy rate power [5] of a stationary Gaussian source, denoted here as

$$Q_1 = \exp\left\{\frac{1}{2\pi}\int_{-\pi}^{\pi}\ln S(\omega)\,d\omega\right\}. \quad (1)$$

The right side of (1) is also the minimum mean squared prediction error (MMSPE) for one-step-ahead linear prediction of a stationary process with psd S(ω) [6]. For an autoregressive (AR) process, the MMSPE is easily computable [7], and for a Gaussian AR process, Q1 = MMSPE. This fact, coupled with the result that the rate distortion function for a source bandlimited to W Hz, subject to the mean squared error fidelity criterion, is lower bounded by W log2(Q1/D) [3,5], points out the importance and widespread attraction of Q1 as a performance indicator.
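As a concrete illustration of Eq. (1), the following sketch (our code, assuming numpy; the function names are ours, not the paper's) evaluates Q1 by numerical integration for a hypothetical AR(2) psd with unit-variance innovations, for which the MMSPE, and hence Q1, should be 1.

```python
# Minimal numerical sketch of Eq. (1). For an AR source driven by unit-variance
# white noise, the prediction-error filter is minimum phase, so the integral of
# ln|A|^2 vanishes and Q1 equals the innovation variance (the MMSPE).
import numpy as np

def ar_psd(a, sigma2, w):
    """psd of x(k) = a[0]x(k-1) + a[1]x(k-2) + ... + w(k), with Var[w] = sigma2."""
    k = np.arange(1, len(a) + 1)
    A = 1.0 - np.exp(-1j * np.outer(w, k)) @ np.asarray(a)  # A(e^{jw})
    return sigma2 / np.abs(A) ** 2

def entropy_rate_power(S, w):
    """Q1 of Eq. (1), via trapezoidal integration of ln S(w) over [-pi, pi]."""
    return np.exp(np.trapz(np.log(S), w) / (2.0 * np.pi))

w = np.linspace(-np.pi, np.pi, 4096)
S = ar_psd([1.0, -0.5], sigma2=1.0, w=w)   # a hypothetical stable AR(2)
print(entropy_rate_power(S, w))            # ~1.0 = innovation variance = MMSPE
```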
To address the problem of model order determination in time series analysis, Akaike [10] proposed calculating

$$\mathrm{AIC} = -2\log L(\hat\theta) + 2r, \quad (2)$$

where the model has r parameters θ, θ̂ is the maximum likelihood estimate of θ, and L(·) is the likelihood function of θ. We are primarily interested here in AR models, in which case

$$\mathrm{AIC}(r) = N\log\hat\sigma_r^2 + 2r, \quad (3)$$

where σ̂r² is the estimated mean squared prediction error. Given a frame of N data samples, AIC(r) is calculated for all r within some range, and the model order with the smallest AIC is chosen as best. Note that the first term in (3) is just Q1 (for Gaussian AR sources) and the second term is a penalty for increasing the number of parameters. In systems where the parameters must be coded and transmitted, the presence of the second term seems particularly appropriate.
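To make the order selection in Eq. (3) concrete, here is a sketch (our construction; the paper does not specify its estimator) that computes AIC(r) for a single frame via the Levinson-Durbin recursion on biased sample autocorrelations and picks the minimizing order.

```python
# AIC(r) = N log(sigma_r^2) + 2r over r = 1..max_order for one frame of data.
# Levinson-Durbin supplies the prediction error sigma_r^2 at every order.
import numpy as np

def aic_orders(x, max_order):
    N = len(x)
    r = np.array([x[:N - m] @ x[m:] / N for m in range(max_order + 1)])
    c = np.zeros(max_order + 1)     # AR coefficients c[1..m] at current order
    e = r[0]                        # order-0 prediction error (frame power)
    aics = np.empty(max_order + 1)
    aics[0] = N * np.log(e)
    for m in range(1, max_order + 1):
        k = (r[m] - c[1:m] @ r[m-1:0:-1]) / e        # reflection coefficient
        c[1:m], c[m] = c[1:m] - k * c[m-1:0:-1], k   # order-update step
        e *= (1.0 - k * k)                           # new MMSPE estimate
        aics[m] = N * np.log(e) + 2 * m              # Eq. (3)
    return aics

# Usage: synthesize one AR(2) frame and select the order with smallest AIC.
rng = np.random.default_rng(0)
noise = rng.standard_normal(2000)
x = np.zeros_like(noise)
for n in range(2, len(x)):
    x[n] = 1.0 * x[n-1] - 0.5 * x[n-2] + noise[n]
aics = aic_orders(x, 10)
print("chosen order:", int(np.argmin(aics[1:]) + 1))  # usually 2
```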
3: Spectral Entropy

Another possible indicator for dynamically varying the speech coder rate is the coefficient rate quantity derived over 30 years ago by Campbell [1]. See also Abramson [8]. For N sample functions x_i(t_i), i = 1, 2, ..., N, 0 ≤ t_i ≤ T, of a stationary random process with psd S(ω), Campbell expands the product of these sample functions in a Karhunen-Loeve expansion over the cube 0 ≤ t_i ≤ T, i = 1, 2, ..., N. Using an AEP (asymptotic equipartition property) approach [9], he shows that for N large, this expansion can be partitioned into two sets, one set with average power close to that in the original product, and another set with low average power. The number of terms or coefficients in this "typical" set is about μ = 2^{NH(S)}, where

$$H(S) = -\sum_{k=1}^{\infty}\frac{\lambda_k}{\sum_j\lambda_j}\log\frac{\lambda_k}{\sum_j\lambda_j} \quad (4)$$

and the {λ_k, k = 1, 2, ...} are the eigenvalues of the process [1,8]. Using a result of Grenander and Szego [6], Campbell shows that

$$\lim_{T\to\infty}\frac{H(S)}{T} = -\frac{1}{2\pi}\int_{-\infty}^{\infty}\hat S(\omega)\log\hat S(\omega)\,d\omega, \qquad \hat S(\omega) = \frac{S(\omega)}{\frac{1}{2\pi}\int_{-\infty}^{\infty}S(\nu)\,d\nu}, \quad (5)$$

which is a per component, per unit time version of log μ. He then argues that a natural measure of the coefficient rate of the random process is

$$Q_2 = \exp\left\{-\frac{1}{2\pi}\int_{-\infty}^{\infty}\hat S(\omega)\log\hat S(\omega)\,d\omega\right\}. \quad (6)$$

Compare Q2 to Q1 in (1). In particular, for a bandlimited psd with normalized frequency, |ω| ≤ π, we can write (with some abuse of notation, S(ω) here denoting the unit-power normalized psd)

$$Q_2 = \exp\left\{-\frac{1}{2\pi}\int_{-\pi}^{\pi}S(\omega)\log S(\omega)\,d\omega\right\}. \quad (7)$$

The entropy rate power Q1 in (1) is most used for bit allocation within a frame or block for fixed rate encoders. Does the coefficient rate Q2 offer an alternative indicator for rate allocation?

4: Q1 and Q2 for Autoregressive Processes

An autoregressive process, also called the linear prediction model, serves as the basis for the very successful class of speech coders denoted as predictive coders. We have shown that for the first-order Gauss-Markov process with psd

$$S(\omega) = \frac{1-r^2}{1-2r\cos\omega + r^2}, \qquad -1 < r < 1, \quad (8)$$

Q1 = Q2 = 1 - r² (surprisingly). However, for second-order AR processes, Q1 ≠ Q2, and indeed, neither Q1 nor Q2 is always greater than the other. See Table I. From plots of the spectral densities in Table I, it is evident that Q2 is a measure of the spread of the source psd, and that it is a different indicator than Q1. For the second-order AR sequence s(k) = a1 s(k-1) + a2 s(k-2) + w(k), Q1 and Q2 are plotted for all stable values of a1 and a2 in Figs. 1 and 2, respectively. Figure 3 is a plot of Q1 minus Q2, which makes their differences clear. Many speech frames have (a1, a2) pairs that fall in the lower deep "null" region.
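The first-order identity is stated without derivation here; as a check on the reconstructed Eqs. (1) and (7) (the short derivation below is ours, not the paper's), the minimum-phase property and the Fourier series of the psd in (8) give

$$\ln S(\omega) = \ln(1-r^2) - \ln|1-re^{-j\omega}|^2, \qquad \frac{1}{2\pi}\int_{-\pi}^{\pi}\ln|1-re^{-j\omega}|^2\,d\omega = 0 \;\Rightarrow\; Q_1 = 1-r^2,$$

and, using $S(\omega) = 1 + 2\sum_{n\ge1}r^n\cos n\omega$ together with $\ln|1-re^{-j\omega}|^2 = -2\sum_{n\ge1}(r^n/n)\cos n\omega$,

$$-\frac{1}{2\pi}\int_{-\pi}^{\pi}S(\omega)\ln S(\omega)\,d\omega = -\ln(1-r^2) - 2\sum_{n\ge1}\frac{r^{2n}}{n} = \ln(1-r^2),$$

so that Q2 = 1 - r² as well. Numerically, both indicators follow directly from the psd; the sketch below (our code, assuming numpy; not from the paper) reproduces the first-order identity and one Table I entry.

```python
# Numerical comparison of Q1 (Eq. (1)) and Q2 (Eq. (7), as reconstructed
# above) for AR psds normalized to unit power. Illustrative sketch only.
import numpy as np

w = np.linspace(-np.pi, np.pi, 1 << 14)

def unit_power(S):
    """Scale the psd so that (1/2pi) times its integral over [-pi, pi] is 1."""
    return S / (np.trapz(S, w) / (2.0 * np.pi))

def q1(S):
    return np.exp(np.trapz(np.log(S), w) / (2.0 * np.pi))       # Eq. (1)

def q2(S):
    return np.exp(-np.trapz(S * np.log(S), w) / (2.0 * np.pi))  # Eq. (7)

r = 0.5
S_ar1 = (1 - r**2) / (1 - 2 * r * np.cos(w) + r**2)             # Eq. (8)
print(q1(S_ar1), q2(S_ar1))          # both ~0.75 = 1 - r^2

A = (1 - 0.5 * np.exp(-1j * w)) ** 2                            # poles 0.5, 0.5
S_ar2 = unit_power(1.0 / np.abs(A) ** 2)
print(q1(S_ar2), q2(S_ar2))          # ~0.3375, ~0.4214; cf. Table I
```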
5: Speech Experiments
Experiments on speech data were performed to compare and contrast the differences among Q1, AIC, and Q2. To calculate the spectral entropy, we used the coefficients from the discrete cosine transform (DCT) over the frame of speech data. Figure 4(a) is a plot of a time domain speech waveform. Figures 4(b), (c), and (d) are plots of the minimum AIC(r), DCT energy, and the spectral entropy, where each quantity is calculated over 10 msec frames. Figure 4(e) plots the AR model order corresponding to the minimum AIC. These quantities are plotted in solid lines for the full speech band, in dashed lines for the lower half band, and in dotted lines for the upper half band. The spectral entropy seems to be more sensitive to variations in the speech waveform than the AIC or energy, which one can conclude by examining the last 5000 samples. The split band results reveal that the spectral entropy and the energy in the upper band are relatively flat but that the high band AIC varies rather dramatically. The spectral entropy in the low band shows significant variation that differs from the other quantities. A further separation of the low band into two bands should provide additional insights.
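The paper specifies the DCT-based computation but not the framing details; the sketch below (our code, assuming numpy and scipy, 10 msec non-overlapping frames, and entropy in bits) mirrors the described per-frame measurement, with the squared, normalized DCT coefficients playing the role of the eigenvalue distribution in Eq. (4).

```python
# Per-frame energy and spectral entropy from DCT coefficients, as a discrete
# analog of Eq. (4). Frame length and normalization are our assumptions.
import numpy as np
from scipy.fft import dct

def frame_measures(x, fs, frame_ms=10.0):
    n = int(fs * frame_ms / 1000.0)            # samples per frame
    energies, entropies = [], []
    for i in range(0, len(x) - n + 1, n):
        c = dct(x[i:i + n], norm='ortho')      # DCT of one frame
        power = c * c
        total = power.sum()
        energies.append(total)
        p = power[power > 0] / total           # coefficient "distribution"
        entropies.append(float(-(p * np.log2(p)).sum()))
    return np.array(energies), np.array(entropies)

# Usage (hypothetical): at fs = 8000 Hz, each 10 msec frame has 80 samples.
# energies, entropies = frame_measures(speech, fs=8000)
```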
Acknowledgments

This research was supported, in part, by the National Science Foundation under Grant Nos. NCR-9104566 and NCR-9303805.

References

[1] L. L. Campbell, "Minimum coefficient rate for stationary random processes," Information and Control, vol. 3, pp. 360-371, 1960.
[2] QUALCOMM, Inc., Digital Cellular System CDMA-Analog Dual-Mode Mobile Station-Base Station Compatibility Standard, March 5, 1992.
[3] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, Urbana: University of Illinois Press, 1949.
[4] R. E. Blahut, Principles and Practice of Information Theory, Reading, MA: Addison-Wesley, 1987.
[5] T. Berger, Rate Distortion Theory, Englewood Cliffs, NJ: Prentice-Hall, 1971.
[6] U. Grenander and G. Szego, Toeplitz Forms and Their Applications, Berkeley: Univ. of California Press, 1958.
[7] N. S. Jayant and P. Noll, Digital Coding of Waveforms, Englewood Cliffs, NJ: Prentice-Hall, 1984.
[8] N. Abramson, "Information theory and information storage," Proc. of Symp. on System Theory, Polytechnic Institute of Brooklyn, Brooklyn, NY, April 20-22, 1965, vol. XV, pp. 207-213.
[9] T. M. Cover and J. A. Thomas, Elements of Information Theory, New York: Wiley, 1991.
[10] H. Akaike, "Information theory and an extension of the maximum likelihood principle," Research Memorandum No. 46, Inst. of Stat. Math., Tokyo, 1971.
[11] E. Paksoy, K. Srinivasan, and A. Gersho, "Variable rate speech coding with phonetic segmentation," Proc. ICASSP '93, vol. 2, pp. 155-158, Minneapolis, April 1993.
TABLE I
Q1 and Q2 for second-order AR processes

Pole Locations        Q1       Q2
0.7575 ± j0.4221      0.1096   0.2874
0.5, 0.5              0.3375   0.4214
0.8633, -0.4633       0.4666   0.3804
-0.8633, 0.4633       0.4666   0.3804
-0.2 ± j0.6           0.7714   0.7979
Figure 1. Q1 for AR(2)

Figure 2. Q2 for AR(2)

Figure 3. Q1 - Q2 for AR(2)
Figure 4. (a) Time domain speech waveform; (b) minimum AIC(r); (c) DCT energy; (d) spectral entropy; (e) AR model order at the minimum AIC.