UNIVERSAL TRELLIS CODING OF SPEECH

Vojin Šenk and Predrag Radivojac
University of Novi Sad, Department of Engineering, Trg Dositeja Obradovića 6, Novi Sad, Yugoslavia
e-mail: [email protected]; [email protected]
ABSTRACT

In order to incorporate adaptive features into a speech trellis coding system, a new universal procedure is proposed. The algorithm uses a bank of trellis encoders, each of which is adjusted to a different type of speech stationarity. A long training sequence is initially segmented, and a good long code is then found for each mode. Encoding is performed universally: the index of the best encoder and its path-map sequence are passed to the channel. Both the code search and the encoding employ the M-algorithm, which keeps a small fixed number of paths in contention. At a rate of 2 bits/sample we report a 4 dB improvement in signal-to-noise ratio over conventional trellis coding as well as over vector quantization. Approximately half of the improvement is due to the code memory extension and the rest to universality. Column distance profiles of the source codes are analyzed.
I. INTRODUCTION

Trellis coding is a proven technique for source coding. It can be considered in terms of an encoder-decoder pair. The decoder consists of a finite-state machine driving a table-lookup codebook of reproduction values. The encoder is a trellis search algorithm that chooses the channel sequence so as to minimize the distortion between the input sequence and the decoder's output sequence. Generally, there are no structural restrictions on the finite-state machine, and the most commonly used algorithm for the design of trellis source codes was proposed by Stewart et al. [1]. It uses the generalized Lloyd algorithm and requires a sufficiently large training sequence. Simulation results indicate that, for a relatively small number of states, trellis encoders perform as well as, or better than, other coding systems of the same, low complexity. Anderson and Law [2] and Stewart et al. [1] reported that increasing the constraint length beyond 4 or 5 does not improve the performance of the system. This is a logical consequence of the fact that the communication system is designed for the speech source, which is statistically ill-specified, i.e., its probabilistic model is inaccurate and incomplete. The speech waveform is considered to be generated by a quasi-stationary source: its short-time behavior is stationary and ergodic, but the modes of stationarity occasionally change [3]. It is estimated that the speech source has approximately fifty stationary modes with holding times ranging from 10 to 400 ms [2]. For such a class of sources, which, under some topological conditions on both the source and the distortion measure, can be approximated by a finite and relatively small collection of subsources, nearly optimal performance can be accomplished using universal coding.

Universal coding refers to all techniques in which the performance of a code selected without knowledge of the unknown true source converges to the optimum performance possible with a code specifically designed for the known true source. Approximating a source with a finite set of stationary ergodic subsources and forming a union code is one such technique. The basic approach to universal code construction is then to select a finite number of representative subsources and to design a separate subcode for each. If the number of subcodes is small, a significant improvement can be obtained for a negligible increase in code rate. The problem of universal coding is thoroughly analyzed in [4], [5]. Theoretically, such schemes are matched to sources where nature chooses a subsource once and for all, but the subsources are also allowed to change, provided that they switch slowly relative to the search time [4]. Unlike common linear predictive coding, adaptation to each subsource means to 1) identify it in effect and 2) design an optimal corresponding subcode. All the subcodes that form the composite code are designed (or colored) prior to communication. During communication, each codeword is accompanied by the address of the subcode used to form it. As technology advances rapidly, we consider a system with a relatively large number of subsources at a medium code rate of 2 bits/sample. Also, assuming that stationary modes can be distinguished and separated, we consider codes with large constraint length (memory), for which the rate-distortion limit, although not known exactly, can be approached. For such storage/constraint-length requirements, a structural restriction to real-number convolutional codes must be imposed.
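The rate cost of the union-code address is easy to bound. The following sketch quantifies it; the values B = 64 subcodes and R = 2 bits/sample match the system described later, while the block lengths are illustrative:

```python
# Rough rate-overhead check for a union (universal) code: a bank of B
# subcodes, blocks of L samples coded at R bits/sample, plus a
# ceil(log2(B))-bit subcode index per block as side information.
import math

def side_info_fraction(B: int, L: int, R: float) -> float:
    """Fraction of the total bit budget spent on the subcode index."""
    index_bits = math.ceil(math.log2(B))
    return index_bits / (index_bits + R * L)

# With a 64-code bank the index costs 6 bits per block; the overhead
# shrinks as the block length L grows.
for L in (50, 100, 400):
    print(L, round(side_info_fraction(64, L, 2.0), 4))
```

The point of the sketch is only that the side information becomes negligible for moderate block lengths, which is why a small bank of subcodes buys adaptivity almost for free.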
II. REAL-NUMBER CONVOLUTIONAL CODES

It is well known that the simplest class of convolutional codes for speech compression originates from least mean-square (LMS) prediction, which is derived from the source autocorrelation function. One first calculates the long-term autocorrelation and the LMS prediction of the current source symbol and then incorporates them into a coding scheme as

    v_t = Σ_{i=1}^{k} a_i v_{t−i} + x_t ,    (1)

where x_t represents the quantized difference between the source sample u_t and the first term on the right-hand side of (1), and v_t represents the reproduced symbol at time t. More precisely, this expression is not an LMS-prediction-based coding scheme; it is a coding scheme inspired by LMS prediction. Prediction values v can also be obtained using the transversal filter C(z) = 1/(1 − A(z)), where A(z) represents the Z-transform of the recursive filter with coefficients a_1, a_2, …, a_k computed from the first k lags of the autocorrelation. It follows that

    v_t = Σ_{i=0}^{∞} c_i x_{t−i} .    (2)

The memory length of c is potentially infinite, but the coefficient magnitudes in some cases fall off rapidly [2]. If c = c_0, c_1, …, c_M, the code is a real-number convolutional code, and we employ such codes in the sequel. Codes based on LMS prediction are obtained analytically and are intended for single-path search procedures; in fact, they are optimal for such procedures. For multi-path searches there are codes, obtained by computer search, that perform better.

Fig. 1. The decoder of a real-number convolutional code

Figure 1 shows the decoder of a real-number convolutional code. The coefficients of C(z) are normalized according to

    C(z) = p · C′(z),  |C′(z)| = 1 .    (3)

It is very important to note that the scheme of Fig. 1 does not generate a linear code despite its linear structure, since the alphabet X does not form a group under ordinary addition; the closure property is not satisfied. As a result, the codewords do not form a vector subspace. The analysis of such codes through their distance profiles is therefore more complex and implies the use of a super-trellis [10].

However, codes based on LMS prediction have two basic shortcomings. First, all of them have infinite memory due to the recursive structure, and cannot simply be truncated; a prefix of a good long generator is not generally a good short generator (a possible improvement is truncation with tail memory). Second, since LMS prediction, whose best approximation is DPCM, can be viewed as a single-path search procedure, better performance can be accomplished using multi-path algorithms with different codes. For such codes there is no analytical derivation, and a computer search has to be carried out. Although this is a standard optimization problem, Anderson and Law [2] devised a completely empirical procedure for finding good codes for multi-path searches. The algorithm is based on the assumption that all stationary modes can be covered by a single short code. They reported that a memory order longer than 5 cannot improve the performance of a trellis encoder. This is a result of averaging the autocorrelation function over all stationary modes, which makes the variation of the higher autocorrelation lags, and thus the variation of the coefficients of A(z) and C(z), too large. Nevertheless, such averaging is optimal for the design of a single code that covers all modes [6].

III. UNIVERSAL CODE CONSTRUCTION
To overcome this problem, we first constructed an empirical procedure for extracting the stationary parts of a training sequence [7], i.e., segments with approximately constant mean value and autocorrelation. This procedure gave us approximately wide-sense stationary training segments. With these segments identified, we designed several good longer codes (M = 10) for each training segment, using optimization techniques such as the genetic algorithm, the Anderson-Law algorithm [2], and an empirical procedure proposed in [8], with the M-algorithm as the trellis search procedure (with M = 4). Among these codes we extracted a set of representatives (subcodes), on the condition that no speech segment is encoded by a representative with a loss greater than ∆ = 1 dB compared to the code best matched to it. The steps of this algorithm are:

0) Let S be the set of wide-sense stationary training segments, C the set consisting of several good codes for each segment from S, and B an initially empty set of codes accepted into the bank. Set i = 1.
1) Take the i-th segment from S and find the code from C best matched to it. Let SNR_{C,i} be the maximal SNR over the codes in C and SNR_{B,i} the maximal SNR obtained using the codes from B. If SNR_{C,i} > SNR_{B,i} + ∆, accept the best code from C and store it in B.
2) Set i = i + 1. If i ≤ card(S), go back to 1); else stop the search and output B as the set of representatives.

Unfortunately, such a procedure does not provide the minimum possible set of representatives in a single iteration. Assume we are given two segments S = {s1, s2} and the corresponding best-matched codes C = {c1, c2}. If c2 represents segment s1 within the given distortion ∆ of the reproduced sequence v, while c1 does not represent s2 within the same distortion, then repeating the above procedure in the reverse order gives the minimum set B = {c2}. For a large cardinality of C, several different permutations may be required.
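A single greedy pass of steps 0)-2) can be sketched as follows. The helper `snr(code, segment)` — the SNR obtained when `segment` is encoded with `code` — is assumed to be given; the 1 dB threshold follows the text:

```python
DELTA = 1.0  # dB; tolerated loss versus the best-matched code

def select_bank(segments, candidates, snr):
    """One greedy pass of steps 0)-2): grow the bank B of representatives."""
    bank = []
    for seg in segments:                                   # step 1)
        snr_c = max(snr(c, seg) for c in candidates)       # SNR_{C,i}
        snr_b = max((snr(c, seg) for c in bank), default=float("-inf"))
        if snr_c > snr_b + DELTA:                          # bank too weak here
            bank.append(max(candidates, key=lambda c: snr(c, seg)))
    return bank
```

As the text notes, the result depends on the order of the segments; running the pass over several permutations of `segments` and keeping the smallest bank mirrors the two-permutation search described below.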
The minimum set satisfying these conditions contains B = card(B) = 58 codes, obtained with two permutations. Naturally, the filter bank can be extended to 64 codes in order to use the binary side information efficiently. The encoding algorithm is very simple, although time-consuming on a sequential machine. For a block of L source samples, each of the B trellis encoders in the bank finds a scaling factor that provides the best match between the input and the reproduced sequence according to the mean-squared error. The best overall code is the one with the smallest mse; the index of the best code, its scaling factor, and its path-map sequence are transmitted to the receiver. Formally, for each incoming speech segment of L samples, the following steps are repeated:
Fig. 2. Filter-bank-based source encoder

1) employing a trellis search on the trellis generated by C_i(z), find the mean-squared error mse_i (or SNR_i) and the corresponding scaling factor p_i, i = 1, …, B;
2) choose the smallest mse_i and refer to it as mse_b (the corresponding scaling factor is p_b and the path-map sequence x_b), b ∈ [1, B];
3) output the index b of the best code and the binary representations of p_b and x_b.

This encoding system is shown in Fig. 2. At rate R = 2 bits/sample (and B = 58 sets of coefficients in the bank) we performed a simulation comparing full-search memoryless vector quantization, conventional trellis coding, and our universal procedure. We used the same training and test speech sequences, 190,000 and 85,000 samples long, respectively. Standard VQ and trellis coding were simulated in order to compare the various techniques on the same set of data. The results are presented in Tables I-III and show the superiority of the new procedure for a negligibly increased code rate (6 bits per block of side information, i.e., less than 3% of the total code rate). All results are in decibels.

Table I. Full-search vector quantization, N = block length

N         1      2      3      4      5      6
SNR [dB]  7.16   10.75  12.23  12.79  13.23  13.41
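The per-block encoding steps 1)-3) can be sketched as below. This is a simplified model, not the paper's exact implementation: the 4-symbol alphabet is one illustrative choice for 4 branches per node at 2 bits/sample, and the gain p is fitted in closed form after the search, whereas the paper normalizes the taps as p·C′(z):

```python
# Sketch of the universal encoder: every code in the bank encodes the block
# with an M-algorithm trellis search, a least-squares gain p is fitted, and
# the code with the smallest mse wins.
import numpy as np

ALPHABET = np.array([-3.0, -1.0, 1.0, 3.0])  # 4 branches/node = 2 bits/sample

def reproduce(x, taps):
    """Unscaled decoder output of Fig. 1: v_t = sum_i c(i) * x_{t-i}."""
    return np.convolve(x, taps)[: len(x)]

def m_algorithm(u, taps, M=8):
    """Trellis search keeping the M lowest-distortion paths in contention."""
    paths = [(0.0, [])]                      # (accumulated squared error, x)
    for t in range(len(u)):
        extensions = []
        for err, x in paths:
            for sym in ALPHABET:
                x_new = x + [sym]
                v_t = sum(taps[i] * x_new[t - i]
                          for i in range(min(len(taps), t + 1)))
                extensions.append((err + (u[t] - v_t) ** 2, x_new))
        paths = sorted(extensions, key=lambda p: p[0])[:M]  # M survivors
    x_best = np.array(paths[0][1])
    return x_best, reproduce(x_best, taps)

def encode_block(u, bank, M=8):
    """Steps 1)-3): return the best code index b, its gain p_b and map x_b."""
    best = None
    for b, taps in enumerate(bank):
        x, v = m_algorithm(u, taps, M)
        vv = float(v @ v)
        p = float(u @ v) / vv if vv > 0.0 else 0.0   # least-squares gain
        mse = float(np.mean((u - p * v) ** 2))
        if best is None or mse < best[0]:
            best = (mse, b, p, x)
    return best[1], best[2], best[3]
```

The cost per sample is M times the alphabet size per code, times B codes, which is why the text calls the scheme simple but time-consuming on a sequential machine.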
In addition to the signal-to-noise ratio (SNR), the performance of trellis coding is reported via the tabulated segmental SNR (SEGSNR). For sampled speech, SEGSNR is calculated by grouping the data into nonoverlapping segments, computing the SNR for each segment, and taking the arithmetic average over all segments with energy above some fixed threshold T. That is,

    SEGSNR = (1/K) Σ_{k=1}^{K} SNR_k  [dB] ,    (4)

where K is the number of segments (each corresponding to 20 ms) with energy greater than T, and SNR_k is the SNR of the k-th such segment. Since the data contain no periods of
silence, a threshold of T = 0 was chosen. Segmental SNR is considered to be a more realistic measure of speech quality.

Table II. Comparison of SNRs for conventional and universal trellis coding; code memory M = 10, M-algorithm parameter M = 8

L                            50     100    150    200    250    300    350    400
Conventional trellis coding  13.96  13.87  13.91  13.21  13.89  13.88  13.94  13.43
Universal trellis coding     18.76  18.05  17.92  17.08  17.59  17.54  17.44  16.97

Table III. Comparison of SEGSNRs for conventional and universal trellis coding; code memory M = 10, M-algorithm parameter M = 8

L                            50     100    150    200    250    300    350    400
Conventional trellis coding  15.49  15.16  14.78  14.32  14.27  14.16  13.93  13.64
Universal trellis coding     19.16  18.28  17.75  17.22  17.02  16.85  16.47  16.24
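SEGSNR, as defined in (4), can be computed as in the following sketch. The segment length in samples (20 ms worth) is a parameter, and segments with zero error energy are skipped, an implementation detail the text does not specify:

```python
# SEGSNR of (4): mean SNR over fixed-length segments whose energy exceeds
# a threshold T (T = 0 here, as in the text).
import numpy as np

def segsnr_db(orig, repro, seg_len, T=0.0):
    """Average per-segment SNR in dB; `orig` and `repro` are equal-length."""
    snrs = []
    for s in range(0, len(orig) - seg_len + 1, seg_len):
        o = orig[s:s + seg_len]
        e = o - repro[s:s + seg_len]
        oo, ee = float(o @ o), float(e @ e)
        if oo > T and ee > 0.0:          # keep segments above the threshold
            snrs.append(10.0 * np.log10(oo / ee))
    return sum(snrs) / len(snrs)
```

Because every segment contributes equally regardless of its energy, SEGSNR penalizes a coder that adapts mainly to high-energy stretches of the block, which is exactly the effect discussed below.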
It can be seen from Tables I-III that the signal-to-noise ratio of conventional trellis coding is independent of L, while for universal coding it decreases with the block length. Eventually, for very large block lengths, the two procedures would reach approximately the same performance, if the conventional coder were included in the bank. This performance would also be reached by memoryless vector quantization for a sufficiently large block length N. The gain of the universal method lies in the possibility of choosing a more appropriate code for a given subsource. On the other hand, the segmental signal-to-noise ratio falls off for both the conventional and the universal coding. The reason is that the encoder searches for the path minimizing the mean-squared error; consequently, the greater the block length, the greater the adaptation to the higher-energy segments, because the scaling factor p remains fixed during a block of length L. SEGSNR is a distortion measure that penalizes such adaptation. Nevertheless, the difference in performance due to universality remains. The slight decrease in performance of the universal code with the block length L arises from the fact that more than one stationary mode can fall within the same block, causing an overall decrease in both SNR and SEGSNR. Changing the parameter M of the M-algorithm to M = 64 adds approximately 0.5 dB to the SNR obtained with M = 8.

Following [9], we performed an informal subjective MOS test. MOS scores require lengthy subjective testing, but are widely accepted as a norm for the comparative rating of different systems. Ten listeners were asked to classify the perceived level of distortion into one of the descriptive terms excellent, good, fair, poor, unsatisfactory, or into the equivalent numerical ratings in the range 5-1. Comparing the original sequence with the reproduced sequences obtained by classical and universal trellis coding, the MOS scores for the classically and universally coded sequences are 3.83 and 4.44, respectively. The reproduced sound is natural, easily understandable, and with just perceptible distortion.

To determine the individual contributions to the improvement, we also performed a simulation on 20 short stationary segments, employing the same search procedure and observing the SNRs for codes with memory M = 5 and M = 10. The greatest difference is achieved on vowels and semivowels, between 1.2 and 5.8 dB. On unvoiced segments, which are only slightly more correlated than the ambient noise, the improvement is from 0.3 to 1.9 dB. On average, the improvement is about 2 dB, which means that very nearly half of the total improvement of the proposed universal algorithm originates from the increased memory and the partitioning of speech into segments with different degrees of correlation. The rest of the improvement arises from the universal encoding.
IV. TRELLIS CODE ANALYSIS

The quality of a code in source coding can be measured by its ability to cover the space of input sequences with the smallest possible number of L-dimensional spheres and with minimal average distortion. As the number of codewords is fixed, we focus on the analysis of codes with good covering properties. The covering ability of a code with minimal distortion may be approximated by its packing ability, and the distance spectrum is used as a measure of this property. When the search conditions are favorable, the most important incorrect paths are those at distance d_free. However, if a suboptimal search algorithm is used, a very important feature is the resynchronizing ability, i.e., the ability to get back to the correct path (the one closest to the input sequence). The column distance function of the code (forward CDF, CDF_f) is its quality measure for sequential algorithms (including the M-algorithm), while the CDF of the reverse code (CDF_b) may be considered a means of estimating its resynchronizing potential [11]. Defining the column distance function only for the first M + 1 = 11 branches [12], CDF_f and CDF_b for a sample code are presented in Table IV, together with its coefficients c(i). A stationary voiced speech segment 1047 samples long was taken, and the M-algorithm with M = 16 was used for the code construction. At rate 2 bits/sample, a signal-to-noise ratio of 25.32 dB was obtained. The distortion measure is the squared Euclidean distance normalized by p².

Table IV. Column distance profiles for a sample trellis code

i           0        1        2        3        4        5        6        7        8        9        10
c(i)        -0.136   -0.310   -0.446   -0.475   -0.308   -0.073   0.171    0.348    0.358    0.255    0.117
CDF_f(i)    0.00391  0.00452  0.00460  0.00472  0.00540  0.00548  0.00577  0.00594  0.00623  0.00629  0.00710
CDF_b(i)    0.00289  0.00360  0.00363  0.00404  0.00473  0.00476  0.00480  0.00503  0.00538  0.00568  0.00572
CDF_f/b(i)  1.35     1.25     1.26     1.16     1.14     1.15     1.20     1.18     1.15     1.10     1.24
When the M-algorithm is used, the resynchronizing requirements depend on the possibility of losing the correct path. When M is small, good compression can only be achieved using codes with large CDF_f/CDF_b ratios (good resynchronization properties), while for greater M the probability of losing the correct path is smaller, so that better codes (with CDF_f/b approaching 1) can be used. CDF_f/b is defined as

    CDF_f/b = (1/(M+1)) Σ_{m=0}^{M} CDF_f(m) / CDF_b(m) .    (5)

To verify this, we performed a simulation. Taking 19 different codes and M = 16, we obtained an average CDF_f/b = 2.04. Interestingly, codes that reach a better SNR have approximately the same average ratio as codes that perform badly. For M = 1, the average CDF_f/b equals 8.28. We also analyzed the ability of the codes to cover a test sequence; the simulation results appear in Tables V and VI. Good codes designed for M = 16 do not have good resynchronizing properties, so their performance degrades drastically as M decreases. All the codes designed for M = 1 perform equally well on the test segments, but they cannot reach the abilities of the codes from Table V.
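The scalar CDF_f/b of (5) is a plain average of per-branch ratios; the sketch below evaluates it on the two CDF columns of Table IV:

```python
# CDF_f/b of (5): average ratio of the forward and backward column distance
# functions over the first M + 1 branches.

def cdf_ratio(cdf_f, cdf_b):
    """Average per-branch ratio CDF_f(m)/CDF_b(m), m = 0..M."""
    return sum(f / b for f, b in zip(cdf_f, cdf_b)) / len(cdf_f)

# The CDF_f and CDF_b columns of Table IV.
CDF_F = [0.00391, 0.00452, 0.00460, 0.00472, 0.00540, 0.00548,
         0.00577, 0.00594, 0.00623, 0.00629, 0.00710]
CDF_B = [0.00289, 0.00360, 0.00363, 0.00404, 0.00473, 0.00476,
         0.00480, 0.00503, 0.00538, 0.00568, 0.00572]
print(round(cdf_ratio(CDF_F, CDF_B), 2))
```

For the sample code of Table IV the result is close to 1.2, i.e., near the regime the text associates with large M and good compression rather than fast resynchronization.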
Table V. SNRs [dB] obtained on the training segment with M = 16 and on the test segment with various M; code memory M = 10

                          Training seq.   Test sequence
Code     CDF_f/b          M = 16          M = 16   M = 8    M = 4    M = 1
code 1   1.19             25.32           23.54    22.80    22.74    5.46
code 2   2.13             25.21           24.31    23.62    14.96    6.22
code 3   1.69             24.74           24.55    24.41    22.84    5.65
code 4   2.68             24.60           23.94    22.50    17.43    6.01
code 5   1.45             25.04           24.22    23.70    22.90    7.74
code 6   1.64             14.78           21.87    21.21    20.99    18.36
code 7   2.13             12.60           21.56    21.15    20.21    16.90
Table VI. SNRs [dB] obtained on the training segment with M = 1 and on the test segment with various M; code memory M = 10

                          Training seq.   Test sequence
Code     CDF_f/b          M = 1           M = 16   M = 8    M = 4    M = 1
code 1   3.78             20.62           22.76    22.77    22.40    20.57
code 2   13.66            20.59           22.37    22.32    22.01    19.28
code 3   3.62             21.66           16.66    16.61    16.40    15.46
code 4   14.18            20.68           22.57    22.41    21.64    19.95
code 5   4.83             20.66           23.49    23.39    22.25    20.51
code 6   2.47             9.33            21.29    21.11    21.07    19.49
code 7   4.05             11.15           22.28    22.08    21.59    20.73
V. CONCLUSION

We have presented a procedure for designing a universal trellis speech-compression system based on a bank of trellis encoders. The major shortcomings of short nonadaptive trellis codes have been overcome by an adequately designed bank of trellis encoders matched to the stationary modes of the speech waveform; the bank is needed in order to track the nonstationarity of speech. The results show an improvement in SNR of the order of 4 dB for the same rate. Approximately one half of the gain comes from increasing the code memory and designing the codes for stationary segments of speech, while the other half belongs to the filter bank. The coding system shown here is raw in the sense that it is a direct application of universal coding theory to the speech source. We believe that some kind of postfiltering and the use of a spectrally shaped distortion measure could improve the subjective quality.
REFERENCES
[1] L.C. Stewart, R.M. Gray, Y. Linde, "The Design of Trellis Waveform Coders," IEEE Transactions on Communications, vol. COM-30, no. 4, April 1982.
[2] J.B. Anderson, C.W. Law, "Real-Number Convolutional Codes for Speech-Like Quasi-Stationary Sources," IEEE Transactions on Information Theory, vol. IT-23, pp. 778-782, Nov. 1977.
[3] J.L. Flanagan, Speech Analysis, Synthesis, and Perception, 2nd ed., New York: Springer-Verlag, 1972.
[4] R.M. Gray, "Time-Invariant Trellis Encoding of Ergodic Discrete-Time Sources with a Fidelity Criterion," IEEE Transactions on Information Theory, vol. IT-23, no. 1, Jan. 1977.
[5] A.J. Viterbi, J.K. Omura, Principles of Digital Communication and Coding, New York: McGraw-Hill, 1979.
[6] J.B. Anderson, J.B. Bodie, "Tree Encoding of Speech," IEEE Transactions on Information Theory, vol. IT-21, pp. 379-387, July 1975.
[7] M. Savic, P. Radivojac, "A Heuristic Procedure for Extracting Stationary Modes of Speech," Proceedings of the XLI Yugoslav ETRAN Conference, vol. 2, pp. 200-202, Zlatibor, 1997.
[8] P. Radivojac, S. Vucetic, "An Improved Trellis Code Search Procedure for Speech Compression," Proceedings of the XLI Yugoslav ETRAN Conference, vol. 2, pp. 203-205, Zlatibor, 1997.
[9] S. Wang, A. Sekey, A. Gersho, "An Objective Measure for Predicting Subjective Quality of Speech Coders," IEEE Journal on Selected Areas in Communications, vol. 10, no. 5, June 1992.
[10] J.B. Anderson, T. Aulin, C.E. Sundberg, Digital Phase Modulation, New York: Plenum Press, 1986.
[11] T. Aulin, "Breadth-First Maximum Likelihood Sequence Detection," submitted to IEEE Transactions on Information Theory, 1992.
[12] R. Johannesson, E. Paaske, "Further Results on Binary Convolutional Codes with an Optimum Distance Profile," IEEE Transactions on Information Theory, vol. IT-19, pp. 718-727, Nov. 1978.