GlobalSIP 2015 -- Symposium on 3GPP Enhanced Voice Services
Memory-Less Gain Quantization in the EVS Codec

Vladimir Malenovsky
Department of Electrical Engineering, University of Sherbrooke, Sherbrooke, Québec J1K 2R1, Canada
Email: [email protected]

Milan Jelinek
Department of Electrical Engineering, University of Sherbrooke, Sherbrooke, Québec J1K 2R1, Canada
Email: [email protected]
Abstract—The recent standard on Enhanced Voice Services (EVS) contains two memory-less gain coding mechanisms that achieve better performance than the prediction-based techniques used in the 3GPP AMR-WB and ITU-T G.729 codecs. The EVS gain encoder uses joint vector quantization without requiring information from previous frames. Inter-frame prediction is replaced by alternative schemes based on sub-frame prediction or on an estimated average target signal energy. This eliminates the propagation of errors inside the adaptive codebook and reduces the risk of artifacts in the recovery stage after frame error concealment. The results show that the EVS codec outperforms AMR-WB at all bitrates while keeping the same number of bits for gain quantization.

Index Terms—Gain quantization, linear prediction, mean square error.
I. INTRODUCTION

The EVS codec uses the ACELP mode for efficient encoding of speech signals. The excitation in the ACELP mode is composed of two sources, the adaptive part and the fixed (innovation) part, as shown in Fig. 1. The adaptive part of the excitation is formed by a closed-loop search of the adaptive codebook which is, itself, the past excitation. The encoding process is performed on a sub-frame basis. Traditionally, the gain of the adaptive excitation is found in conjunction with the gain of the fixed excitation by means of Mean Square Error (MSE) estimation. Both gains are then quantized jointly by a two-entry Vector Quantizer (VQ). The gain of the adaptive excitation is provided to the quantizer directly, as its value is restricted to the interval [0, 1.2]. Direct quantization of the gain of the fixed excitation is inefficient because its dynamic range can be very high. To reduce the dynamic range, the gain of the fixed excitation is predicted from previous values in the logarithmic domain and transformed into a correction factor, as shown in Fig. 1. The correction factor is then quantized jointly with the gain of the adaptive excitation by the VQ. This gain quantization scheme has been widely used by several existing speech coding standards, e.g. AMR-WB [1] or G.729 [2]. Compared to direct scalar quantization of the gains, it reduces the number of bits needed by the quantizer while maintaining the same level of quantization error. In addition, the statistical distribution of the correction factor is close to Gaussian with a mean value of 1.0. This facilitates the training of the gain codebook and the conversion of codevector values into fixed-point format.
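To make the excitation model of Fig. 1 concrete, the following sketch (Python/NumPy; all signal values, filter coefficients and gains are made-up placeholders, not EVS data) forms the filtered adaptive and fixed contributions with a zero-memory synthesis filter and evaluates the error against the target signal.

    import numpy as np

    def impulse_response(a, n):
        # Truncated impulse response of the synthesis filter 1/A(z);
        # a = [1, a1, a2, ...] are the (weighted) LP coefficients.
        h = np.zeros(n)
        for i in range(n):
            acc = 1.0 if i == 0 else 0.0
            for k in range(1, min(len(a), i + 1)):
                acc -= a[k] * h[i - k]
            h[i] = acc
        return h

    def filter_zero_state(h, v):
        # Zero-state filtering of a codevector: convolution with the
        # truncated impulse response, keeping one sub-frame of output.
        return np.convolve(v, h)[:len(v)]

    # Hypothetical sub-frame data (names follow Fig. 1; values are random).
    N = 64
    rng = np.random.default_rng(0)
    d = rng.standard_normal(N)          # adaptive codevector d(n)
    c = rng.standard_normal(N)          # innovation codevector c(n)
    t = rng.standard_normal(N)          # target signal t(n)
    h = impulse_response(np.array([1.0, -0.68, 0.34]), N)

    x = filter_zero_state(h, d)         # filtered adaptive excitation x(n)
    y = filter_zero_state(h, c)         # filtered fixed excitation y(n)
    g_p, g_c = 0.8, 2.5                 # candidate adaptive and fixed gains
    e = t - g_p * x - g_c * y           # error signal e(n) of Fig. 1
    print("sub-frame error energy:", float(e @ e))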
Fig. 1. Sources of excitation signals in the ACELP mode.
However, there is one important drawback. In the decoder, when a frame is lost, the memory of the gain quantizer is used and updated during the frame error concealment. After the concealment, when the first good frame arrives, the decoder resumes normal operation. Unfortunately, the decoded correction factor is multiplied by a gain that is different from the true gain. The discrepancy can persist for several frames, thereby creating an artifact in the synthesized signal.

The paper is organized as follows. First, we describe the conventional gain prediction scheme. Section two contains the first proposed memory-less scheme based on intra-frame gain prediction and section three describes an alternative method based on the estimated target signal energy. In the last section, we show some objective and subjective test results.

II. PREDICTION OF THE GAIN OF FIXED EXCITATION

The gain of the fixed excitation is related to the absolute energy in the current frame and is quantized in the form of a correction factor to some predicted gain. In logarithmic form, this is written as

\Gamma = G_c - G_{c0},    (1)

where Γ = log(γ) is the correction factor, Gc = log(gc) is the gain of the fixed excitation and Gc0 is the predicted gain.
The predicted gain itself is calculated from the previous quantized values of the correction factor. That is

G_{c0} = \sum_{k=1}^{K_{sfrm}} \alpha_k \hat{\Gamma}_{-k} + \bar{G}_n - G_i,    (2)

where \bar{G}_n is the mean value of the normalized gain and Gi is the gain of the innovation codevector. Note that we have used the subscript -k in (2) as a reference to previous sub-frames and \hat{\Gamma} as a reference to its quantized value. The total number of sub-frames is denoted Ksfrm. The mean value of the normalized gain can be estimated by running the codec on a large database; for example, the EVS codec uses the value \bar{G}_n = 30 dB. The gain of the innovation codevector Gi is calculated as

G_i = \log\left(\frac{1}{N}\sum_{n=1}^{N} [c(n)]^2\right),    (3)

where N is the length of the current sub-frame and c(n) is the innovation codevector. Note that the index n = 1 corresponds to the first sample of the current sub-frame. The reason for the subtraction of Gi is to obtain a normalized solution, independent of the selected codevector. The prediction factors \alpha_k are usually distributed unevenly to put more weight on the recent values. For example, in the case of four sub-frames per frame, we can set \alpha_1 = 0.5, \alpha_2 = 0.4, \alpha_3 = 0.3 and \alpha_4 = 0.2.

The quantization of gp and γ is performed jointly by searching through the gain codebook. Each entry (codevector) in the gain codebook has two values corresponding to [gp, γ]. The search is exhaustive and for each codebook entry the following energy criterion is evaluated:

E = \sum_{n=1}^{N} \left[ t(n) - g_p x(n) - \gamma g_{c0} y(n) \right]^2,    (4)

where t(n) is the target signal, x(n) is the filtered adaptive excitation, y(n) is the filtered fixed excitation and gp is the gain of the adaptive excitation. The quantization algorithm selects the codevector [\hat{g}_p, \hat{\gamma}] leading to the minimal energy. Note that minimizing (4) with respect to gp and gc also has an analytical solution leading to the minimal mean square error (MMSE) between the target signal and the combined filtered excitation. The MMSE solution for the fixed excitation gain is given by

g_c = \frac{c_0 c_3 - c_1 c_4}{c_0 c_2 - c_4^2},    (5)

where, in matrix notation,

c_0 = \mathbf{x}^T\mathbf{x}, \quad c_1 = \mathbf{t}^T\mathbf{x}, \quad c_2 = \mathbf{y}^T\mathbf{y}, \quad c_3 = \mathbf{t}^T\mathbf{y}, \quad c_4 = \mathbf{x}^T\mathbf{y}    (6)

are the correlations between the target signal, the filtered adaptive excitation and the filtered fixed excitation. As an example, the correlation between the target signal and the filtered adaptive excitation is calculated as

c_1 = \sum_{n=1}^{N} t(n) x(n).    (7)
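As an illustration of the two operations above, the following sketch (Python/NumPy; the signals and the tiny gain codebook are made-up placeholders, not the trained EVS tables) computes the unquantized MMSE fixed gain of (5)-(7) and performs the exhaustive codebook search that minimizes the criterion (4).

    import numpy as np

    def mmse_fixed_gain(t, x, y):
        # Correlations of (6) and the closed-form fixed-codebook gain of (5).
        c0, c1 = x @ x, t @ x
        c2, c3, c4 = y @ y, t @ y, x @ y
        return (c0 * c3 - c1 * c4) / (c0 * c2 - c4 * c4)

    def search_gain_codebook(t, x, y, g_c0, codebook):
        # Exhaustive search of the joint [g_p, gamma] codebook, eq. (4):
        # E = sum_n (t(n) - g_p*x(n) - gamma*g_c0*y(n))^2
        best_idx, best_err = -1, np.inf
        for i, (g_p, gamma) in enumerate(codebook):
            e = t - g_p * x - gamma * g_c0 * y
            err = e @ e
            if err < best_err:
                best_idx, best_err = i, err
        return best_idx

    # Illustrative data and a tiny gain codebook (each codevector = [g_p, gamma]).
    rng = np.random.default_rng(1)
    N = 64
    t, x, y = rng.standard_normal((3, N))
    g_c0 = 2.0                                    # predicted fixed gain (linear domain)
    codebook = np.array([[0.2, 0.6], [0.6, 0.9],
                         [0.9, 1.0], [1.1, 1.4]])
    idx = search_gain_codebook(t, x, y, g_c0, codebook)
    print("MMSE g_c:", mmse_fixed_gain(t, x, y), "selected entry:", idx)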
Fig. 2. Intra-frame prediction in a memory-less gain quantizer.
III. INTRA-FRAME GAIN PREDICTION

To overcome the problem of de-synchronization between the encoder and the decoder in case of frame erasures, we have proposed two alternative gain quantization schemes that have been applied in the EVS codec. The first scheme uses intra-frame prediction of the fixed excitation gain, which replaces the inter-frame prediction. It is applied at the lowest bitrates of 7.2 and 8.0 kbps. In this scheme, no information from previous frames is used by the quantizer. This type of quantizer is referred to as "memory-less" even though it keeps an intra-frame memory containing the quantized values of correction factors from previous sub-frames. The concept of intra-frame prediction is schematically depicted in Fig. 2.

The prediction in the first sub-frame of each frame is unique. As there is no information available to predict the value of the fixed excitation gain, we have tried using some quantities provided by the signal pre-processing module of the encoder. These include, e.g., the pitch, the voicing, the frame class, the spectral values, the LSF coefficients, etc. In principle, these quantities are insensitive to the level of the input signal and thus have no potential to predict the fixed excitation gain. After running an analysis on a large database we have noted a small improvement only with the frame class. Let us denote the frame class as τ and assume it has a constant value over the whole frame, chosen from the set [0, 1, 2, ..., C] where C is the maximum number of classes used by the codec. With τ being the only predictor used in the first sub-frame, the gain of the fixed excitation can be estimated by means of linear prediction. That is

G_{c0} = a_0 + a_1 \tau - G_i,    (8)

where a_0 and a_1 are the prediction coefficients found by minimizing the quadratic form (Gc - Gc0)^2 on a large training database. In matrix notation, the minimization problem can be solved by

\frac{\partial}{\partial \mathbf{a}} \left( \mathbf{G}_c - \mathbf{P}_1 \mathbf{a} + \mathbf{G}_i \right)^2 = 0,    (9)
where a = [a_0, a_1]^T, Gc and Gi are both column vectors of length L, P_1 is an L × 2 predictor matrix and L is the total number of frames in the database. The rows of P_1 have the form [1, τ]. Note that the rows of Gc, Gi and P_1 contain quantities corresponding to the initial sub-frames of each frame in the database. The solution of (9) leads to the MMSE estimate of the prediction coefficients given by the Moore-Penrose pseudoinverse

\mathbf{a} = \left( \mathbf{P}_1^T \mathbf{P}_1 \right)^{-1} \mathbf{P}_1^T \left[ \mathbf{G}_c + \mathbf{G}_i \right].    (10)

In the second sub-frame the predictor matrix is extended by the quantized gains of the adaptive and the fixed excitation from the first sub-frame. That is

\mathbf{P}_2 = \begin{bmatrix} 1 & \tau_{[1]} & \hat{g}_{p[11]} & \hat{G}_{c[11]} \\ 1 & \tau_{[2]} & \hat{g}_{p[12]} & \hat{G}_{c[12]} \\ \vdots & \vdots & \vdots & \vdots \end{bmatrix},    (11)

where the values in brackets emphasize the sub-frame and frame index, respectively. Since the class information τ applies to an entire frame, the bracket contains only the frame index. The vector of prediction coefficients in the second sub-frame, denoted b, is again found by means of the MMSE estimation, similarly as in (10). That is

\mathbf{b} = \left( \mathbf{P}_2^T \mathbf{P}_2 \right)^{-1} \mathbf{P}_2^T \mathbf{G}_c.    (12)

Note that in the second sub-frame the gain of the innovation codevector Gi is not subtracted from the predicted gain because this quantity is inherently included in the quantized gain of the fixed excitation in the first sub-frame. The principle of extending the predictor matrix by the quantized values from previous sub-frames is followed in all subsequent sub-frames. The prediction coefficients are always found by means of the MMSE estimation. Thus, the number of columns in the matrices P_k, where k > 2, grows by two in each subsequent sub-frame. The predicted gain of (8) is used to calculate the correction factor as in (1). Finally, the correction factor is jointly quantized with the gain of the adaptive excitation using the codebook search mechanism based on the minimization of (4).
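The prediction coefficients of (10) and (12) are found offline by ordinary least squares. A minimal sketch (Python/NumPy; the training vectors are synthetic stand-ins for values collected on a real database) builds the predictor matrices and solves for a and b with the Moore-Penrose pseudoinverse.

    import numpy as np

    rng = np.random.default_rng(2)
    L = 1000                                             # frames in the training set

    # Synthetic training data standing in for database measurements:
    tau   = rng.integers(0, 4, size=L).astype(float)     # frame class in [0, C]
    G_i   = rng.normal(20.0, 3.0, size=L)                # innovation gain, eq. (3)
    G_c1  = rng.normal(30.0, 5.0, size=L)                # fixed gain, 1st sub-frame (log)
    gp_q1 = rng.uniform(0.0, 1.2, size=L)                # quantized g_p, 1st sub-frame
    Gc_q1 = G_c1 + rng.normal(0.0, 0.5, size=L)          # quantized G_c, 1st sub-frame
    G_c2  = rng.normal(30.0, 5.0, size=L)                # fixed gain, 2nd sub-frame (log)

    # First sub-frame: rows of P1 are [1, tau]; target is G_c + G_i, see (10).
    P1 = np.column_stack([np.ones(L), tau])
    a = np.linalg.pinv(P1) @ (G_c1 + G_i)

    # Second sub-frame: rows of P2 are [1, tau, g_p_hat, G_c_hat]; target is G_c, see (12).
    P2 = np.column_stack([np.ones(L), tau, gp_q1, Gc_q1])
    b = np.linalg.pinv(P2) @ G_c2

    print("a =", a)
    print("b =", b)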
IV. ESTIMATION OF TARGET SIGNAL ENERGY

In the second alternative scheme the gain of the fixed excitation is predicted based on an estimate of the target signal energy. It is applied at bitrates of 9.6 to 32 kbps. The gain of the target signal t(n) in the current sub-frame is calculated similarly to the gain of the innovation signal (3), i.e.

G_t = 10 \log_{10}\left(\frac{1}{N}\sum_{n=1}^{N} [t(n)]^2\right).    (13)

Note that the summation covers one sub-frame only. The gain of the target signal in the entire frame is then calculated by averaging over all sub-frames in that frame. That is

G_t = \frac{1}{K_{sfrm}} \sum_{k=1}^{K_{sfrm}} G_{t[k]}.    (14)
Here, we have used the index [k] as a reference to sub-frames. The target signal energy is decreased by subtracting the energy of the adaptive excitation. However, the gain of the adaptive excitation is not known until it is quantized jointly with the gain of the fixed excitation. Therefore, the energy of the adaptive excitation is only estimated from the normalized correlation. That is

G_{t0} = G_t - \frac{1}{2}\left(C_{norm1} + C_{norm2}\right),    (15)

where Cnorm1 and Cnorm2 are the normalized correlations computed twice per frame on the perceptually weighted input signal [3]. The estimated gain of the target signal is then quantized with a scalar quantizer, usually with 3-5 bits. The quantizer covers the range from approximately -8 to 65 dB and the quantization points are distributed unevenly, as shown in Fig. 3. For the case of a 5-bit quantizer, the exact positions of the quantization points have been determined by the k-means clustering algorithm on a large corpus of clean speech at levels varying from -40 dBov up to -10 dBov.

Fig. 3. Quantization levels of the estimated average target signal energy.

Finally, the predicted gain of the fixed excitation is calculated by subtracting the gain of the innovation signal. That is

G_{c0} = G_{t0} - G_i.    (16)

The predicted gain is then used to calculate the correction factor as shown in (1) and quantized jointly with the gain of the adaptive excitation using the VQ.
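A small sketch of the second scheme (Python/NumPy; the quantization levels, correlation values and signals are illustrative placeholders, not the trained EVS tables): per-sub-frame target gains (13) are averaged over the frame (14), reduced by the estimated adaptive-excitation contribution (15), scalar-quantized to the nearest level, and converted into the predicted fixed gain (16).

    import numpy as np

    def target_gain_db(t):
        # Eq. (13): gain (energy in dB) of the target signal in one sub-frame.
        return 10.0 * np.log10(np.mean(t ** 2))

    def predicted_fixed_gain(subframes, cnorm1, cnorm2, G_i, levels):
        # Eq. (14): average the per-sub-frame target gains over the frame.
        G_t = np.mean([target_gain_db(t) for t in subframes])
        # Eq. (15): remove the estimated adaptive-excitation contribution.
        G_t0 = G_t - 0.5 * (cnorm1 + cnorm2)
        # Scalar quantization of G_t0 to the nearest of the non-uniform levels.
        G_t0_q = levels[np.argmin(np.abs(levels - G_t0))]
        # Eq. (16): predicted fixed-excitation gain.
        return G_t0_q - G_i

    # Illustrative values (a uniform 5-bit grid stands in for the trained levels).
    rng = np.random.default_rng(3)
    subframes = rng.standard_normal((4, 64)) * 100.0     # four sub-frames of t(n)
    levels = np.linspace(-8.0, 65.0, 32)                 # placeholder quantizer levels
    G_i = 20.0                                           # innovation gain of eq. (3)
    G_c0 = predicted_fixed_gain(subframes, 2.0, 3.0, G_i, levels)
    print("predicted gain G_c0 [dB]:", G_c0)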
V. EVALUATION AND TESTING

The two proposed schemes were thoroughly evaluated in the recently standardized EVS codec. The evaluation consisted of a series of objective and subjective tests conducted on a large clean speech database of eight English-speaking talkers (4 male, 4 female) at three levels, -36, -26 and -16 dBov. The purpose of these tests was to compare the proposed schemes with the AMR-WB IO mode of the EVS codec, which is an enhanced and interoperable variant of the legacy 3GPP AMR-WB codec [1]. The proposed quantizers were forced to use the same number of bits as the AMR-WB codec at all tested bitrates. The LP analysis window of the AMR-WB codec was adopted and the ISF/LSF quantization was turned off for a fair comparison. All enhancements and post-processing algorithms in the AMR-WB IO mode were also turned off.
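The objective scores reported below are (segmental) SNR values. A common way to measure them, sketched here under the assumption that the synthesized signal is time-aligned with the original (Python/NumPy; the frame length and clamping limits are typical choices, not taken from the paper):

    import numpy as np

    def snr_db(ref, syn):
        # Overall SNR between the original and the synthesized signal.
        noise = ref - syn
        return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(noise ** 2))

    def segmental_snr_db(ref, syn, frame=256, lo=-10.0, hi=35.0):
        # Segmental SNR: per-frame SNR, clamped to [lo, hi] and averaged.
        vals = []
        for i in range(0, len(ref) - frame + 1, frame):
            r, s = ref[i:i + frame], syn[i:i + frame]
            num, den = np.sum(r ** 2), np.sum((r - s) ** 2)
            if num > 0.0 and den > 0.0:
                vals.append(np.clip(10.0 * np.log10(num / den), lo, hi))
        return float(np.mean(vals))

    rng = np.random.default_rng(4)
    ref = rng.standard_normal(16000)
    syn = ref + 0.1 * rng.standard_normal(16000)   # toy "coded" signal
    print("SNR:", snr_db(ref, syn), "SSNR:", segmental_snr_db(ref, syn))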
TABLE I
OBJECTIVE TEST RESULTS FOR CLEAN SPEECH

                              Raw scores                     Diff scores
Bitrate [kbps]  Meas.   AMR-WB   EVS I   EVS II   EVS I - AMR-WB   EVS II - EVS I
 6.60           SNR       7.21    7.98     8.05        0.77             0.07
                SSNR      6.48    6.93     7.03        0.45             0.10
 8.85           SNR       9.53   10.12    10.10        0.59             0.03
                SSNR      8.41    8.78     8.87        0.37             0.09
12.65           SNR      12.37   12.53    12.60        0.16             0.07
                SSNR     11.02   10.93    11.08       -0.09             0.15
18.05           SNR      15.03   14.95    15.12       -0.08             0.17
                SSNR     13.48   13.18    13.44       -0.30             0.26
23.05           SNR      16.72   16.49    16.76       -0.23             0.27
                SSNR     15.08   14.65    15.03       -0.43             0.38

[Figure: subjective test results. MUSHRA scores at 6.60 kbps (-26 dBov) for the conditions orig, AMR-WB IO, I and II; A-B preference votes (I>II, I=II, I<II) at 6.60 kbps and 7.50 kbps.]