IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 23, NO. 8, AUGUST 2015
1309
Optimal Coding of Generalized-Gaussian-Distributed Frequency Spectra for Low-Delay Audio Coder With Powered All-Pole Spectrum Estimation
Ryosuke Sugiura, Student Member, IEEE, Yutaka Kamamoto, Member, IEEE, Noboru Harada, Member, IEEE, Hirokazu Kameoka, Member, IEEE, and Takehiro Moriya, Fellow, IEEE
Abstract—We present an optimal coding scheme that parameterizes the maximum-likelihood estimate of variance for frequency spectra belonging to the generalized Gaussian distribution, a class of distributions that covers the Laplacian and the Gaussian. By slightly modifying the all-pole model of the conventional linear prediction (LP), we can estimate the variance with the same method as in LP, which has low computational costs. Experimental results show that incorporating the coding scheme in a state-of-the-art wide-band audio coder enhances its objective and subjective quality in a low-bit-rate and low-delay situation by increasing the compression efficiency. Thus, this coding scheme will be useful in applications like mobile communications, which require highly efficient compression.
Index Terms—Arithmetic coding, audio compression, generalized Gaussian distribution, linear prediction, low delay, transform coded excitation.
I. INTRODUCTION
IN SPEECH communication tools such as Voice over IP (VoIP) and the emerging Voice over LTE (VoLTE), one of the most important factors affecting quality is how the speech signals are compressed. With the advent of recent codecs such as 3GPP Extended Adaptive Multi-Rate Wide Band (AMR-WB+) [1], [2], MPEG-D Unified Speech and Audio Coding (USAC) [3]–[5], and 3GPP Enhanced Voice Services (EVS) [6], a coding paradigm has developed that adaptively switches modes between time-domain coding, which is specialized for speech and has been used in many conventional speech coders, and frequency-domain coding, which can easily incorporate perceptual modeling and maintain the quality of music-like signals, inputs on which time-domain coding tends to have difficulty performing well. This paradigm enables the coders to avoid the weaknesses of the individual coding methods in each domain and significantly enhances their efficiency. However, mobile communications require coders to operate at a low bit rate with low delay, which is a handicap especially for the frequency-domain coder since it needs a delay to obtain sufficient frequency resolution for exploiting its merits of energy concentration and perceptual modeling. Therefore, in this paper, we aim to enhance the quality of the frequency-domain coding in low-bit-rate and low-delay situations by focusing on the statistical properties of frequency spectra and incorporating an efficient coding scheme, with a simple algorithm, for the spectra belonging to a specific class of distributions.
Manuscript received November 28, 2014; revised March 23, 2015; accepted May 07, 2015. Date of publication May 11, 2015; date of current version May 27, 2015. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Rongshan Yu.
The authors are with the Nippon Telegraph and Telephone Corporation, Kanagawa 243-0198, Japan (e-mail: [email protected]; [email protected]; [email protected]; kameoka@hil.t.u-tokyo.ac.jp; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASLP.2015.2431851
In designing coders, we can choose the distributions to which the coding target is assumed to belong, and the more closely the actual target follows the chosen distribution, the more efficiently we can compress it by using entropy coding, which optimally allocates the bits according to the distribution. Here, the Laplacian distribution has been commonly used for audio coders; for example, USAC and EVS use it with arithmetic coding for entropy coding, and SHORTEN [7], [8], MPEG-4 Audio Lossless Coding (ALS) [9]–[16], and ITU-T G.711.0 [17]–[19] incorporate Laplacians by using Golomb-Rice coding [20]–[22] for the bit allocations. The Laplacian has certain merits stemming from its simplicity, but the distribution of the target may vary depending on the acoustic properties of the target and on the conditions of the coding, for example, the frame length used for the frequency representation. To fully use the statistical properties of the target, we should be able to deal with a wider variety of distributions. The generalized Gaussian is a class of such distributions, and it is used in image and video analysis [23]–[26]. It has a shape parameter that makes the distribution represent a Gaussian or a Laplacian in specific cases, and previous studies on audio [27], [28] applied it to frequency spectra obtained by performing the Modified Discrete Cosine Transform (MDCT) on audio signals.
These studies assumed that the variance of the generalized Gaussian is uniform over the frequency bins, which usually does not match natural signals, whose energy tends to concentrate at some frequencies. The state-of-the-art frequency-domain coder, Transform Coded eXcitation (TCX) [29], [30], assumes a zero-mean Laplacian with a non-uniform variance over the frequency bins, with no covariance between the bins, when performing entropy coding. Coding the variance of every frequency bin individually is impractical, and thus TCX estimates the variance by using the spectral envelope, which is represented by the all-pole model of linear prediction (LP), or the auto-regressive model, and is used for another purpose in the coder as well: controlling the quantization noise according to a perceptual model. Fig. 1 illustrates the bit allocation of this entropy coding. Although this estimation of variance is valid to some extent, the theoretical investigations in this paper reveal that the envelope extracted by the conventional LP from the spectra of the target signal is not optimal for estimating the variance of the Laplacian.
The main contribution of this study is a coding scheme for generalized-Gaussian-distributed frequency spectra, or MDCT coefficients. It is based on the idea of TCX, which estimates the variance for each frequency bin from the spectral envelope as in Fig. 1, and on the idea of Powered All-Pole Spectrum Estimation (PAPSE), an optimal and simple method to extract the envelope from the spectra, in other words, to parameterize the variance. The model of the envelope introduced here is slightly different from that of the conventional LP, but using almost the same method, we can calculate model parameters optimized for the generalized Gaussian distribution, which includes the Laplacian. Moreover, the envelope extracted by this method can still be used as the conventional envelope, so that by employing this method instead of the conventional LP, we do not need to use extra bits to encode the model parameters.
This paper is organized as follows. First, we briefly review and interpret the conventional LP in Section II. Then, in Section III, we describe the variance model and the variance parameterization optimized for the generalized Gaussian. A description of how the method is integrated into the actual coder is given in Section IV, and an evaluation of the coder is presented in Section V. Since our aim is to use this coding scheme in hybrid coders, which adaptively switch between time-domain and frequency-domain coding, the evaluations mainly use music signals, which are the major target of frequency-domain coding. Finally, we discuss the merits and limitations of the method in Section VI.
Fig. 1. Illustration of entropy coding assuming a distribution with a non-uniform variance and no covariance. The spectral envelope extracted from the input MDCT coefficients gives the variance of the distribution for the coefficient in each frequency bin. Bits are allocated for each MDCT coefficient according to its likelihood.
2329-9290 © 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
II. INTERPRETATION OF CONVENTIONAL LP
The LP models the signals with the all-pole representation:
$$H_k = \frac{\sigma}{\left|1+\sum_{n=1}^{p} a_n e^{-j\omega_k n}\right|} \tag{1}$$
where $\{a_n\}_{n=1}^{p}$ are the model parameters, or LP coefficients, $\omega_k$ is the angular frequency of frequency bin $k$, and $\sigma^2$ is the prediction gain, namely the variance of the prediction residuals. The model parameters that minimize the squared errors of the prediction residuals in the time domain are estimated from the input signals by solving the Yule-Walker equation:
$$\begin{pmatrix} r_0 & r_1 & \cdots & r_{p-1}\\ r_1 & r_0 & \cdots & r_{p-2}\\ \vdots & \vdots & \ddots & \vdots\\ r_{p-1} & r_{p-2} & \cdots & r_0 \end{pmatrix}\begin{pmatrix} a_1\\ a_2\\ \vdots\\ a_p \end{pmatrix} = -\begin{pmatrix} r_1\\ r_2\\ \vdots\\ r_p \end{pmatrix} \tag{2}$$
where $r_n$ is the auto-correlation function of the input signal. $r_n$ can be obtained either from the correlation in the time domain or by inverse cosine transforming the power spectra of the input signal. In this paper, we consider the latter type to be the conventional LP. Equation (2) is a special case of the Toeplitz-type equations, and it can be efficiently solved with the Levinson-Durbin algorithm, which at the same time makes it easy to check the stability of the model parameters.
The model of eq. (1) represents the spectral envelope of the input signal in the frequency domain. The minimization problem of the LP, i.e., the minimization of the prediction errors, with frame length $N$ can be approximately written in another form in the frequency domain when $N$ is sufficiently large:
$$\min_{\{a_n\},\,\sigma^2}\ \sum_{k=0}^{N-1} D_{\mathrm{IS}}\left(|X_k|^2 \,\middle|\, H_k^2\right) \tag{3}$$
(see Appendix A for the proof), where $H_k$ and $X_k$ are respectively the spectral envelope and the frequency spectrum of the signal for each frequency bin $k$, and
$$D_{\mathrm{IS}}(x \mid y) = \frac{x}{y} - \ln\frac{x}{y} - 1 \tag{4}$$
is the Itakura-Saito (IS) divergence of $y$ from $x$.¹ $j$ and $\ln$ indicate the imaginary unit and the natural logarithmic function, respectively. For each frequency bin $k$, the terms in the objective function of eq. (3) relating to $H_k$ correspond to the negative log-likelihood of a Gaussian distribution with variance $H_k^2$:
$$-\ln\left[\frac{1}{\sqrt{2\pi H_k^2}}\exp\left(-\frac{|X_k|^2}{2H_k^2}\right)\right] = \frac{1}{2}\left(\frac{|X_k|^2}{H_k^2} + \ln H_k^2\right) + \frac{1}{2}\ln 2\pi \tag{5}$$
This leads to the following interpretation of the LP in the frequency domain: the LP assumes that the magnitude spectra $|X_k|$ belong to a Gaussian with non-uniform variance $H_k^2$ over the frequency bins, as in eq. (5), and parameterizes, using LP coefficients, the maximum-likelihood estimate of variance for all the frequency bins, which forms the spectral envelope, within the constraint of the all-pole model given in eq. (1). This is the reason why we mentioned in the previous section that the spectral envelope extracted from the spectra by the conventional LP is not optimal for estimating the variance of the Laplacian.
¹We write $D_{\mathrm{IS}}(x \mid y)$ here instead of $D_{\mathrm{IS}}(x, y)$ just to emphasize that the IS divergence is asymmetric.
In the following section, we extend the range of situations in which the LP is applied to spectra assumed to belong to the generalized Gaussian distribution, and describe a method to extract from the inputs the spectral envelope which represents the maximum-likelihood estimate of variance of the distribution for all the frequency bins. The key idea lies in the fact that the minimization problem defined in eq. (3), i.e., the minimization of the IS divergence of the all-pole model in eq. (1), can be solved via eq. (2), so that if we are able to write an optimization problem in the form of eq. (3), it can be easily optimized by the Levinson-Durbin algorithm. Moreover, it should be noted that the model parameters obtained from eq. (2) provide the problem of eq. (3) with a global optimum.
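As a rough illustration of the conventional LP reviewed in this section, the following sketch obtains the auto-correlation by inverse cosine transforming a power spectrum and then solves the Yule-Walker equation of eq. (2) with the Levinson-Durbin recursion. The function names, the bin frequencies $\omega_k = \pi(k+1/2)/N$, and the $1/N$ normalization are assumptions made for this example and are not taken from the paper.

```python
import math


def autocorr_from_power(power, order):
    """Inverse cosine transform of a power spectrum.

    Assumes bin centres omega_k = pi * (k + 0.5) / N and a 1/N
    normalisation; both are illustrative choices of this sketch.
    """
    n_bins = len(power)
    return [
        sum(p * math.cos(math.pi * (k + 0.5) * n / n_bins)
            for k, p in enumerate(power)) / n_bins
        for n in range(order + 1)
    ]


def levinson_durbin(r, order):
    """Solve the Toeplitz system R a = -r (eq. (2)) recursively.

    Returns (a, err, refl): a = [1, a_1, ..., a_p], the final
    prediction-error power, and the reflection coefficients.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    refl = []
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        refl.append(k)
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
    return a, err, refl


def is_stable(refl):
    """The all-pole filter is stable when every |reflection coefficient| < 1."""
    return all(abs(k) < 1.0 for k in refl)


# Example: a low-pass-like power spectrum analysed with order 2.
power = [64.0, 36.0, 25.0, 9.0, 4.0, 1.0, 0.25, 0.0625]
r = autocorr_from_power(power, 2)
coeffs, gain, refl = levinson_durbin(r, 2)
print(coeffs, gain, is_stable(refl))
```

The reflection coefficients come for free from the recursion, which is why the stability check mentioned above costs nothing extra.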
III. OPTIMAL CODING SCHEME
A. Generalized Gaussian Distribution
The zero-mean generalized Gaussian distribution of a random variable $X$ with variance $\phi^2$ is defined as [23]–[28]:
$$f_{\mathrm{GG}}(X \mid \phi, \alpha) = \frac{A(\alpha)}{\phi}\exp\left(-\left|B(\alpha)\frac{X}{\phi}\right|^{\alpha}\right) \tag{6}$$
where $\alpha$ is the shape parameter of this distribution, and
$$A(\alpha) = \frac{\alpha B(\alpha)}{2\Gamma(1/\alpha)}, \qquad B(\alpha) = \sqrt{\frac{\Gamma(3/\alpha)}{\Gamma(1/\alpha)}} \tag{7}$$
with the gamma function:
$$\Gamma(x) = \int_0^{\infty} e^{-t}t^{x-1}\,dt \tag{8}$$
By changing the shape parameter $\alpha$, we can represent the Laplacian at $\alpha = 1$ and the Gaussian at $\alpha = 2$. In the following discussion, we assume a non-uniform variance $\phi_k^2$ for each random variable, or MDCT coefficient, $X_k$ with a fixed $\alpha$, and derive a parameterization of the maximum-likelihood estimate of $\phi_k^2$ for given spectra.
B. Optimization Algorithm
Here we consider the spectra $\{X_k\}$ to be real values since they are MDCT coefficients, and optimize the likelihood of their variance. For this optimization, we have several design choices:
• What kind of model is used to represent the spectral envelope;
• How the envelope is associated with the variance;
• How the envelope is extracted from the input spectra.
For example, TCX uses the all-pole model of eq. (1) for the envelope, assumes the variance for each frequency bin to be the same as the power of the envelope, and extracts the envelope from the input spectra by the Levinson-Durbin algorithm, but this is not optimal for the Laplacian. The previous studies [14], [15] are other examples; they proposed LPs for optimizing the bit length for the Golomb-Rice coding of the prediction residuals in the time domain on the basis of minimizing the norm of the residuals. This idea, if we interpret it in the frequency domain, uses the all-pole model for the envelope, equivalently assumes the variance for each frequency bin to be the same as the power of the envelope, and extracts the envelope from the input by a numerical method (called the auxiliary function method) to optimize for the assumed Laplacian in the time domain. That is to say, the algorithm for extracting the envelope depends on the model of the envelope, and if we fix the algorithm, we have to change the model of the envelope or the relation between the envelope and the variance in order to optimize for the assumed distribution.
Considering its use in a coder, the algorithm for extracting the envelope from the input should be as simple and low-cost as possible. With this in mind, we aim at designing an envelope model that enables us to use the Levinson-Durbin algorithm. To design such a model, we have to transform the optimization problem of the likelihood into the form of eq. (3). The negative log-likelihood $L$ for $\{X_k\}$, which relates to the ideal description length, assuming the generalized Gaussian with variance $\phi_k^2$ for each frequency bin is written as
$$L = -\sum_{k=0}^{N-1}\ln f_{\mathrm{GG}}(X_k \mid \phi_k, \alpha) = \sum_{k=0}^{N-1}\left(\left|\frac{B(\alpha)X_k}{\phi_k}\right|^{\alpha} + \ln\phi_k - \ln A(\alpha)\right) \tag{9}$$
and this equation can be transformed as
$$L = \frac{1}{\alpha}\sum_{k=0}^{N-1} D_{\mathrm{IS}}\left(\alpha B^{\alpha}(\alpha)|X_k|^{\alpha} \,\middle|\, \phi_k^{\alpha}\right) + C \tag{10}$$
where $C$ is a constant for $\phi_k$. Thus, by modeling the variance $\phi_k^2$ with a powered all-pole representation as
$$\phi_k^2 = \left(\frac{\alpha B^{\alpha}(\alpha)\,\sigma^2}{\left|1+\sum_{n=1}^{p} a_n e^{-j\omega_k n}\right|^2}\right)^{2/\alpha} \tag{11}$$
the model parameters $\{a_n\}$, which minimize the negative log-likelihood $L$, and $\sigma^2$, the prediction gain, can be expressed as
$$\{a_n\} = \arg\min_{\{a_n\}}\ \sum_{k=0}^{N-1}|X_k|^{\alpha}\left|1+\sum_{n=1}^{p} a_n e^{-j\omega_k n}\right|^2 \tag{12}$$
and
$$\sigma^2 = \frac{1}{N}\sum_{k=0}^{N-1}|X_k|^{\alpha}\left|1+\sum_{n=1}^{p} a_n e^{-j\omega_k n}\right|^2 \tag{13}$$
These can be determined by the Levinson-Durbin algorithm by regarding $|X_k|^{\alpha}$ as power spectra, in other words, using for $r_n$ in eq. (2) the inverse cosine transform of $|X_k|^{\alpha}$,
$$r_n = \frac{1}{N}\sum_{k=0}^{N-1}|X_k|^{\alpha}\cos(\omega_k n) \tag{14}$$
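The estimation described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it raises the magnitudes to the power $\alpha$, treats the result as a power spectrum for the Levinson-Durbin recursion, and converts the all-pole fit into per-bin variances. The function names, the bin frequencies $\omega_k = \pi(k+1/2)/N$, and the convention of folding the constant $\alpha B^{\alpha}(\alpha)$ into the variance rather than into the auto-correlation are assumptions of this sketch.

```python
import math


def levinson_durbin(r, order):
    """Solve the Yule-Walker equation; returns ([1, a_1..a_p], gain)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
    return a, err


def gg_b(alpha):
    """B(alpha) of the generalized Gaussian (unit-variance normalisation)."""
    return math.sqrt(math.gamma(3.0 / alpha) / math.gamma(1.0 / alpha))


def papse_variance(spectrum, alpha, order):
    """Per-bin variance estimate from a powered all-pole fit (sketch)."""
    n_bins = len(spectrum)
    omegas = [math.pi * (k + 0.5) / n_bins for k in range(n_bins)]
    powered = [abs(x) ** alpha for x in spectrum]  # |X_k|^alpha as "power"
    r = [sum(p * math.cos(w * n) for p, w in zip(powered, omegas)) / n_bins
         for n in range(order + 1)]
    a, sigma2 = levinson_durbin(r, order)
    scale = alpha * gg_b(alpha) ** alpha
    variances = []
    for w in omegas:
        re = sum(a[n] * math.cos(w * n) for n in range(order + 1))
        im = sum(a[n] * math.sin(w * n) for n in range(order + 1))
        all_pole = sigma2 / (re * re + im * im)
        variances.append((scale * all_pole) ** (2.0 / alpha))
    return a, sigma2, variances


def gg_nll(spectrum, variances, alpha):
    """Negative log-likelihood (in nats) under per-bin variances."""
    b = gg_b(alpha)
    log_a = math.log(alpha * b / (2.0 * math.gamma(1.0 / alpha)))
    return sum((b * abs(x) / math.sqrt(v)) ** alpha + 0.5 * math.log(v) - log_a
               for x, v in zip(spectrum, variances))


# Example: a spectrum whose energy is concentrated at low frequencies.
tilted = [8.0, -6.0, 5.0, -3.0, 2.0, -1.0, 0.5, -0.25]
_, _, var_uniform = papse_variance(tilted, 1.0, 0)
_, _, var_order2 = papse_variance(tilted, 1.0, 2)
print(gg_nll(tilted, var_uniform, 1.0), gg_nll(tilted, var_order2, 1.0))
```

With order zero the estimate reduces to the maximum-likelihood uniform variance: the mean square for $\alpha = 2$ and twice the squared mean magnitude for $\alpha = 1$. An all-zero frame would make the recursion divide by zero, so a practical coder would need a guard that this sketch omits.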
The operation in eq. (14) assumes there is a signal which has exactly the same power spectra as $|X_k|^{\alpha}$ and implicitly applies the LP to the signal only by using the information in its spectra. Moreover, the model parameters obtained above make $\sigma^2/\left|1+\sum_{n=1}^{p} a_n e^{-j\omega_k n}\right|^2$ match $|X_k|^{\alpha}$ so that
$$H_k = \left(\frac{\sigma^2}{\left|1+\sum_{n=1}^{p} a_n e^{-j\omega_k n}\right|^2}\right)^{1/\alpha} \tag{15}$$
represents the spectral envelope of $|X_k|$. This results in the following relationship between the envelope and the variance:
$$\phi_k^2 = \alpha^{2/\alpha}B^2(\alpha)\,H_k^2 \tag{16}$$
Hereinafter, taking into account that the alternative model of variance is represented by a powered version of the conventional all-pole filter, we will call this parameterization method of the variance, or extraction of the spectral envelope from the input spectra, Powered All-Pole Spectrum Estimation (PAPSE).
When the shape parameter of the generalized Gaussian takes $\alpha = 2$, meaning a Gaussian, the maximum-likelihood estimate of variance is calculated from eq. (11) and
$$\alpha B^{\alpha}(\alpha)\Big|_{\alpha=2} = 2\,\frac{\Gamma(3/2)}{\Gamma(1/2)} = 1 \tag{17}$$
as
$$\phi_k^2 = \frac{\sigma^2}{\left|1+\sum_{n=1}^{p} a_n e^{-j\omega_k n}\right|^2} \tag{18}$$
which matches the conventional LP. Additionally, when the shape parameter takes $\alpha = 1$, meaning a Laplacian, the extraction of the envelope from the input is the same as the one in our previous study on optimizing Golomb-Rice coding [31]. The inverse cosine transform of $|X_k|^{\alpha}$ in eq. (14), which appears in the algorithm, is known as the zero phase signal representation [32]–[34] in the noise reduction context, and its application to LP has been studied in other situations for the sake of its robustness against noise when a small $\alpha$ is used [35].
IV. APPLICATION
A. Arithmetic Coding in TCX
The coding scheme presented in the previous section is designed to optimize the entropy coding for the generalized-Gaussian-distributed spectra. This coding can be realized in several ways: for example, through Golomb-Rice coding as in [31] if the shape parameter is $\alpha = 1$, or through arithmetic coding as in TCX [29], [30], the idea of which is reviewed in [36]. Here, to apply this optimization to various cases of $\alpha$, we focus on the TCX coder with range-coder-based arithmetic coding [16].
The encoder of TCX is, broadly speaking, composed of four processes: MDCT, which represents the input in the frequency domain; LP analysis, which extracts the spectral envelope from the MDCT coefficients; a scalar quantizer, which quantizes the spectra non-uniformly over the frequency bins according to perceptual weights; and the arithmetic coding, which assumes for the quantized MDCT coefficients a Laplacian with a variance based on the envelope as in Fig. 1. Before the coder performs the arithmetic coding, the target MDCT coefficients are scalar quantized for each frequency bin with the quantization step:
$$s\,W_k = \frac{s}{\left|1+\sum_{n=1}^{p} a_n\rho^{n} e^{-j\omega_k n}\right|} \tag{19}$$
where $\{a_n\}$ are the LP coefficients of the input signal, and $s$ is the scale of quantization, i.e., the minimum value for the given bit rate, decided by a bisection search. $W_k$, the spectral envelope smoothed by a constant $\rho$, is an approximation of the perceptual weights, which are the weights that shape the quantization noise on the basis of the masking effect to make it less annoying, and is often used [1], [29], [30], [36]. Therefore, ignoring the rounding operation, the target of the arithmetic coding is written as
$$Y_k = \frac{X_k}{s\,W_k} \tag{20}$$
and its envelope is represented by
$$\tilde{H}_k = \frac{H_k}{s\,W_k} \tag{21}$$
The coder allocates the bit length $b_k$ for each $Y_k$ according to its envelope $\tilde{H}_k$, as follows:
$$b_k = \begin{cases} -\log_2 F_{\mathrm{L}}\left(\tfrac{1}{2} \,\middle|\, \tilde{H}_k\right) & (Y_k = 0)\\[4pt] -\log_2\left\{F_{\mathrm{L}}\left(|Y_k|+\tfrac{1}{2} \,\middle|\, \tilde{H}_k\right) - F_{\mathrm{L}}\left(|Y_k|-\tfrac{1}{2} \,\middle|\, \tilde{H}_k\right)\right\} + 1 & (Y_k \neq 0) \end{cases} \tag{22}$$
where $F_{\mathrm{L}}$ is defined as a one-sided cumulative distribution function of the Laplacian with a variance depending on the value of the envelope:
$$F_{\mathrm{L}}(y \mid \tilde{H}_k) = 1 - \exp\left(-\frac{\sqrt{2}\,y}{\tilde{H}_k}\right) \tag{23}$$
and the additional 1 bit in eq. (22) is for the sign of each MDCT coefficient.
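The Laplacian bit allocation described above can be illustrated with the following sketch of the ideal code length of one quantized coefficient. It assumes that the standard deviation of the Laplacian equals the envelope value, that a zero coefficient carries no sign bit, and that nonzero coefficients spend one extra bit on the sign; the function names are hypothetical, and an actual arithmetic coder would only approach these lengths.

```python
import math


def laplace_survival(y, envelope):
    """P(|Y| > y) for a zero-mean Laplacian whose standard deviation is
    `envelope`; this is one minus the one-sided CDF."""
    if y <= 0.0:
        return 1.0
    return math.exp(-math.sqrt(2.0) * y / envelope)


def laplace_bits(q, envelope):
    """Ideal bit length of the integer q (a rounded, scaled coefficient).

    The interval probability is computed from survival functions, which is
    equal to the CDF difference but stays nonzero in the far tail.
    """
    mag = abs(q)
    prob = laplace_survival(mag - 0.5, envelope) - laplace_survival(mag + 0.5, envelope)
    bits = -math.log2(prob)
    # One additional bit for the sign of a nonzero coefficient.
    return bits + 1.0 if mag > 0 else bits


# Example: the same quantized value is cheap where the envelope is high
# and expensive where the envelope is low.
for env in (0.5, 2.0, 8.0):
    print(env, [round(laplace_bits(q, env), 2) for q in (0, 1, 4)])
```

Because the implied probabilities of all signed integers sum to one, these lengths satisfy the Kraft inequality with equality, which is what makes them attainable by arithmetic coding in the limit.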
B. Applying Optimal Coding
To apply the coding scheme in Section III to arithmetic coding, we first show the relationship between the code length of arithmetic coding, which assumes a generalized Gaussian, and the negative log-likelihood of eq. (9). If we assume the variance of each frequency bin to be $\phi_k^2$ for the generalized Gaussian with fixed $\alpha$ and scale $s$, the bit allocation for weights $W_k$ of the arithmetic coding will be
$$b_k = \begin{cases} -\log_2 F_{\mathrm{GG}}\left(\tfrac{1}{2} \,\middle|\, \tfrac{\phi_k}{sW_k}, \alpha\right) & (Y_k = 0)\\[4pt] -\log_2\left\{F_{\mathrm{GG}}\left(|Y_k|+\tfrac{1}{2} \,\middle|\, \tfrac{\phi_k}{sW_k}, \alpha\right) - F_{\mathrm{GG}}\left(|Y_k|-\tfrac{1}{2} \,\middle|\, \tfrac{\phi_k}{sW_k}, \alpha\right)\right\} + 1 & (Y_k \neq 0) \end{cases} \tag{24}$$
where $F_{\mathrm{GG}}$ is defined as
$$F_{\mathrm{GG}}(y \mid \phi, \alpha) = \frac{\gamma\left(1/\alpha,\ \left(B(\alpha)\,y/\phi\right)^{\alpha}\right)}{\Gamma(1/\alpha)} \tag{25}$$
with $B(\alpha)$ of eq. (7) and the incomplete gamma function:
$$\gamma(a, x) = \int_0^{x} e^{-t}t^{a-1}\,dt \tag{26}$$
In the case of $Y_k \neq 0$, using the fact that $F_{\mathrm{GG}}$ depends only on the ratio of $y$ to $\phi$, the bit allocation of eq. (24) can be rewritten as
$$b_k = -\log_2\left\{F_{\mathrm{GG}}\left(|X_k|+\tfrac{sW_k}{2} \,\middle|\, \phi_k, \alpha\right) - F_{\mathrm{GG}}\left(|X_k|-\tfrac{sW_k}{2} \,\middle|\, \phi_k, \alpha\right)\right\} + 1 \tag{27}$$
and when the quantization is fine enough, or $sW_k \ll \phi_k$, the first-order approximation of eq. (27) for $sW_k$ becomes
$$b_k \approx -\log_2\left\{sW_k\, f_{\mathrm{GG}}(X_k \mid \phi_k, \alpha)\right\} \tag{28}$$
and the same holds in the case of $Y_k = 0$. This results in the total bit length:
$$\sum_{k=0}^{N-1} b_k \approx \frac{L}{\ln 2} - \sum_{k=0}^{N-1}\log_2(sW_k) \tag{29}$$
which is approximately proportional to the negative log-likelihood presented in eq. (9). When the quantization is sufficiently fine, minimizing the total bit length comes down to minimizing the negative log-likelihood, which can be optimized using the PAPSE and eq. (11) for $\phi_k^2$. Moreover, to approximate the perceptual weights $W_k$, we can use the smoothed version of the envelope in eq. (15):
$$\hat{W}_k = \left|1+\sum_{n=1}^{p} a_n\rho^{n} e^{-j\omega_k n}\right|^{-2/\alpha} \tag{30}$$
since this envelope in eq. (15) roughly has the shape of the magnitude spectra just as the conventional envelope does.
Summarizing the above discussion, the algorithm to encode the frequency spectra goes as follows:
1) Calculate the model parameters $\{a_n\}$ and $\sigma^2$ in eq. (15) by using the Levinson-Durbin algorithm in which eq. (14) is used for $r_n$ in eq. (2) instead of the inverse cosine transform of power spectra.
2) Approximate the perceptual weights by using eq. (30) instead of the ones in eq. (19).
3) Scalar quantize the spectra by the step size $s\hat{W}_k$ for each frequency bin $k$.
4) Allocate the bits to the quantized spectra by using arithmetic coding according to eq. (24) with eq. (11) for the variance instead of allocating them according to eq. (22).
These minor changes from TCX provide the coder with a globally optimized compression of the scalar-quantized spectra belonging to the generalized Gaussian distribution of shape parameter $\alpha$, the variance of which is constrained by the model of eq. (11). It should be noted that, since PAPSE is based on the all-pole model as is the conventional LP, the model parameters can be quantized using the conventional quantization algorithms for the LP coefficients such as in [37], i.e., the vector quantization of the coefficients in the form of Line Spectrum Pairs (LSP) [38].
V. EVALUATIONS
A. Optimal Shape Parameter for Each Acoustic Situation
First, we compared the compression efficiency for fixed quantized spectra to see how the distribution of the frequency spectra varies depending on the acoustic properties of the signal and to examine how this dependence changes when different models of the variance are used: the uniform variance model, which is assumed in [27], [28]; the conventional LP model with eq. (18), which is used in TCX; and the PAPSE model with eq. (11). The fixed spectra were prepared by quantizing the MDCT coefficients of a test signal using a coder based on TCX at 64 kbps and at 16 kbps, and they were compressed by arithmetic coding assuming a generalized Gaussian distribution. The variance for the generalized Gaussian was calculated using the following three methods:
• For the uniform-variance model, the variance was given by the zeroth-order PAPSE, which means using the maximum-likelihood estimate of uniform variance for each frame;
• For the conventional LP model, the variance was given by eq. (18) with the 16th-order LP coefficients for each frame;
• For the PAPSE model, the variance was given by eq. (11) with the model parameters obtained by the 16th-order PAPSE for each frame, where 16 is an often used prediction order [1], [29], [30].
For each case, without quantizing the model parameters, we compressed the fixed spectra while varying the shape parameter $\alpha$ of the generalized Gaussian in increments of 0.1, and found the $\alpha$ that achieves the minimum description length in bits for each ten consecutive frames. Note that ten consecutive frames were used to avoid meaningless outliers. In all three cases, to fairly compare the effects of the models, we fixed the perceptual weight defined in eq. (20) to the one used to quantize the fixed spectra.
The test signal, the spectrogram of which is shown in Fig. 2(a), was composed of four seconds each of popular music (synthesizer), classical music (violin), jazz music (trumpet), male speech (clean), and the same male speech with pink noise (the signal-to-noise ratio was 10 dB). We used the musical signals in the RWC music database [39] and the speech signals in the ATR speech database [40], all down-sampled to a 16-kHz sampling rate. To make the conditions similar to real mobile communications, the frame length for the MDCT coefficients was set at $N$ samples, and the following window was used:
(31)
Fig. 2. Spectrogram of test signal (a) and results for 64 kbps (b) and 16 kbps (c). The upper and lower halves of (b) and (c) respectively show the change in the optimal parameter and the ratio of the bits saved relative to the conventional setting.
where the window, which involves look-ahead samples, is a truncated sine window presented in [41] that was designed for a low-delay coder. Under these conditions, the coder had 32 ms of algorithmic delay at the 16-kHz sampling rate.
The results are illustrated in Fig. 2(b) and (c). The upper halves of the figures show the change in the optimal shape parameter $\alpha$ for each ten consecutive frames, and the lower halves show the ratio of saved bits compared with arithmetic coding assuming a Laplacian with its variance modeled by the conventional LP (the conventional setting). Although the results of the optimal $\alpha$ for the uniform variance model matched those of the previous studies in [27], [28], the optimal $\alpha$ reflected the acoustic properties much more clearly when a non-uniform variance was used for the generalized Gaussian, i.e., it took higher values for noisy or temporal sounds and lower values for clean or tonal sounds. In terms of compression efficiency, although there were some situations in which tuning $\alpha$ led to an improvement over the conventional setting even if the variance was assumed to be uniform, the assumption of uniformity failed in other situations and worsened the efficiency. In addition, in the case of the conventional LP model, the optimal $\alpha$ was more affected by the change in the bit rate. Furthermore, in the case of the PAPSE model, the compression efficiency was greatly improved, especially for tonal sounds, whereas the improvement was much smaller in the case of the conventional LP model despite tuning $\alpha$.
B. Effects of Shape Parameter on Compression Efficiency
From now on, we will focus on the results for musical signals since TCX is intended for such input, on which it is hard for a time-domain coder to exert its advantages. The second experiment examined the following factors:
• What kind of histogram the optimal $\alpha$ has in different acoustic cases;
• How many bits the coding consumes when the $\alpha$ used in the coding is not the optimal one;
• What the range of acceptable $\alpha$ is when we permit the bit length to be increased;
• Which $\alpha$ is the best when we choose one to code all the spectra.
We created quantized MDCT coefficients from ten-second excerpts of fifty different musical signals by using the same method as in the previous experiment at 16 kbps and compressed the quantized coefficients by arithmetic coding for the generalized Gaussian with the PAPSE using shape parameters ranging from 0.1 to 3 in increments of 0.1. The fifty musical signals were randomly selected from the four databases in the RWC Music Database [39]: ten items each from the Classical Music, Jazz Music, and Music Genre Databases, and twenty items from the Popular Music Database, ten without vocals and ten with vocals. Ten seconds of each signal was down-sampled to 16 kHz.
Fig. 3. Histogram of the optimal shape parameter $\alpha$ (a), change in average bit length caused by changing $\alpha$ from the optimal one (b), range of $\alpha$ that makes the bit length increase less than 3% from the optimal length (c), and average bit length when using the same $\alpha$ for all the data (d). 100% in (d) indicates the average bit length when using the optimal $\alpha$ for each ten consecutive frames. In total, 24720 frames of MDCT coefficients were tested.
Fig. 3(a) is the histogram of the optimal shape parameter $\alpha$ that attained the shortest bit length for each ten consecutive frames. As the histogram shows, the optimal $\alpha$ tends to concentrate around a particular value, which does not conflict with the findings of our previous investigations [31], [41]. The curves in Fig. 3(b) were obtained by gathering the frames which attained the shortest bit length at three different values of $\alpha$ and plotting the average bit length for each case using the other values of $\alpha$ ranging from 0.1 to 3. It can be seen that all three curves have asymmetric forms, steeper on the left side, and the slope gets shallower as the optimal $\alpha$ increases. In regard to the steepness, Fig. 3(c) shows the ranges of $\alpha$ that cause at most a 3% increase in the bit length over the case of the optimal $\alpha$. In other words, each range in this figure corresponds to the range of $\alpha$ that keeps each of the curves in Fig. 3(b) lower than 103%. We can clearly see that the range gets larger as the optimal $\alpha$ becomes higher; that is, the spectra belonging to the generalized Gaussian with higher $\alpha$ are more tolerant, in terms of bit length, to a mismatch in the $\alpha$ used in the coding. Fig. 3(d) is an integration of Figs. 3(a), (b), and (c). This curve is a plot of the bit length when we used a single $\alpha$ for compressing all the test spectra relative to the optimal bit length (100%) when we were able to tune $\alpha$ for each ten consecutive frames. The curve remains asymmetric, and a suitable $\alpha$ for compressing all the test spectra seems to lie between 0.5 and 1.
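The search for the optimal shape parameter used in these experiments can be sketched as a simple grid search. The following example is a simplification, not the procedure of the paper: it uses the ideal description length of the generalized Gaussian instead of an actual arithmetic coder, and the maximum-likelihood uniform variance (the zeroth-order estimate) instead of a 16th-order envelope. The function names and the synthetic test data are assumptions for illustration.

```python
import math
import random


def gg_bits_uniform_variance(coeffs, alpha):
    """Ideal description length in bits of `coeffs` under a generalized
    Gaussian with shape `alpha` and maximum-likelihood uniform variance."""
    n = len(coeffs)
    b = math.sqrt(math.gamma(3.0 / alpha) / math.gamma(1.0 / alpha))
    a = alpha * b / (2.0 * math.gamma(1.0 / alpha))
    mean_pow = sum(abs(x) ** alpha for x in coeffs) / n
    # Maximum-likelihood standard deviation for a fixed shape parameter.
    phi = (alpha * b ** alpha * mean_pow) ** (1.0 / alpha)
    nll = sum((b * abs(x) / phi) ** alpha for x in coeffs)
    nll += n * (math.log(phi) - math.log(a))
    return nll / math.log(2.0)


def best_shape(coeffs, alphas):
    """Shape parameter on the grid that minimises the description length."""
    return min(alphas, key=lambda al: gg_bits_uniform_variance(coeffs, al))


# Grid from 0.1 to 3.0 in increments of 0.1, as in the second experiment.
ALPHAS = [round(0.1 * i, 1) for i in range(1, 31)]

# Synthetic block standing in for ten consecutive frames of coefficients.
rng = random.Random(1)
block = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(2000)]
print(best_shape(block, ALPHAS))
```

In a coder, the per-bin variance from the envelope would replace the uniform variance, and the search would be repeated for each group of frames.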
Fig. 4. SNR of the quantized spectra by shape parameter for arithmetic coding using each LP. In total, 24720 frames of MDCT coefficients were tested at 16 kbps.
Since TCX chooses quantization steps that make the compression results satisfy the given bit rate, the higher the compression efficiency of the arithmetic coding, the more finely the coder is allowed to quantize the input frequency spectra. In order to check whether the quantization actually gets finer as a result of using PAPSE, we compared the Signal-to-Noise Ratio (SNR) of the quantized spectra in eq. (20), which is closely related to sound quality, while fixing the bit rate at 16 kbps and finding the scale $s$ with a bisection search. The same test data as above were compressed using arithmetic coding with one shape parameter $\alpha$ for every frame. To make a fair comparison, the perceptual weight in eq. (20) was approximated using the conventional LP with eq. (19) for every condition. The results are plotted in Fig. 4. The peak of the curve for PAPSE seems to lie between 0.5 and 1. On the other hand, tuning the shape parameter was less effective with the conventional LP, and this result agrees with the results of the experiment in Section V-A. The two SNR curves match in the conditions near $\alpha = 2$ since PAPSE is identical to the conventional LP when $\alpha = 2$, namely when assuming the Gaussian for the spectra.
C. Effects of Optimal Coding Scheme on TCX
To evaluate the capabilities of PAPSE and the generalized Gaussian, we incorporated the coding scheme in a TCX coder at 16 kbps and objectively and subjectively evaluated the sound quality of musical signals. The order of the LP was set at 16, and the frame length and window function were the same as those in the previous experiment, so the algorithmic delay of this coder was designed to be 32 ms at a 16-kHz sampling rate. The model parameters $\{a_n\}$ defined in eq. (11) were quantized by a total of 20 bits for each frame by using the method based on [37] with a different codebook created from a training data set for each condition, and the scale $s$ was logarithmically quantized into 8 bits. For the other model parameter $\sigma^2$, we assumed its ratio to the scale to be a constant that depends only on the bit rate and the shape parameter $\alpha$, since it is strongly related to the entropy of the generalized-Gaussian source. This constant was determined by taking a power mean of the values ascertained from the training data, the justification of which is given in Appendix B. The training data set contained speech and various genres of audio signals, different from the test data, totaling six minutes at a 16-kHz sampling rate.
Fig. 5. Database-wise PEAQ scores. Error bars indicate 95% confidence intervals.
Perceptual Evaluation of Audio Quality (PEAQ) in AFsp [42] was used for the objective evaluation. The test data were the same as in Section V-B, and the following TCX conditions were compared:
• using a Laplacian (generalized Gaussian with shape parameter $\alpha = 1$) whose variance was modeled by the conventional LP model and the perceptual weight approximated by eq. (19), hereinafter called the conventional setting;
• using a generalized Gaussian, with the shape parameter found to be the best in Section V-B, whose variance was modeled by the PAPSE model and the perceptual weight approximated by eq. (30), hereinafter called the proposed setting.
We also prepared a benchmark by using a standard coder whose algorithmic delay amounts to 144 ms when its internal sampling rate is set at 16 kHz [43]. It may seem unfair to use a shape parameter that was shown in Section V-B to be the best for representing the test data. However, our objective in this experiment is just to confirm the existence of a distribution that outperforms the Laplacian and to see the effects of the optimization in that case; thus we do not insist here on the importance of a specific value for the shape parameter.
Fig. 5 displays the database-wise PEAQ scores for all conditions. The performance of TCX appeared to be enhanced by using the generalized Gaussian and PAPSE. Additionally, a paired t-test showed significant differences at the 5% level, both overall and for each database, between the conventional and proposed settings.
The subjective evaluation used ITU-R BS.1534-1, i.e., Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) [48], to compare the three conditions. Five audio items in the RWC Music Database, lasting ten seconds each and down-sampled to 16 kHz, were used. The test items were labeled as follows: “Classic”, for a harpsichord piece from the Classical Music Database; “African”, for a guitar piece from the Music Genre Database; “Jazz”, for a piano piece from the Jazz Music Database; “Popular”, for a synthesizer piece from the Popular Music Database; and “Vocal”, for a female vocal piece from the Popular Music Database. Nine participants, all engaged in audio signal processing research, were blindly provided with the coded items along with the references and 3.5-kHz band-limited anchors, and they graded the quality of each coder from 0 to 100 points.
Fig. 6. Item-wise MUSHRA scores. Averages and 95% confidence intervals based on the t-distribution. (a) Absolute scores. (b) Relative scores of the proposed setting compared with the conventional setting. Asterisks indicate the existence of significant differences at 5% in a paired t-test.
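The paired comparison behind such intervals and asterisks can be sketched as follows. This is only an illustration under stated assumptions: the function name `paired_t`, the small table of critical values, and the nine listener grades are all hypothetical and are not data or code from the paper.

```python
import math
import statistics

# Two-sided 5% critical values of Student's t (standard table values),
# indexed by degrees of freedom. Only a few entries are listed here.
T_CRIT_975 = {4: 2.776, 8: 2.306, 9: 2.262}

def paired_t(scores_a, scores_b):
    """Paired t-test of scores_b against scores_a.

    Returns (mean difference, t statistic, half-width of the 95% confidence
    interval of the mean difference, significance flag at the 5% level).
    """
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(n)   # standard error of the mean
    t_stat = mean / sem
    t_crit = T_CRIT_975[n - 1]
    return mean, t_stat, t_crit * sem, abs(t_stat) > t_crit

# Hypothetical MUSHRA grades of nine listeners for one item (NOT from the paper).
conventional = [52, 60, 47, 55, 63, 58, 49, 61, 54]
proposed = [58, 66, 50, 63, 66, 65, 53, 70, 59]

mean, t_stat, half_width, significant = paired_t(conventional, proposed)
print(f"mean gain {mean:.2f} +/- {half_width:.2f}, t = {t_stat:.2f}, significant: {significant}")
```

Pairing the grades listener by listener removes the between-listener spread from the test, which is why a relative score can be significant even when the absolute intervals of two conditions overlap.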
Fig. 6 presents the absolute scores and the score gains of the proposed setting relative to the conventional setting. The coder with the conventional setting, which performed about as well as the benchmark coder, was enhanced by incorporating the generalized Gaussian and PAPSE optimized to it; two out of five items and the total score showed significant differences at the 5% level.
VI. DISCUSSIONS
The results in Section V-A show the importance of three factors: carefully selecting the distributions for the target spectra, assuming a non-uniform variance over the frequencies, and optimally parameterizing the variance, or spectral envelope. The most interesting result is that only in the case of parameterizing the variance by the conventional LP do the optimal shape parameters fluctuate greatly (compare Fig. 2(b) and (c)). Since the conditions in Fig. 2(b) and (c) differed solely in terms of the bit rate, in other words, in terms of the step size of the quantization, one may expect that only the scale, not the shape, of the distribution is influenced by the quantization. However, this expectation only holds true so long as the variance of the target is properly estimated, and we conjecture that an improper variance, in the sense of the likelihood, parameterized by the conventional LP makes it more difficult to describe the target spectra only by changing the shape parameter. Therefore, tuning the shapes of the assumed distributions may be in vain unless we parameterize the variance properly. Surprisingly, PAPSE, whose optimality was proven only in the case of sufficiently fine quantization, seems to have properly parameterized the variance even in the low-bit-rate situation (e.g., 16 kbps at a 16-kHz sampling rate), and it greatly reduced the description length.
The results in Section V-B hint at topics for future study. Figs. 3(b) and (d) show some interesting curves that can reasonably be interpreted as reflecting the characteristics of the Kullback-Leibler divergence, a measure for probability distributions.
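As a rough illustration of this interpretation, the excess code length caused by assuming the wrong shape parameter equals the Kullback-Leibler divergence between the true and the assumed distribution. The sketch below evaluates it numerically for unit-variance generalized Gaussians. The function names, the integration settings, the symbol `alpha` for the shape parameter, and the choice of matching the variances (rather than optimizing the scale) are all assumptions of this example; it does not reproduce the curves of Fig. 3.

```python
import math

def gg_logpdf(x, alpha, sigma=1.0):
    """Natural-log density of a zero-mean generalized Gaussian.

    alpha is the shape (1: Laplacian, 2: Gaussian), sigma the standard deviation.
    """
    g1 = math.gamma(1.0 / alpha)
    g3 = math.gamma(3.0 / alpha)
    eta = math.sqrt(g3 / g1) / sigma          # inverse scale giving variance sigma**2
    return math.log(alpha * eta / (2.0 * g1)) - (eta * abs(x)) ** alpha

def kl_bits(alpha_true, alpha_model, limit=30.0, steps=30000):
    """D(true || model) in bits per sample.

    Midpoint rule on [0, limit], doubled because both densities are symmetric.
    """
    h = limit / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        lp = gg_logpdf(x, alpha_true)
        lq = gg_logpdf(x, alpha_model)
        total += math.exp(lp) * (lp - lq)
    return 2.0 * total * h / math.log(2.0)

# Excess bits per sample when a Laplacian source is coded with other shapes.
for a in (0.5, 0.7, 1.0, 1.5, 2.0):
    print(f"assumed shape {a:.1f}: {kl_bits(1.0, a):.4f} excess bits/sample")
```

The penalty is zero only when the assumed shape matches the true one and is not symmetric around it, which is one way to read a non-uniform tolerance to the shape parameter.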
The non-uniformity in the tolerance of the shape parameter, as seen in Fig. 3(c), is a critical factor when making modifications to the coder. The simplest conceivable modification of the coder is adaptively changing the shape parameter frame by frame, with the shape parameter quantized after it has been estimated (although the smart estimation of the shape parameter for the generalized Gaussian with a non-uniform variance still remains a challenge). In this case, the quantization of the shape parameter should consider the ranges in Fig. 3(c). In contrast, the results in Fig. 3(d) might not seem promising for this adaptive use of the shape parameter, since a single fixed value, at least for the test data, already achieves about 103% of the optimal bit length attainable by tuning. However, this graph is only the result for the 16th-order PAPSE, so we expect that using PAPSE with an adaptive shape parameter in combination with prediction order estimation, [44], [45] for instance, would increase the efficiency of the compression.
Finally, we should briefly discuss the computational complexity and the effects of the slight modifications in going from the conventional variance model in eq. (18) to the proposed variance model in eq. (11). Owing to these slight modifications, which were needed to make it possible to use the Levinson-Durbin algorithm, the additional costs of the PAPSE computation only appear in the powering operations in eqs. (11), (14), and (30) for the encoder. According to ITU-T BASic OPerators (BASOP) [46], the total increase in computational costs in the encoder, under the coder conditions stated in the evaluations, amounts to approximately 1.2 weighted million operations per second (WMOPS), an increase of only about 4% of the total, when the shape parameter is not an integer. Note that this computation still has room for improvement. If we permit a decrease in the precision of the envelope, we can reduce the costs by skipping the calculation of the envelope at several frequencies, as is done in [30]. Moreover, for a particular setting of the shape parameter, the procedure of PAPSE corresponds to the one in [47], and it is known that the computational complexity is then less than in the case of the conventional LP. Meanwhile, we have to emphasize that the proposed PAPSE is not guaranteed to always make the bit length shorter than that of the conventional LP because the model is slightly changed in order to simplify the algorithm. Nevertheless, the results of the experiments showed that these differences had a trivial influence on the PAPSE model.
VII. CONCLUSIONS
We proposed a coding scheme that efficiently compresses the frequency spectra belonging to the generalized Gaussian distribution with a non-uniform variance over the frequencies. In Section II, we pointed out that the conventional LP is optimal, in the sense of likelihood, only for Gaussian-distributed spectra and that it is not optimal in the case of the state-of-the-art audio coder. In Section III, we presented an alternative variance model, which is a slightly modified version of the all-pole model, and a method called PAPSE that parameterizes the variance of the generalized Gaussian, the distribution which covers
the Laplacian and the Gaussian. This parameterization method shares the same process as the conventional LP and entails only a minor increase in complexity. The experiments presented in Section V confirmed the effects of the generalized Gaussian and of PAPSE on the compression rate. In particular, the objective and subjective assessments of the quality of the state-of-the-art coder showed that PAPSE actually yielded significant enhancements. Therefore, we believe that a generalized Gaussian with a non-uniform variance modeled with PAPSE can be used in a frequency-domain audio coder for mobile communications. Moreover, we would like to apply this method to a lossless audio coder since it simply enhances the compression efficiency.
APPENDIX A
EQUIVALENCE OF THE SQUARED ERROR AND IS DIVERGENCE
Here we represent the spectral envelope
as
(32)
The minimization problem of the LP with a given frame length can be equivalently written, using Parseval's theorem, as²
(33)
(Here, we regard the spectra as the DFT spectra of a real-valued signal.) Ignoring the edge effects, the objective function can be approximated by the variance determined from half of the frame:
(34)
where we have denoted it as a variance since it also means the variance of the prediction errors. On the other hand, the IS divergence discussed in this paper can be transformed, by substituting eqs. (4), (32), and (34) into eq. (3), as follows:
(35)
where the additive term is a constant. Ignoring the edge effects, it becomes
(36)
Therefore, if we can prove that the logarithmic term vanishes, we can say that the two objective functions satisfy
(37)
and the equivalence of the problems will be proved. The rest of this section shows the proof for this statement. Note that proving this statement is trivial for continuous signals, since the complex line integral of a logarithmic function around the unit circle is zero. However, when it comes to discrete signals, we have to be a little more careful. Taking into account that its terms are composed of Fourier series, the envelope can be interpreted, when the coefficients are stable, as the eigenvalues of a matrix represented by a circulant matrix:
(38)
Thus, the geometric mean of the envelope is found from the determinant of this matrix:
(39)
Hereinafter, without loss of generality, we regard the dimensions involved as powers of two. To evaluate the determinant, we divide the matrix as follows:
in other words,
(40)
First, let us evaluate the upper bound of the determinant. Using an inequality that holds for positive-semidefinite matrices, together with the fact that one of the matrices is lower triangular with all of its diagonal elements valued one, the determinant satisfies
(41)
Next, for the lower bound, we use a theorem on determinants:
(42)
According to the structure of the divided matrix, the matrix in eq. (42) can be written as
(43)
and we can rewrite the determinant by using it:
(44)
The new matrix has non-zero elements in the same places as the previous one, so that the transformation of the determinant from eq. (42) to eq. (44) can be repeated until the matrix in the equation reaches its final form:
(45)
Finally, applying Hadamard's inequality to this equation, we get
(46)
where the two quantities introduced are, respectively, an element of the matrix and the maximum of those elements, resulting in
(47)
From the above, the determinant must satisfy
(48)
and since both bounds converge to the same limit, the squeeze theorem leads to
(49)
Therefore, the minimization problem of the LP is equivalent to minimizing the IS divergence when the frame length is sufficiently larger than the prediction order.
²We used distinct symbols here to represent the objective functions of the LP and the IS divergence in order to make it clear that one function corresponds to the Minimum Mean Squared Error (MMSE) criterion and the other to the Maximum Likelihood (ML) criterion.
APPENDIX B
SELECTION OF THE MODEL PARAMETER
For simplicity, we represent the variance model of PAPSE defined in eq. (11) in a shortened form, where
(50)
The variance of the distribution to which the quantized spectra of eq. (20) belong is written as
(51)
Both the model parameter and the scale take scalar values for each frame, and in the case of a fixed bit rate, if the scale is large, the model parameter must also be large, and vice versa. Therefore, we will represent the model parameter by its ratio to the scale, with an appropriate constant found from a training data set for a given bit rate and shape parameter. Let us analyze the data and estimate the variance for each frame with the LP model. The frame number is represented by a subscript. We assume that the estimated variance, unless its model parameters are quantized, indicates the true variance of the distribution to which the training data belong. The constant should minimize the expectation of the description length of coding the values belonging to the same distribution as the training data. Thus, we use a description length that is the expectation of the bit length when we ideally code, with a fixed constant, the values belonging to the same distribution as the spectrum of a given frequency bin in a given frame:
(52)
This description length is a convex function of the constant, so the value that minimizes the expected description length of the whole training data is given by the stationary point:
(53)
which is the power mean of the per-frame values, with the exponent equal to the shape parameter.
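The stationary-point argument can be checked numerically with a small sketch. It rests on assumptions that are not taken from the paper: the generalized Gaussian is written with a scale psi as a density proportional to exp(-(|x|/psi)^alpha), so that the expected ideal code length for a source of true scale phi is log(psi) + (phi/psi)^alpha / alpha up to a constant; every frame uses a model scale c times a reference scale; and the per-frame ratios listed are hypothetical values. The names `power_mean` and `mean_length` are likewise illustrative.

```python
import math

def power_mean(values, alpha):
    """alpha-power mean of positive values: (mean of v**alpha) ** (1 / alpha)."""
    return (sum(v ** alpha for v in values) / len(values)) ** (1.0 / alpha)

def mean_length(c, ratios, alpha):
    """Average ideal code length in nats, up to an additive constant.

    Each frame is coded with model scale c * s while its true scale is r * s,
    where r is that frame's entry in ratios (s cancels out of the comparison).
    """
    return sum(math.log(c) + (r / c) ** alpha / alpha for r in ratios) / len(ratios)

# Hypothetical per-frame ratios of true scale to reference scale (NOT from the paper).
ratios = [0.8, 1.1, 1.4, 0.9, 2.0, 1.2]
alpha = 0.7                                   # assumed shape parameter

c_star = power_mean(ratios, alpha)            # stationary point of mean_length in c
for c in (0.9 * c_star, c_star, 1.1 * c_star):
    print(f"c = {c:.4f}: mean length = {mean_length(c, ratios, alpha):.6f}")
```

Setting the derivative of `mean_length` with respect to `c` to zero gives c**alpha equal to the mean of r**alpha, which is why the minimizer is the power mean with the exponent equal to the shape parameter; for alpha = 1 it reduces to the arithmetic mean and for alpha = 2 to the root mean square.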
REFERENCES
[1] 3GPP TS 26.290 Release 11, 3GPP, 2012.
[2] R. Salami, R. Lefebvre, A. Lakaniemi, K. Kontola, S. Bruhn, and A. Taleb, “Extended AMR-WB for high-quality audio on mobile devices,” IEEE Commun. Mag., vol. 44, no. 5, pp. 90–97, May 2006.
[3] ISO/IEC 23003-3:2012, Information technology–MPEG audio technologies–Part 3: Unified speech and audio coding.
[4] S. Quackenbush, “MPEG unified speech and audio coding,” IEEE Comput. Soc. MultiMedia, vol. 20, no. 2, pp. 72–78, Jun. 2013.
[5] M. Neuendorf, M. Multrus, N. Rettelbach, G. Fuchs, J. Robilliard, J. Lecomte, S. Wilde, S. Bayer, S. Disch, C. Helmrich, R. Lefebvre, P. Gournay, B. Bessette, J. Lapierre, K. Kjorling, H. Purnhagen, L. Villemoes, W. Oomen, E. Schuijers, K. Kikuiri, T. Chinen, T. Norimatsu, C. K. Seng, E. Oh, M. Kim, S. Quackenbush, and B. Grill, “MPEG unified speech and audio coding - the ISO/MPEG standard for high-efficiency audio coding of all content types,” in Proc. AES 132nd Conv. Paper, Apr. 2012, vol. #8654.
[6] 3GPP TS 26.441 Release 12, 3GPP, 2014.
[7] T. Robinson, “SHORTEN: Simple lossless and near-lossless waveform compression,” Cambridge Univ. Eng. Dept., Cambridge, U.K., Tech. Rep. 156, 1994.
[8] M. Hans and R. W. Schafer, “Lossless compression of digital audio,” IEEE Signal Process. Mag., vol. 18, no. 4, pp. 21–32, Jul. 2001.
[9] ISO/IEC 14496-3:2009, Information technology–Coding of audio-visual objects–Part 3: Audio.
[10] T. Liebchen, Y. Reznik, T. Moriya, and D. T. Yang, “MPEG-4 audio lossless coding,” in Proc. AES 116th Conv., May 2004, vol. #6047.
[11] T. Liebchen and Y. Reznik, “MPEG-4 ALS: An emerging standard for lossless audio coding,” in Proc. Data Compression Conf., Mar. 2004, pp. 439–448.
[12] Y. Reznik, “Coding of prediction residual in MPEG-4 standard for lossless audio coding (MPEG-4 ALS),” in Proc. ICASSP, 2004, pp. III–1024–1027.
[13] T. Liebchen, T. Moriya, N. Harada, Y. Kamamoto, and Y. A. Reznik, “The MPEG-4 audio lossless coding (ALS) standard - technology and applications,” in Proc. AES 119th Conv., Oct. 2005, Paper #6589.
[14] H. Kameoka, Y. Kamamoto, N. Harada, and T. Moriya, “A linear predictive coding algorithm minimizing the Golomb-Rice code length of the residual signal,” IEICE Trans. Fundam. Electron., vol. J91–A, no. 11, pp. 1017–1025, Nov. 2008, (in Japanese).
[15] Y. Kamamoto, “Efficient lossless coding of multichannel signal based on time-space linear predictive model,” Ph.D. dissertation, Graduate School of Inf. Sci. and Technol., Univ. of Tokyo, Nov. 2012, pp. 67–82.
[16] S. Salomon and G. Motta, Handbook of Data Compression. New York, NY, USA: Springer, 2010, pp. 279–280–1018–1030.
[17] ITU-T G.711.0, “Lossless compression of G.711 pulse code modulation,” 2009.
[18] N. Harada, Y. Kamamoto, T. Moriya, Y. Hiwasaki, M. A. Ramalho, L. Netsch, J. Stachurski, L. Miao, H. Taddei, and F. Qi, “Emerging ITU-T standard G.711.0 - lossless compression of G.711 pulse code modulation,” in Proc. ICASSP, 2010, pp. 4658–4661.
[19] N. Harada, Y. Kamamoto, and T. Moriya, “Lossless compression of mapped domain linear prediction residual for ITU-T recommendation G.711.0,” in Proc. Data Compression Conf., Mar. 2010, p. 532.
[20] S. W. Golomb, “Run-length encodings,” IEEE Trans. Inf. Theory, vol. IT-12, no. 3, pp. 399–401, Jul. 1966.
[21] R. F. Rice, “Some practical universal noiseless coding techniques part I-III,” Jet Propulsion Lab. Tech. Rep., vol. JPL-79-22, JPL-83-17, JPL-91-3, 1979, 1983, 1991.
[22] K. Sayood, Lossless Compression Handbook. New York, NY, USA: Academic, 2003.
[23] S. G. Mallat, “A theory for multiresolution signal decomposition: The wavelet representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, pp. 674–693, Jul. 1989.
[24] P. Moulin and J. Liu, “Analysis of multiresolution image denoising schemes using generalized gaussian and complexity priors,” IEEE Trans. Inf. Theory, vol. 45, no. 3, pp. 909–919, Apr. 1990.
[25] C. Bouman and K. Sauer, “A generalized Gaussian image model for edge-preserving MAP estimation,” IEEE Trans. Image Process., vol. 2, no. 3, pp. 296–310, Jul. 1993.
[26] C. Parisot, M. Antonini, and M. Barlaud, “3d scan based wavelet transform and quality control for video coding,” EURASIP, vol. 2003, no. 1, pp. 521–528, Jan. 2003.
[27] R. Yu, X. Lin, S. Rahardja, and C. C. Ko, “A statistics study of the MDCT coefficient distribution for audio,” in Proc. ICME, 2004, pp. 1483–1486.
[28] M. Oger, S. Ragot, and M. Antonini, “Transform audio coding with arithmetic-coded scalar quantization and model-based bit allocation,” in Proc. ICASSP, 2007, pp. IV–545–548.
[29] G. Fuchs, M. Multrus, M. Neuendorf, and R. Geiger, “MDCT-Based coder for highly adaptive speech and audio coding,” in Proc. EUSIPCO, 2009, pp. 1264–1268.
[30] 3GPP TS 26.445 Codec for Enhanced Voice Services (EVS); Detailed algorithmic description Release 12, 3GPP, 2014.
[31] R. Sugiura, Y. Kamamoto, N. Harada, H. Kameoka, and T. Moriya, “Golomb-Rice coding optimized via LP for frequency domain audio coder,” in Proc. GlobalSIP, 2014.
[32] D. Mansour and B. H. Juang, “The short time modified coherence representation and noisy speech recognition,” in Proc. ICASSP, 1988, vol. 1, pp. 525–528.
[33] W. Thanhikam, Y. Kamamori, A. Kawamura, and Y. Iguni, “Stationary and non-stationary wide-band noise reduction using zero phase signal,” IEICE Trans. Fundam. Electron., vol. E95–A, no. 5, pp. 843–852, May 2012.
[34] A. Kawamura, “A restricted impact noise suppressor in zero phase domain,” in Proc. EUSIPCO, 2014, vol. WE-P1-11.
[35] D. R. Gonzalez, E. L. Solano, and J. R. Calvo de Lara, “Zero phase speech representation for robust formant tracking,” in Proc. EUSIPCO, 2014, vol. TH-P4-3.
[36] T. Bäckström and C. R. Helmrich, “Decorrelated innovative codebooks for ACELP using factorization of autocorrelation matrix,” in Proc. Interspeech, 2014, pp. 2794–2798.
[37] H. Ohmuro, T. Moriya, K. Mano, and S. Miki, “Vector quantization of LSP parameters using moving average interframe prediction,” Electron. Commun. Jpn., vol. 77, pp. 12–26, 1994.
[38] F. Itakura, “Line spectrum representation of linear predictive coefficients,” J. Acoust. Soc. Amer., vol. 57, no. 1, 1975, Suppl., S35.
[39] M. Goto, “Development of the RWC music database,” in Proc. ICA, Apr. 2004, pp. I-553–556.
[40] A. Kurematsu, K. Takeda, Y. Sagisaka, S. Katagiri, H. Kuwabara, and K. Shikano, “ATR Japanese speech database as a tool of speech recognition and synthesis,” Speech Commun., vol. 9, no. 4, pp. 357–363, Aug. 1990.
[41] R. Sugiura, Y. Kamamoto, N. Harada, H. Kameoka, and T. Moriya, “Resolution warped spectral representation for low-delay and low-bitrate audio coder,” IEEE Trans. Acoust., Speech, Signal Process., vol. 23, no. 2, pp. 288–299, Feb. 2015.
[42] [Online]. Available: http://www-mmsp.ece.mcgill.ca/Documents/Software/Packages/AFsp/AFsp.html (as of Oct. ’14)
[43] [Online]. Available: http://www.voiceage.com/AMR-WBplus.html (as of Oct. ’14)
[44] Y. Kamamoto, T. Moriya, and N. Harada, “Low-complexity PARCOR coefficient quantizer and prediction order estimator for lossless speech coding,” Proc. ICASSP, pp. 4678–4681, 2010.
[45] Y. Kamamoto, T. Moriya, N. Harada, and Y. Hiwasaki, “Low-complexity PARCOR coefficient quantization and prediction order estimation designed for entropy coding of prediction residuals,” Acoust. Sci. Technol., vol. 34, no. 2, pp. 105–112, 2013.
[46] ITU-T Software Tool Library, 2009, User’s Manual, Geneva, 2009.
[47] T. Moriya, N. Iwakami, K. Ikeda, and S. Miki, “Extension and complexity reduction of TwinVQ audio coder,” in Proc. ICASSP, 1996, vol. 2, pp. 1029–1032.
[48] ITU-R BS.1534-1, “Method for the subjective assessment of intermediate quality level of coding systems,” 2001.
Ryosuke Sugiura received the B.S. degree from the University of Tokyo in 2013. In the same year, he entered the Graduate School of Information Science and Technology, the University of Tokyo, and became an intern at NTT Communication Science Laboratories. In 2015, he joined the same laboratories of NTT after receiving the M.S. degree from the University of Tokyo. His research interests include audio coding and signal processing. He is a member of the Acoustical Society of Japan (ASJ) and IEEE.
Yutaka Kamamoto received the B.S. degree in applied physics and physico-informatics from Keio University in 2003 and the M.S. and Ph.D. degrees in information physics and computing from the University of Tokyo in 2005 and 2012, respectively. Since joining NTT Communication Science Laboratories in 2005, he has been studying signal processing and information theory, particularly lossless coding of time-domain signals. He additionally joined NTT Network Innovation Laboratories, where he developed the audio-visual codec for ODS (Other
Digital Stuff/Online Digital Source) from 2009 to 2011. He has contributed to the standardization of coding schemes for MPEG-4 Audio lossless coding (ALS), ITU-T Recommendation G.711.0 Lossless compression of G.711 pulse code modulation, and 3GPP Enhanced Voice Services (EVS). He received the Telecom System Student Award from the Telecommunications Advancement Foundation (TAF) in 2006, the IPSJ Best Paper Award from the Information Processing Society of Japan (IPSJ) in 2006, the Telecom System Encouragement Award from TAF in 2007, and the Awaya Prize Young Researcher’s Award from ASJ in 2011. He is a member of IPSJ, ASJ, the Institute of Electronics, Information and Communication Engineers (IEICE), and IEEE.
Noboru Harada received the B.S. and M.S. degrees from the Department of Computer Science and Systems Engineering of Kyushu Institute of Technology, in 1995 and 1997, respectively. He joined NTT in 1997. His main research areas have been lossless audio coding, high-efficiency coding of speech and audio, and their applications. He additionally joined NTT Network Innovation Laboratories, where he developed the audio-visual codec for ODS, from 2009 to 2011. He is an editor of ISO/IEC 23000-6:2009 Professional Archival Application Format, ISO/IEC 14496-5:2001/Amd.10:2007 reference software MPEG-4 ALS, and ITU-T G.711.0, and has contributed to 3GPP EVS. He is a member of IEICE, ASJ, the Audio Engineering Society (AES), and IEEE.
Hirokazu Kameoka received the B.E., M.E., and Ph.D. degrees all from the University of Tokyo, Tokyo, Japan, in 2002, 2004, and 2007, respectively. He is currently a Research Scientist at NTT Communication Science Laboratories, Kanagawa, Japan. His research interests include computational auditory scene analysis, acoustic signal processing, speech analysis, and music applications. Dr. Kameoka is a member of IEICE, IPSJ, and ASJ. He was awarded the Yamashita Memorial Research Award by IPSJ, the Best Student Paper Award Finalist at 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2005), the 20th Telecom System Technology Student Award by TAF in 2005, the Itakura Prize Innovative Young Researcher Award by ASJ, 2007 Dean’s Award for Outstanding Student in the Graduate School of Information Science and Technology by the University of Tokyo, 1st IEEE Signal Processing Society Japan Chapter Student Paper Award in 2007, the Awaya Prize Young Researcher Award by ASJ in 2008, the IEEE Signal Processing Society 2008 SPS Young Author Best Paper Award in 2009, and IEICE ISS Young Researcher’s Award in Speech Field in 2011.
Takehiro Moriya received the B.S., M.S., and Ph.D. degrees in mathematical engineering and instrumentation physics from the University of Tokyo, in 1978, 1980, and 1989, respectively. Since joining NTT Laboratories in 1980, he has been engaged in research on medium to low bitrate speech and audio coding. In 1989, he worked at AT&T Bell Laboratories as a Visiting Researcher. Since 1990 he has contributed to the standardization of coding schemes for the Japanese Public Digital Cellular system, ITU-T, ISO/IEC MPEG, and 3GPP. He is currently Head of Moriya Research Laboratory in NTT Communication Science Laboratories, and a Fellow of NTT. He is a member of the Senior Editorial Board of the IEEE Journal of Selected Topics in Signal Processing.