A Bayesian Method for High-frequency Restoration of Low Sample-rate Speech
Yunpeng Xu, Changshui Zhang
State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University, 100084, Beijing, P.R. China
[email protected],
[email protected]
Abstract. Compared with high sample-rate speech, low sample-rate speech loses all high-frequency components beyond its Nyquist frequency, which can severely impair its sound quality. To address this problem, this paper proposes a novel high-frequency (HF) restoration method for low sample-rate speech based on Bayesian inference, which turns the restoration problem into a maximum a posteriori estimation. With this method, the relation between high-frequency and low-frequency components is first extracted from a training set, and the compatibility between neighboring audio frames is modelled by a one-dimensional Markov Random Field. The extracted knowledge is then used to reconstruct the original high sample-rate signal from the low sample-rate test audio. Experiments demonstrate the applicability and effectiveness of the method.
1 Introduction
Although the high-frequency components of typical speech audio have little power compared with the lower-frequency components, they still carry rich information and determine the perceived audio quality to a large extent. This is reflected in the fact that we tend to describe low sample-rate audio, which contains few high frequencies, as obscure and blurred, while associating high sample-rate audio, which contains abundant high frequencies, with clarity and brightness. In many real audio systems, however, high frequencies are quickly attenuated and suppressed for various reasons, which usually results in deteriorated sound quality.

To address this, many EQ-based mechanisms have been introduced in the audio industry to boost and compensate the high frequencies. While these methods alleviate the problem, they can only emphasize high-frequency content that is present but attenuated in the audio file.

The increasing need for cross-platform working has posed a new set of problems related to low sample rates. For example, we would like low sample-rate speech from the telephone to have the same quality as high sample-rate CD audio. However, this cannot be achieved simply by upsampling [6], which carefully removes all spectral components beyond the input signal bandwidth with a low-pass filter. To address this problem, an excitation algorithm is presented in [2] to extrapolate the high frequencies beyond the Nyquist frequency
from the existing lower-frequency content. However, without the original high sample-rate audio, the low sample-rate audio carries no information about the missing high frequencies. This means that such methods can only guess the high-frequency content heuristically, so artifacts that are irrelevant to the speaker are inevitable.

On the other hand, as a basic assumption in speech recognition, the spectrum of a specific speaker's voice has a relatively stable composition pattern, which indicates that the high-frequency and low-frequency components of a voice are related in a certain way. Therefore, if the relation between high-frequency and low-frequency components can be learned from a training set, the missing high-frequency components in a low sample-rate audio can be inferred from this knowledge. Inspired by this idea, we propose a novel HF restoration method for low sample-rate speech based on Bayesian inference, which turns the restoration problem into a Maximum a Posteriori (MAP) estimation and estimates the original high sample-rate speech with the help of the training audio.

The rest of this paper is organized as follows. Section 2 describes the principles and the algorithm of the proposed method in detail. Experimental results are shown in Section 3. Finally, concluding remarks and future research plans are given in Section 4.
2 Bayesian Framework for HF Restoration of Low Sample-rate Speech
In this section, we first introduce the Bayesian framework for HF restoration of low sample-rate speech. The two factors that determine the optimization objective, the likelihood and the prior, are then analyzed, followed by an algorithm to optimize the resulting objective. Finally, the audio features used for matching are discussed.

2.1 The Bayesian Framework
Consider an observed low sample-rate speech audio L composed of n overlapped frames, i.e., L = {l_1, l_2, ..., l_n}, and denote the corresponding high sample-rate speech audio by H = {h_1, h_2, ..., h_n}. The restoration problem can then be stated simply as: infer H from a given L. For a specific speaker's voice, the low-frequency components relate probabilistically to the high-frequency components, and this probabilistic relation can be conveyed by the training audio. A reasonable deduction is that testing audio of the same person should also preserve this relation. Therefore, inferring H can be understood as finding, given L, the optimal series {h_1^*, h_2^*, ..., h_n^*} for the reconstructed speech H^*, such that the probability P(H|L) is maximized, i.e.,

H^* = \arg\max_H P(H \mid L)    (1)
This is a typical MAP estimation problem. By Bayes' theorem, it follows that

H^* = \arg\max_H P(L \mid H) P(H)    (2)
where P(L|H) is the likelihood, i.e., the probability of a given H producing L, and P(H) is the prior, i.e., the probability of occurrence of the estimated H.

2.2 Likelihood and Prior
The likelihood function P(L|H) describes the probability of a high sample-rate audio producing the corresponding low sample-rate audio. It can be written as

P(L \mid H) = \prod_{i=1}^{n} p(l_i \mid h_i)    (3)

where p(l_i|h_i) is the probability of h_i producing l_i and depends on the transform between the two. Intuitively, if a high sample-rate frame approximates the observed low sample-rate frame after downsampling, the likelihood should approach 1, and otherwise approach 0. It is therefore natural to model the transform with a Gaussian probabilistic function, i.e., we define p(l_i|h_i) as

p(l_i \mid h_i) = \frac{1}{Z} \exp\{-\|D h_i - l_i\|^2 / \sigma^2\}    (4)

where D is a downsampling operator, \|\cdot\| is a distance measure describing the difference between D h_i and l_i, \sigma^2 is the variance, and Z is a normalization constant. Denoting \|D h_i - l_i\|^2 / \sigma^2 by \phi(l_i, h_i), P(L|H) can be written as

P(L \mid H) = \frac{1}{Z} \exp\{-\Phi(L, H)\} = \frac{1}{Z} \exp\Big\{-\sum_{i=1}^{n} \phi(l_i, h_i)\Big\} = \frac{1}{Z} \exp\Big\{-\sum_{i=1}^{n} \|D h_i - l_i\|^2 / \sigma^2\Big\}    (5)
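As an illustration of how the likelihood potential \phi(l_i, h_i) in formula (5) might be computed, the following is a minimal sketch. The downsampling operator D is approximated here by decimation with an anti-aliasing filter; the decimation factor, the value of \sigma, and the use of raw samples instead of the MFCC features introduced in Sect. 2.4 are all simplifying assumptions.

```python
import numpy as np
from scipy.signal import decimate


def phi(l_i, h_i, factor=8, sigma=1.0):
    """Likelihood potential phi(l_i, h_i) = ||D h_i - l_i||^2 / sigma^2.

    D is approximated by decimation with an anti-aliasing low-pass filter.
    The factor 8 (e.g. 48 kHz -> 6 kHz) and sigma = 1 are illustrative
    assumptions, not values prescribed by the paper.
    """
    d_h = decimate(h_i, factor)          # D h_i: downsampled candidate frame
    n = min(len(d_h), len(l_i))          # guard against small length mismatches
    return float(np.sum((d_h[:n] - l_i[:n]) ** 2)) / sigma ** 2
```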
Formula (4) suggests a straightforward nearest-neighbor algorithm for this task. For each low sample-rate frame l_i (i = 1, 2, ..., n), we search the training set H_train for the high sample-rate frame h_{train,j} (j = 1, 2, ..., m) that best approximates l_i after downsampling; h_{train,j} is then used in place of l_i as the restored high sample-rate frame h_i. It should be emphasized that, in practice, to keep the low-frequency component h_i^l of h_i unchanged after upsampling, only the high-frequency component h_{train,j}^h of h_{train,j} is used for replacement and is taken as the restored high-frequency component h_i^h. Then h_i^l, which is generated by interpolation from l_i, is added to h_{train,j}^h to form h_i, i.e., h_i = h_{train,j}^h + h_i^l. However, this simple method cannot preserve a smooth connection at frame joints because it ignores the consistency of neighboring frames. In fact, the local frame information alone is insufficient for HF restoration, which indicates
that neighboring primitives must be taken into consideration. Therefore, we propose to model HF restoration of low sample-rate speech with a one-dimensional Markov Random Field (MRF), so as to represent the compatibility between frames in a non-parametric way. The MRF model for HF restoration is shown in Figure 1. Each node in the figure denotes an audio frame: the low sample-rate frames are the observation nodes L, while the high sample-rate frames are the hidden nodes H. The lines indicate statistical dependencies between nodes, where the function \phi is the likelihood energy function defined above, and the function \psi describes the compatibility of two neighboring high sample-rate frames. Two neighboring frames are more compatible if they agree better in their overlap region, in which case the prior potential takes a small value. Therefore, we define the function \psi as

\psi(h_i, h_{i+1}) = \|h_i - h_{i+1}\|^2_{h_i \cap h_{i+1}}    (6)
Fig. 1. Illustration of the MRF model for HF restoration of low sample-rate speech.
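The compatibility potential \psi of formula (6) could be computed as in the following minimal sketch, assuming consecutive frames are stored as NumPy arrays and that the overlap length (frame length minus hop size) is known; both are implementation details not fixed by the paper.

```python
import numpy as np


def psi(h_i, h_next, overlap):
    """Prior potential psi(h_i, h_{i+1}): squared distance over the overlap.

    `overlap` is the number of samples shared by two consecutive frames
    (frame length minus hop size); its value is an assumption here.
    """
    tail = h_i[-overlap:]       # trailing part of the current frame
    head = h_next[:overlap]     # leading part of the next frame
    return float(np.sum((tail - head) ** 2))
```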
Notice that if each frame is taken as a state and the transition probability between states is defined as a function of \psi, then this model is equivalent to an HMM, which is widely used in audio modelling. The MRF model for HF restoration of low sample-rate speech also facilitates the computation of the frame prior. By the Hammersley-Clifford theorem, every MRF has a joint probability in Gibbs form [1]; this is the so-called Markov-Gibbs equivalence. Therefore, the prior P(H) can be expressed as

P(H) = \frac{1}{Z_H} \exp\{-\Psi(H)/T\}    (7)

where Z_H is a normalization constant and T is a control parameter. If only the first-order neighborhood is considered, then \Psi(H) can be expressed as the following function of the prior potential:

\Psi(H) = \sum_{i \in \{1, 2, ..., n-1\}} \psi(h_i, h_{i+1})    (8)

Hence,

P(H) = \frac{1}{Z_H} \exp\Big\{-\sum_{i \in \{1, 2, ..., n-1\}} \|h_i - h_{i+1}\|^2_{h_i \cap h_{i+1}} / T\Big\}    (9)
To summarize, with the MRF model for HF restoration and the Hammersley-Clifford theorem, we decompose the complex computation of the joint prior probability into the computation of the local prior potentials \psi(h_i, h_{i+1}), i = 1, 2, ..., n-1. The constraint of consistency and compatibility between frames is also guaranteed in this way.

2.3 Posterior and its Computation
Combining formulae (2), (5), and (7), we obtain the expression for the posterior P(H|L):

P(H \mid L) = \frac{1}{Z \cdot Z_H} \exp\{-(\Phi(L, H) + \Psi(H)/T)\}

The optimization objective is therefore equivalent to

H^* = \arg\min_H \big(\Phi(L, H) + \Psi(H)/T\big)    (10)
Therefore, in the Bayesian framework, the HF restoration of low sample-rate speech can be realized through four steps, as shown in Table 1.

Table 1. Flow chart for HF restoration of low sample-rate speech in the Bayesian framework

1. Divide the training audio H_train and the testing audio L into overlapped frames;
2. Separate the high-frequency components H_train^h from the training audio H_train;
3. Upsample the testing audio L by interpolation to obtain the low-frequency components H^l of the high sample-rate output;
4. Search in H_train^h for the optimal combination H^{h*} of the h_{train,j}^h (j = 1, 2, ..., m) such that formula (10) is minimized by H^*, the sum of H^l and H^{h*}. Then H^* is the result of the restoration.
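A compact sketch of how the four steps above could be wired together follows. All helper names (split_frames, upsample_lf, extract_hf, one_pass_restore, overlap_add) are hypothetical placeholders for the operations described in the table, not functions defined by the paper; one_pass_restore is sketched after the next paragraph.

```python
def restore(train_audio, test_audio):
    """High-level flow of Table 1 (hypothetical helper functions throughout)."""
    H_train = split_frames(train_audio)               # step 1: frame the training audio
    L = split_frames(test_audio)                      # step 1: frame the testing audio
    H_low = [upsample_lf(l_i) for l_i in L]           # step 3: interpolated LF part H^l
    chosen = one_pass_restore(L, H_train, phi, psi)   # step 4: best-matching training frames
    frames = [lf + extract_hf(h)                      # step 2: keep only the HF part of each
              for lf, h in zip(H_low, chosen)]        #         chosen frame and add it to H^l
    return overlap_add(frames)                        # reassemble H* from overlapped frames
```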
The fourth step of the flow compares \Phi(L, H) + \Psi(H)/T for each possible combination of the h_{train,j}^h. However, the complexity of this computation increases exponentially as the number of frames grows. To find a good tradeoff between efficiency and quality, we adopt a one-pass algorithm [4], which only searches for the frame that best matches the previously selected high-frequency frame and the current testing frame. We find that the one-pass algorithm is of satisfactory quality and utility for this problem, as it both gives good results and can be performed in real time.
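The one-pass search is only described verbally above; the sketch below shows one plausible greedy realization using the potentials \phi and \psi defined earlier. The parameter values (T, the overlap length) and the greedy left-to-right strategy are assumptions about how the algorithm of [4] could be adapted, not a transcription of the authors' implementation.

```python
def one_pass_restore(L_frames, H_train, phi, psi, T=0.15, overlap=720):
    """Greedy one-pass selection of training frames.

    For every observed low sample-rate frame, pick the training frame that
    minimizes the local likelihood potential phi plus the compatibility
    potential psi with the previously chosen frame (weighted by 1/T).
    T = 0.15 and overlap = 720 samples (15 ms at 48 kHz) are assumptions.
    """
    chosen, prev = [], None
    for l_i in L_frames:
        def energy(h):
            e = phi(l_i, h)
            if prev is not None:                 # first frame has no left neighbour
                e += psi(prev, h, overlap) / T
            return e
        best = min(H_train, key=energy)
        chosen.append(best)
        prev = best
    return chosen
```

In the full method, only the high-frequency part of each chosen frame would be retained and added to the interpolated low-frequency part h_i^l, as described in Sect. 2.2.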
2.4 Audio Feature
To measure the difference between two audio frames for the matching computation, we need to specify the distance measure in formula (4). Here, the Euclidean distance is adopted. However, it is inappropriate to compute the Euclidean distance directly on the samples in each frame, due to possible phase shifts
and the large number of samples per frame. Therefore, it is desirable to extract features from each frame and use them to compute the distance instead of using the samples directly. In this paper, we adopt MFCC features. MFCCs are widely used in the field of speech recognition and have proved very effective in modelling the spectral magnitude of audio signals. Their extraction takes the human auditory characteristics into account by adopting filter banks and transforms similar to the human auditory system. A more detailed introduction to MFCCs is presented in [5].
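As a concrete illustration, the MFCC feature vector of one frame could be computed with an off-the-shelf library such as librosa; the choice of library and all parameter values below are assumptions made for this sketch, since the paper does not specify an implementation.

```python
import librosa
import numpy as np


def frame_mfcc(frame, sr=48000, n_mfcc=13):
    """Return a single MFCC feature vector for a short audio frame.

    librosa and the parameters (13 coefficients, 512-point FFT) are
    illustrative assumptions, not choices made in the paper.
    """
    frame = np.asarray(frame, dtype=float)
    m = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=n_mfcc, n_fft=512)
    return m.mean(axis=1)   # average over the few analysis windows in the frame
```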
3 Experiments and Results
In this section, experiments on human speech are presented to test the proposed HF restoration method. We record the speech of one male speaker in a common meeting room with no other sound source. To improve the audio quality, we filter the recording with a denoiser in the CoolEdit software. We also carefully remove all continuous silent segments longer than 0.5 s to make the speech more compact. This results in a 10-minute speech audio pool with little background noise. The format of the original speech audio is 48 kHz, 16 bits, mono-channel.

From the audio pool, we randomly select 4 minutes of continuous speech as the training audio and a distinct 20 seconds of continuous speech as the testing audio. The testing audio is then downsampled to 6 kHz. Both the training audio and the testing audio are divided into 20 ms frames with a 5 ms hop size. The choice of the parameters T and σ in formulae (4) and (7) is empirical; in fact, we can fix one parameter to 1 and estimate the other. In the experiments below, we set σ to 1 and estimate T by a simple heuristic search, which yields T = 0.15 and proves suitable for this problem.

Figure 2(a) shows the restored high sample-rate audio (middle) for the first 2 seconds of the testing set in the time domain, in contrast with the original audio (top) and the upsampling result (bottom). Here, the upsampling method is used as a comparison baseline because it produces results that preserve exactly the same sound content as the input low sample-rate speech, so comparisons can be made at the same sample rate. Compared with the original audio, the upsampling result loses many fine details, in that some small lumps and spindle structures of the original waveform degenerate into lines. Such lumps and spindle structures are related to the most rapidly fluctuating components of the waveform and therefore correspond to the high frequencies. The HF-restored speech produced by our method recovers these details quite well, which means the matching mechanism is effective in retrieving the high frequencies.

In addition, the spectral configurations of the three audio signals are compared in Figure 2(b). The high frequencies beyond 3 kHz decay rapidly in the upsampled audio, while the spectrum of our restored audio has approximately the same configuration as that of the original audio. This provides a more convincing proof that the proposed method can capture the relation between the high-frequency and low-frequency components of the speaker's voice,
instead of guessing the high-frequency content in a heuristic way. Therefore, artifacts that are irrelevant to the speaker can be safely avoided.
Fig. 2. Comparative experimental results. (a) Comparison in the time domain (amplitude vs. time for the original, restored, and upsampled audio). (b) Comparison in the frequency domain (amplitude in dB vs. frequency in Hz for the upsampled, restored, and original audio).
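The framing and downsampling described at the beginning of this section (48 kHz source, 6 kHz test input, 20 ms frames, 5 ms hop) could be set up as in the following minimal sketch; the use of scipy's decimate as the downsampler and the placeholder signal are assumptions.

```python
import numpy as np
from scipy.signal import decimate


def split_frames(x, sr, frame_ms=20, hop_ms=5):
    """Split a signal into overlapping frames (20 ms frames, 5 ms hop)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    return [x[s:s + frame] for s in range(0, len(x) - frame + 1, hop)]


# Assumed preparation of the test data: the 6 kHz input is obtained by
# decimating the 48 kHz recording by a factor of 8.
high_rate = np.random.randn(48000 * 20)        # placeholder for 20 s of 48 kHz speech
low_rate = decimate(high_rate, 8)              # 6 kHz low sample-rate test audio
L_frames = split_frames(low_rate, sr=6000)     # 20 ms frames with 5 ms hop
H_frames = split_frames(high_rate, sr=48000)
```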
To evaluate the quality of the HF-restored audio, we adopt the MSE criterion frequently used in speech enhancement [3]. This criterion computes the mean square error (MSE) between the logarithms of the spectra of the original and estimated signals. It is generally believed to be correlated with the quality of the speech signal and to be more perceptually meaningful than the MSE between the original and estimated waveforms. We compute the MSE value
for each frame and average over all frames. For validation, we also average over 10 runs with randomly selected 4-minute continuous training sets and 20-second continuous testing sets. This yields MSE values of 0.2641 for the upsampled audio and 0.1862 for our HF-restored audio, which indicates that the quality of the restored speech does improve. Informal listening tests support the same conclusion: compared with the upsampled audio, the audio restored by our method sounds clearer and brighter, with less of the blur and obscureness present in the low sample-rate audio.

Experiments also reveal that the HF restoration results are closely related to the length of the training set. For evaluation, training audio of different lengths is used, and the MSE values between the output and the original audio are computed. The results are given in Figure 3. When the training set is short, the MSE is large, but it decreases rapidly as the length increases. When the length exceeds 80 seconds, the curve levels off except for fluctuations within a small range. This also explains why we choose a relatively small training set: 80 seconds may already be a sufficient length for this task. However, more experiments are needed to draw a firmer conclusion.
Fig. 3. The MSE value decreases as the length of training set increases.
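For reference, the log-spectral MSE criterion used in the evaluation above could be computed per frame as in the following minimal sketch; the plain FFT magnitude spectrum and the flooring constant are assumptions, since the paper does not state which spectral estimate it uses.

```python
import numpy as np


def log_spectral_mse(original_frame, estimated_frame, eps=1e-10):
    """MSE between the log-magnitude spectra of two aligned frames.

    The FFT-based magnitude spectrum and eps (to avoid log of zero) are
    assumptions; the per-frame values are averaged over all frames.
    """
    s_orig = np.log(np.abs(np.fft.rfft(original_frame)) + eps)
    s_est = np.log(np.abs(np.fft.rfft(estimated_frame)) + eps)
    return float(np.mean((s_orig - s_est) ** 2))
```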
4 Conclusion and Future Work
This paper proposes a novel high-frequency restoration method for low sample-rate speech based on Bayesian inference. With this method, the problem of HF restoration is turned into a MAP estimation, by which the original high sample-rate audio can be estimated as the optimal solution to formula (2). Compared
with upsampling methods, our method properly reconstructs the high-frequency components of the original audio instead of introducing artifacts that are irrelevant to the speaker. The experimental results demonstrate the validity and effectiveness of the method.

Admittedly, the current method is not yet perfect. Although the quality of the restored audio is better than that of the upsampling result, there is still much room for improvement compared with the original audio. To further improve the method, our future work will focus on selecting a combination of features that better reflects the characteristics of the original audio. It also makes sense to test the method on other kinds of audio data, such as music.
Acknowledgements. We would like to acknowledge Yungang Zhang, Yangqiu Song, and Yonglei Zhou for their helpful discussions. We also gratefully thank the anonymous reviewers for their valuable comments.
References

1. P. Brémaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues, Springer, New York, 1999.
2. C. I. Cheng, "High-frequency compensation of low sample-rate audio files: A wavelet-based spectral excitation algorithm", Proc. of the International Computer Music Conference, Sept. 1997.
3. Y. Ephraim, H. Lev-Ari, W. J. J. Roberts, "A brief survey of speech enhancement", The Electronic Handbook, CRC Press, 2003.
4. W. T. Freeman, T. R. Jones, E. C. Pasztor, "Example-based super-resolution", IEEE Computer Graphics and Applications, Vol. 22, No. 2, pp. 56-65, March-April 2002.
5. L. Rabiner, B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.
6. M. Vetterli, "A theory of multirate filter banks", IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 35, No. 3, pp. 356-372, March 1987.