fast sola-based time scale modification using modified ... - CiteSeerX

0 downloads 0 Views 184KB Size Report
1. INTRODUCTION. Time scale modification (TSM) [1-8] is a class of algorithms to change the time scale of a signal without ..... 374-390, Jun 1981. [3] D.W. Griffin, J.S. Lim, "Signal Estimation from. Modified Short Time Fourier Transform", IEEE.
FAST SOLA-BASED TIME SCALE MODIFICATION USING MODIFIED ENVELOPE MATCHING Peter H. W. Wong*, Oscar C. Au** Department of Electrical and Electronic Engineering The Hong Kong University of Science and Technology Clear Water Bay, Kowloon, Hong Kong. Email: [email protected]*, [email protected]**, Tel.: +852 2358-7053** ABSTRACT Time scale modification (TSM) of speech and audio is useful in many applications. Synchronized Overlap-andAdd (SOLA) is a time-domain TSM algorithm known to achieve good speech and audio quality. One problem of SOLA is that it requires a large amount of computation. In this paper, we propose a technique called Modified Envelope-Matching TSM (MEM-TSM) to simplify the computation. The MEM-TSM improves the previously proposed EM-TSM in quality and speed up factor. The proposed algorithm can reduce computation by 300 times with good perceptual quality of time-scaled speech and audio. Our experimental results show that the quality of MEM-TSM is almost the same as SOLA. 1. INTRODUCTION Time scale modification (TSM) [1-8] is a class of algorithms to change the time scale of a signal without changing the perceptual characteristics of the signal. There is an associated parameter, called the time scaling factor α. When α is one, the signal is unchanged. When α is greater than one, the signal is time expanded. When α is less than one, the signal is time compressed. A straightforward way to achieve TSM is simple interpolation followed by decimation. But this would introduce undesirable pitch shifting causing human speech to be heavily distorted. As a result, more intelligent methods are needed, such as overlap-and-add (OLA) and synchronized OLA (SOLA)[6]. In OLA, the signals are divided into frames and adjacent frames are overlapped, weighted and added together. SOLA improves over OLA by synchronizing the adjacent frames at the optimal overlapping points. One weakness of SOLA is the high computational cost to search for the optimal overlapping point. We previously proposed a technique called Envelope Matching Time Scale Modification (EM-TSM) [8] which reduces significantly the computational complexity of SOLA. Instead of using the full-precision signals, EM-TSM uses the 1-bit envelope function which is essentially the 1-bit quantized version of the signals. The subjective quality of EM-TSM was good while achieving

a reasonably high speed-up. However, there is still room for perceptual quality improvement and computation reduction. In this paper, we reduce the computation of EM-TSM further by introducing zero crossing point reduction and predictive search. We further improve the quality of the time-scaled signals by introducing multiple candidates re-examination, and frame size modification. 2. REVIEW OF SOLA In SOLA, the input (or analysis) signal x[n] is segmented into overlapping frames of length N that are a distance of Sa apart. The first frame is directly copied to the output (or synthesized) signal y[n]. Then, for any m>0, the (m+1)th frame which starts at mSa is shifted along the synthesized signal y[n] around the target location mSs within the range of [kmin, kmax] to find a location which maximizes the cross-correlation function, R'[k], defined in eqn (1), where Lk is the length of the overlapping region between the shifted analysis frame and synthesized signal. With the optimal location found, the overlapping region is cross-faded and the rest of the analysis frame, if any, is directly copied to the synthesized signal. Lk −1

∑ y[mS s + k + i ] ⋅ x[mS a + i]

R' [k ] =

i =0

1

(1)

Lk −1  2 y 2 [mS s + k + i] ⋅ ∑ x 2 [mS a + i ] ∑ i = i = 0 0   Lk −1

The Sa and Ss are the analysis and synthesis frame period respectively, related by Ss = α Sa, where α is the time scaling factor. Usually kmin and kmax are set to be –N/2 and N/2 respectively. The Lk can take on any integer between 0 and N and its value depends on the lag k and the time-varying length of the synthesized signal. 3. REVIEW OF EM-TSM The Envelope Matching Function In [8], we propose to use the 1-bit envelope functions instead of the original full precision signals in the search to reduce the computational requirement in SOLA. The envelope functions are defined in Eqn. 2, which are effectively the 1-bit quantized versions of the 3.1

signals. The corresponding normalized cross-correlation function, R[k], which we called the envelope-matching function (EMF) is shown as Eqn (2). Lk −1

∑ sign{y[mSs + i + k ]}⋅ sign{x[mSa + i ]} i =0

R[ k ] =

Lk

(2)

1, x ≥ 0 (3) sign ( x ) =   − 1, x < 0

The EMF is much simpler than the original normalized cross correlation using full precision. 3.2

The Simplified Envelope Matching Function The first computation reduction measure for EM-TSM is a simplified but equivalent expression of the envelope-matching function. R[k ] =

βk 

2 Lk 

M k + N k − 2 rk



j =1

 ( −1 ) j +1 ck , j + ( −1 )M K + N k Lk  (4) 

where β k = sign{y[mS s + k ]}⋅ sign{x[mSa ]} In Eqn. (4), ck,j is the j-th zero-crossing location in the overlapping region, not counting the locations at which a analysis frame zero-crossing point coincides with a synthesis frame zero-crossing point. Mk is the number of zero-crossing points in the analysis frame within the overlapping region. Nk is the number of zero crossing points in the systhesis frame within overlapping region. rk is the number of coincided zero crossing points. Notice that Eqn. 4 is considerably simpler than Eqn. 2. In Eqn. 2, the required computation of R[k] for a particular k includes Lk 1-bit multiplication, (Lk-1) additions and one full-precision division. In Eqn. 4, it is reduced to M k + N k − 2rk additions, one bit shifting (multiplication by 2), one conditional sign change (multiplication by β k ) and one full-precision division. In other words, all the Lk 1-bit multiplication are eliminated, and the additions are reduced significantly because M k + N k − 2rk is usually significantly smaller than (Lk-1). Here is an example to illustrate the advantage of Eqn. 4. In the example, implementation using Eqn. 2 requires 48 1-bit multiplication, 47 additions and one division. Implementation using Eqn. 4 requires 12 additions, one division and one shifting. R[k ] = [10 − (16 − 10) + (22 − 16) − (24 − 22) + (36 − 24) − (42 − 36) + (48 − 36)] / 48 = [2(10 − 16 + 22 − 24 + 36 − 42) + 48] / 48 Analysis frame

Synthsized signal

-

+ 10

16

+

30

-

+

-

+ 22

-

+ 24

42

30

48 +

36

48

3.3 Recursive Property and Search Point Reduction Although Eqn. 4 was computationally simple, the complexity of EM-TSM was further reduced since R[k] can be computed recursively. In that case, Eqn. 4 was applied to compute only some k while a considerably simpler equation was used for the other k. 4. PROPOSED MODIFIED EM-TSM (MEM-TSM) 4.1 Zero crossing points reduction The speed-up factor of EMF-TSM depends on the amount of zero crossing points of the signal. If high frequency components or noise exist in the signal, the signal will have a large amount of zero-crossing points requiring a large amount of computation even in EM-TSM. As the low frequency components tend to dominate the search results, the high frequency components tend to be less important in TSM. As a result, we propose a simple and efficient method to reduce the high frequency components to reduce zero-crossing points and consequently the overall computation without significant perceptual degradation. If the distance between any two adjacent zero-crossing points in the analysis frame is less than a pre-defined threshold T, the pair is removed. Similarly, pairs of adjacent points closer than a threshold are removed from the systhesis frame. Obviously, a larger threshold results in larger speed-up factor of EM-TSM with larger distortion. With the premise to synthesize high-quality time-scaled signal, we found in our experiments that good threshold values depend on the sampling frequency. For signals sampled at 8192Hz, a good threshold value T may be 8 while it can be up to 16 if the sampling frequency is 44100Hz. 4.2 Predictive SOLA Since there are Sa samples between two consecutive frames, the initial part of the current analysis frame may be embedded in the synthesized signal. The embedded signal is highly correlated with the initial part of the current frame. We propose to skip the searching for the m-th frame if Eqn. 5, 6 and 7 are satisfied.

l' = lag

m −1

+ Sa − Sa (5)

kmin ≤ l' ≤ kmax

(6)

Lk −1

∑ y [ (m − 1)Ss + lag m −1 + i ] ⋅ x [ (m − 1)Sa + i ] i =0

1

≥ T2 (7)

Lk −1 2  Lk −1 2 2  ∑ y [ (m − 1)S s + lag m −1 + i ] ⋅ ∑ x [ (m − 1)Sa + i ]  i =0   i = 0

where lagm-1 is the optimal lag of the (m-1)-th frame and T2 is the average normalized cross-correlation between the previous ten analysis frames and the synthesis frames.

4.3

Multiple candidate re-examination We improve the quality of EM-TSM by multiple candidate re-examination. We re-examine the best M candidates from EM-TSM using the sub-sampled normalized cross-correlation function defined in Eqn. 8 where q is the sub-sampling factor. We use a half-way stop strategy. The re-examination is stopped if the sub-sampled normalized cross-correlation of a candidate is greater than a threshold T2. The T2 is the same threshold used in the optimal lag prediction process in section 4.2. ( Lk −1) / q 

∑ y [ (m − 1)Ss + lag + q ⋅ i ] ⋅ x [ (m − 1)Sa + q ⋅ i ] i =0

 (Lk −1) / q   

∑ y 2 [ (m − 1)Ss + lag + q ⋅ i ] ⋅ i =0

(Lk −1) / q 

1

(8)

2 ∑ x2 [ (m − 1)Sa + q ⋅ i ]  i =0 

4.4 Frame Size Modification In SOLA, the frame size is fixed. If the optimal lag of the current frame is negative, the length of the synthesized signal is shorter than the target length resulting in undesirable time scale distortion. Instead, we modify the frame size dynamically in MEM-TSM. For each frame, if the optimal lag is negative, we increase the frame size by the absolute value of the optimal lag. If the optimal lag is positive, we shorten the frame by the value of the optimal lag. This can eliminate time scale distortion since the length of the synthesized signal does not vary with the optimal lag. In SOLA, if the optimal lag of a frame is negative, the chance for the next analysis frame to find a good match is low, because the cross faded region is not highly correlated with the next frame. Increasing the length of this frame can increase the chance for the next analysis frame to have a good match. 5. OBJECTIVE QUALITY MEASUREMENT To evaluate the quality of the output (or synthesized) signal, objective methods were used. We use Average Normalized Cross Correlation (ANC) defined as ANC = L m −1

1 M

∑ y (m × Ss + lag m + i ) ⋅ x (m × Sa + i )

M



m =1

i =0

L m −1

L m −1

i=0

i =0

2 2 ∑ [ y (m × Ss + lag m + i )] ⋅ ∑ [x (m × Sa + i )]

( 11)

where lagm is the optimal k that maximizes the R[k], Lm is the length of overlapping region corresponding to lagm and M is the number of frames of the signal. 6. EXPERIMENTAL RESULTS AND DISCUSSION The proposed MEM-TSM is simulated together with SOLA, EM-TSM and a fast SOLA using FFT on a personal computer under the Windows2000 environment. Two test signals are used. The first one called 'Song1' is a

120-second classical music signal, recorded directly from a Compact Disc (CD) at a sampling frequency of 44100Hz and a resolution of 16 bits. The second one called 'Song2' is a 127-second mono song sung by a male singer, recorded directly from a CD at 22050Hz and 8 bit resolution. The simulation is performed for 14 time scaling factors (0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.125, 1.25, 1.5, 1.625, 1.75, 1.875 and 2.0). The quality of the time scaled signals is found to vary with the parameters: the frame size N, the analysis frame period Sa and the synthesized frame period Ss. Since Sa = α Ss, only N and Sa can be adjusted in the simulation. Recall that the analysis frame is shifted along the synthesized signal in order to find the optimal lag. If it is shifted to a position with large lag, the overlapping region may be too short and this is not desirable. To prevent this, we set the minimum length of the overlapping region to be N/4. In our experiments, N is 20ms long and Sa = round(N/4α). The ANC with α=1.5 for ‘Song1’ is shown in Fig. 1. The corresponding speed up factor is shown in Fig. 2. A speed-up factor of up to 350 compared with SOLA can be achieved by MEM-TSM with less than 0.02 losses in ANC. The ANC, operations per second and the speed-up factors for “Song1” and “Song2” are shown in Tables 1 and 2 respectively. The ANC of SOLA is high as expected, being 0.951 and 0.976 for “Song1” and “Song2”, while its speed is slow. The ANC of EM-TSM are significantly lower, only 0.453 and 0.646 for ‘Song1’ and ‘Song2’ respectively. The proposed MEM-TSM improves the ANC of EM-TSM to more than 0.94 in both cases, close to that of SOLA. The improvement in ANC is mainly contributed by multiple candidate re-examination. The speed-up factors for EM-TSM are 79.2 and 33.5 for ‘Song1’ and ‘Song2’ respectively. MEM-TSM improves the speed up factor of EM-TSM by more than 4 times. The improvement is mainly contributed by zero crossing reduction. The speed up factors of MEM-TSM in ‘Song2’ is smaller than in ‘Song1’ since the sampling frequency is smaller. 7. CONCLUSIONS In this paper, a technique called Modified Envelope Match Time Scale Modification (MEM-TSM) is proposed as a fast time scale modification method using the SOLA approach. It uses the 1-bit envelope function and incorporates several novel ways to reduce computation and improve quality over the existing EM-TSM algorithm. MEM-TSM can achieve a speed-up factor of up to 350 with good perceptual quality. The reduction in computation complexity allows the time scale modification of signals to be implemented practically by a single-processor machine in multimedia applications.

Figure 1: ANC for α=1.5, dash line-SOLA, circle-EMF-T=4, q=1, cross-EMF-T=4, q=5, plus-EMF-T=4, q=11, star-EMF-T=14, q=1, square-EMF-T=14, q=5, diamond-EMF-T=14, q=11.

Figure 2 Speed up factor for α=1.5, circle-EMF-T=4, q=1, cross-EMF-T=4, q=5, plus-EMF-T=4, q=11, star-EMF-T=14, q=1, square-EMF-T=14, q=5, diamond-EMF-T=14, q=11. Method ANC Oper./s Speed up SOLA 0.951 1.02E+09 1 SOLA (FFT) 0.951 3.07E+08 3.3 EM-TSM 0.453 1.29E+07 79.2 MEM-TSM T=0, 0.953 2.12E+07 48.0 M=40, q=15 MEM-TSM T=8, 0.950 6.83E+06 149.0 M=12, q=11 MEM-TSM T=14, 0.943 3.22E+06 316.2 M=4, q=15 Table 1 MSD, ANC, operation per second and speed-up factor for MEM-TSM, SOLA, EM-TSM and fast SOLA using FFT for ‘Song1’ with α=1.5

REFERENCE [1] David Malah, "Time Domain Algorithms for Harmonic Bandwidth Reduction and Time Scaling of Speech Signals", IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 27, No. 2, Apr. 1979. [2] M.R. Portnoff, "Time Scale Modification of Speech based on Short Time Fourier Analysis", IEEE Trans. on Acoustics, Speech, Signal Processing, Vol. 9, pp. 374-390, Jun 1981. [3] D.W. Griffin, J.S. Lim, "Signal Estimation from Modified Short Time Fourier Transform", IEEE Trans. Acoustics, Speech, Signal Processing, Vol. ASSP_32, pp. 236-243, Apr. 1984. [4] Quatieri, R.S. McAulay, "Speech Transformation Based on a Sinusoidal Representation", Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Vol. 1, pp. 489-492, Mar. 1985. [5] T. F. Quatieri, R. J. McAulay, "Shape Invariant Time-Scale and Pitch Modification of Speech", IEEE Transactions on Signal Processing, vol. 40, no. 3, p.497-510, March 1992. [6] S. Roucos, A.M. Wilgus, ``High Quality Time Scale Modification for Speech'', Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Vol. 1, pp. 493-496, 1985. [7] J. Laroche, "Autocorrelation Method for HighQuality Time/Pitch-Scaling”, Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 131-134, 1993. [8] J.W.C. Wong, O.C. Au, P.H.W. Wong, "Fast Time Scale Modification using Envelope-Matching Technique (EM-TSM)", Proc. of IEEE Int. Sym. on Circuits & Systems, Vol. 5, pp. 550-553, May 98. Method ANC Oper./s Speed up SOLA 0.976 2.58E+08 1 SOLA (FFT) 0.976 1.42E+08 1.8 EM-TSM 0.646 1.29E+07 33.5 MEM-TSM T=0, 0.976 8.37E+06 31.0 M=44, q=13 MEM-TSM T=4, 0.975 3.02E+06 86.1 M=16, q=15 MEM-TSM T=14, 0.968 1.60E+06 162.6 M=8, q=15 Table 2 MSD, ANC, operation per second and speed-up factor for MEM-TSM, SOLA, EM-TSM and fast SOLA using FFT for ‘Song2’ with α=1.5

Suggest Documents