Proceedings of the 8th WSEAS International Conference on SIGNAL PROCESSING, ROBOTICS and AUTOMATION 304
ISSN: 1790-5117
ISBN: 978-960-474-054-3
Bandwidth Extension for Mixed Asynchronous Synchronous Speech Transmission
Jürgen Freudenberger
HTWG Konstanz, University of Applied Sciences, Department of Computer Science
Brauneggerstraße 55, 78462 Konstanz, Germany
[email protected]

Abstract: Increasing the bandwidth of speech signals from the classical telephone bandwidth of 300-3400 Hz to the wider bandwidth of 50-7000 Hz results in increased intelligibility and naturalness. This work presents a low-complexity, low-latency bandwidth extension approach for speech coding that is suitable for mixed asynchronous and synchronous transmission. At the encoder, the speech signal is sampled at a rate of 16 kHz and decomposed into two subbands of equal bandwidth. The lower subband signal is transmitted using a synchronous connection-oriented link. The signal of the higher subband is analyzed, resulting in signal parameters that are transmitted asynchronously. This technique can be used with communication links that provide simultaneous data and narrowband speech transmission, for example Bluetooth.

Key-Words: speech coding, bandwidth extension, wideband speech transmission, mixed asynchronous synchronous transmission, Bluetooth
1 Introduction
Most speech coding systems in use today are based on a narrowband speech bandwidth, nominally limited to about 300-3400 Hz and sampled at a rate of 8 kHz. The Adaptive Multi-Rate Wideband (AMR-WB) speech codec [3] and the ITU-T recommendation G.729.1 [2] are newer speech coding standards that enable wideband audio transmission with a bandwidth of 50 Hz - 7 kHz. This extension of the classical narrowband signal will significantly improve the quality of speech transmission in future telecommunication applications: increasing the bandwidth of the transmitted speech signal increases the intelligibility and naturalness of speech. The publications [7, 4, 5] provide a closer look at today's wideband speech and audio coding standards.

This paper considers wideband speech transmission where the subscriber uses a hands-free car kit or a headset. Originally introduced as optional accessories connected to the mobile phone by a wire, hands-free devices are now generally available with wireless technology. Typically they use Bluetooth as the wireless link to the mobile phone. However, speech transmission according to the current Bluetooth standard does not support the new wideband technology. Currently, the speech signal is transmitted as a 64 kbps PCM coded signal without additional speech coding. The bandwidth is therefore limited to the narrowband frequency range. Speech coding is omitted to reduce the overall transmission latency and the complexity of the Bluetooth device. Moreover, this avoids transcoding between lossy speech codecs, which almost always introduces generation loss. The PCM signal provides a high quality input signal for the speech codec of the mobile network.

To overcome the missing wideband support on the Bluetooth link, the ITU-T Focus Group on From/In/To Cars Communication II has suggested the introduction of the ITU-T G.722 standard for Bluetooth speech transmission [6]. G.722 is the 64 kbit/s ITU-T standard for wideband applications, which was recommended in 1988 [1]. It is essentially a two-subband coder with ADPCM coding of each subband. At the encoder, the speech signal is sampled at a rate of 16 kHz and decomposed into two subbands of equal bandwidth. Each subband signal is downsampled by a factor of two prior to encoding. The subband decomposition is performed using Quadrature Mirror Filters (QMF) with finite impulse response. For the G.722 standard, the QMF have 24 coefficients. This configuration introduces a total delay of 3 ms. The two subband signals are then ADPCM coded and quantized using different resolutions, e.g. 6 bits/sample for the lower band and 2 bits/sample for the higher band. The G.722 codec has low delay and low complexity. On the
other hand, it is not backward compatible with the current Bluetooth speech transmission.

In this paper we present a low complexity speech coding scheme similar to the ITU-T G.722 standard that also works on a sample basis to achieve a low latency. We propose to use the standard synchronous connection-oriented speech transmission link to transmit the lowpass part of the speech signal (50 Hz - 4 kHz). In addition, some side information is transmitted via an asynchronous connection link. This information can be used at the receiver side to reconstruct the highpass part of the speech signal (4 kHz - 8 kHz) from the lowpass signal. This approach is backward compatible with the present Bluetooth standard. Furthermore, it also avoids transcoding for the lowpass part of the speech signal. Our approach is therefore a hierarchical speech codec, where the asynchronously transmitted side information improves the audio fidelity at the receiver.

In the next section we briefly discuss the different available Bluetooth link types. In section 3 and section 4 we present the encoding and decoding of our bandwidth extension approach, respectively. The complexity is discussed in section 5. Some results and concluding comments are given in sections 6 and 7.

Figure 1: Encoder structure. [Block diagram with the blocks LP, HP, downsampling by 2, modulation by xmod(k), power analysis and LPC analysis; input x(k); internal signals xlow(k), x′low(k), xhigh(k) and xmirror(k); outputs to the SCO link and the ACL link.]
2 Bluetooth link types
The Bluetooth standard specifies two types of physical links: synchronous connection-oriented (SCO) and asynchronous connectionless (ACL). The SCO link is a symmetric point-to-point link between two Bluetooth devices. It is a circuit-switched connection that does not provide retransmission of corrupted data packets. Usually the SCO link is used for speech transmission. The ACL link is a packet-switched connection type. Retransmission is used to ensure data integrity. Version 1.2 of the Bluetooth standard introduced Extended Synchronous Connections (eSCO), which improve the voice quality of audio links by allowing retransmissions of corrupted packets. Furthermore, there is a packet type that mixes synchronous and asynchronous data. With this packet type, a payload of 10 bytes of synchronous data is transmitted with each packet. Optionally, an asynchronous payload of variable size (0-9 bytes) can be added. This transmission mode is well suited for the proposed transmission scheme, which is explained in more detail in the next section.
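As a rough plausibility check of the mixed packet type, the following sketch works out how often such a packet would be sent and whether the side information fits into its optional payload. The 64 kbps PCM rate, the 10 synchronous bytes and the 0-9 asynchronous bytes are taken from the text, and the three-byte side information sent every 10 ms is the proposal of section 3. That a packet is sent as soon as 10 speech bytes are available is an assumption of this sketch, and all names are illustrative.

```python
# Figures from the text; the packet timing is derived under the assumption
# that one packet is sent for every 10 bytes of 64 kbit/s PCM speech.
SYNC_BYTES_PER_PACKET = 10
MAX_ASYNC_BYTES_PER_PACKET = 9
PCM_BYTES_PER_SECOND = 64_000 // 8
SIDE_INFO_BYTES = 3            # proposed in section 3
SIDE_INFO_INTERVAL_MS = 10     # proposed in section 3


def packet_budget():
    """Return packets per second, packet interval in ms, packets per
    side-information update, and whether the side information fits."""
    packets_per_second = PCM_BYTES_PER_SECOND // SYNC_BYTES_PER_PACKET
    packet_interval_ms = 1000 / packets_per_second
    packets_per_update = round(SIDE_INFO_INTERVAL_MS / packet_interval_ms)
    fits = SIDE_INFO_BYTES <= MAX_ASYNC_BYTES_PER_PACKET
    return packets_per_second, packet_interval_ms, packets_per_update, fits
```

Under these assumptions the side information would only have to ride on one packet out of eight, and its three bytes stay well below the 9-byte limit of the optional payload.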
3 Encoding
The basic encoder structure as depicted in Fig. 1 is a split band structure similar to ITU-T Rec. G.729.1 [4]. The analog speech signal x(t) is sampled with a sampling rate of fs = 16 kHz, resulting in the discrete wideband signal x(k) with a bandwidth of 7 kHz. The encoder of our speech codec splits the input signal x(k) into two subband signals xlow(k) and xhigh(k). The signal xlow(k) is obtained by lowpass filtering the input with a cutoff frequency of 3.6 kHz. The lowpass signal xlow(k) is then downsampled to the Bluetooth sampling rate f′s = 8 kHz. The downsampled signal x′low(k) is transmitted without further coding using the SCO link of the Bluetooth connection. The signal xhigh(k) is obtained by highpass filtering with a cutoff frequency of 4.4 kHz. This signal is further processed to calculate the side information that is transmitted via the ACL link.

Note that for the corner frequencies specified in this section, as well as for the simulation results in the subsequent sections, we always assumed IIR filters of order 6. As we will see later, the filter order mainly determines the overall complexity of the encoding and decoding. It also determines the latency of the encoding and decoding. With filter order 6, the delay for encoding/decoding is less than 1 ms.

The encoder is based on a Quadrature Mirror Filter (QMF) similar to the structure of the G.722 standard. However, G.722 uses FIR filters with 24 coefficients and overlapping transfer functions for the two subbands. This results in aliasing when the subband signals are subsampled, but the synthesis QMF filter bank at the receiver proposed in G.722 ensures that this aliasing is canceled. With our approach we cannot apply heavily overlapping transfer functions, because the resulting aliasing could not be canceled. The transfer functions of the elliptic lowpass and highpass filters used are depicted in Fig. 2. They are designed with a small overlap and a minimum stopband attenuation of 50 dB. The splitting of the frequency bands is illustrated in Figures 3 to 5.

Figure 2: Transfer functions of the lowpass (dotted line) and highpass filter (solid line). [Plot: filter transfer functions in dB (0 to −60 dB) versus frequency in Hz (0 to 8000 Hz).]

Figure 4: Spectrum of the lowpass signal xlow(k).
Figure 3: Spectrum of the input signal and its images (dotted lines).
Figure 5: Spectrum of the highpass signal xhigh(k).
Figure 6: Spectrum of the modulated signal xhigh(k) · sin(π/2 · k).
Figure 7: Spectrum of the modulated signal xmirror(k).

From the downsampled signal x′low(k) we calculate the signal power by recursive smoothing

$$P_{low}(k) = (1-\gamma)\,P_{low}(k-1) + \gamma\,x'^{2}_{low}(k), \qquad (1)$$

where γ ∈ (0, 1) is the smoothing parameter.

To calculate the signal parameters of the highpass signal, this signal is mapped to an equivalent lowpass signal with the same frequency envelope. The signal xhigh(k) is therefore downsampled and modulated (multiplied) with the alternating sequence xmod(k) = (−1)^k = 1, −1, 1, −1, . . .. Figures 6 and 7 illustrate the mirroring of the highpass signal. Note that xmod(k) is the downsampled version of the discrete sinusoid sin(π/2 · k) = 0, 1, 0, −1, 0, 1, 0, −1, . . .. Considering the signal xhigh(k), the modulation with sin(π/2 · k) results in the spectrum of Fig. 6, where the image of the high signal band (4 kHz - 8 kHz) of the signal xhigh(k) is mapped to the low signal band (0 kHz - 4 kHz). Obviously, this signal can be subsampled by a factor of two without an anti-aliasing filter. Therefore, this lowpass filter is omitted. We subsample the signal xhigh(k) (f′s = 8 kHz) and multiply it by the subsampled sinusoid xmod(k), resulting in the signal xmirror(k) with a spectrum according to Fig. 7.
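The mirroring step can be checked numerically with a single tone. The sketch below is only an illustration: the function name `mirror_highband` is hypothetical, and keeping the even-indexed samples when subsampling is an assumption, since the text does not fix the subsampling phase. A 5 kHz tone sampled at 16 kHz should come out as a 1 kHz tone at 8 kHz, i.e. shifted down by 4 kHz, so that the high band keeps its orientation.

```python
import math


def mirror_highband(x_high):
    """Subsample by two (keeping even-indexed samples, an assumption)
    and multiply by the alternating sequence (-1)^m."""
    return [(-1) ** m * x_high[2 * m] for m in range(len(x_high) // 2)]


FS = 16000        # wideband sampling rate in Hz
FS_LOW = FS // 2  # sampling rate after subsampling in Hz

# A 5 kHz tone in the high band, sampled at 16 kHz.
tone_5k = [math.cos(2 * math.pi * 5000 * k / FS) for k in range(64)]
mirrored = mirror_highband(tone_5k)

# The same tone shifted down by 4 kHz, sampled at 8 kHz.
tone_1k = [math.cos(2 * math.pi * 1000 * m / FS_LOW) for m in range(32)]

max_error = max(abs(a - b) for a, b in zip(mirrored, tone_1k))
```

Here `max_error` stays at the level of floating-point rounding, which matches the statement that no anti-aliasing filter is needed before subsampling the highpass signal.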
Figure 8: Decoder structure. [Block diagram with the blocks LPC analysis, A(z), B(z), gain α·δ, modulation by xmod(k), upsampling by 2, LP and HP; inputs x′low(k) from the SCO link and α, β from the ACL link; signals e(k), xlow(k) and x̂high(k); output x̂(k).]

The signal xmirror(k) is now analyzed by first order linear predictive coding (LPC). We calculate the first two coefficients r0(k), r1(k) of the autocorrelation function by recursive smoothing

$$r_0(k) = (1-\gamma)\,r_0(k-1) + \gamma\,x^{2}_{mirror}(k) \qquad (2)$$
$$r_1(k) = (1-\gamma)\,r_1(k-1) + \gamma\,x_{mirror}(k)\,x_{mirror}(k-1). \qquad (3)$$
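The three recursive estimates of equations (1) to (3) can be kept in one small state object that is updated once per 8 kHz sample. This is a minimal sketch, not the author's implementation: the class name `AnalysisState`, the zero initial values and the zero history for xmirror(k − 1) are assumptions.

```python
class AnalysisState:
    """Recursively smoothed estimates of P_low, r_0 and r_1."""

    def __init__(self, gamma):
        if not 0.0 < gamma < 1.0:
            raise ValueError("gamma must lie in (0, 1)")
        self.gamma = gamma
        self.p_low = 0.0          # assumed initial value
        self.r0 = 0.0             # assumed initial value
        self.r1 = 0.0             # assumed initial value
        self._prev_mirror = 0.0   # x_mirror(k - 1), assumed zero at start

    def update(self, x_low, x_mirror):
        """Process one sample pair and return (P_low, r_0, r_1)."""
        g = self.gamma
        self.p_low = (1.0 - g) * self.p_low + g * x_low * x_low
        self.r0 = (1.0 - g) * self.r0 + g * x_mirror * x_mirror
        self.r1 = (1.0 - g) * self.r1 + g * x_mirror * self._prev_mirror
        self._prev_mirror = x_mirror
        return self.p_low, self.r0, self.r1
```

For example, with γ = 0.5 the sample pairs (2.0, 1.0) and (2.0, −1.0) give (2.0, 0.5, 0.0) and then (3.0, 0.75, −0.5); the negative r1 reflects the sign change between consecutive samples of xmirror(k).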
From these two coefficients and the signal power Plow(k) of the lowpass signal we obtain the side information that is transmitted via the ACL link every 5 to 10 ms:

$$\alpha(k) = \sqrt{\frac{r_0(k)}{P_{low}(k)}} \qquad (4)$$
$$\beta(k) = -\frac{r_1(k)}{r_0(k)}. \qquad (5)$$

In summary, the side information consists of a power correction factor α and the single LPC coefficient β. This side information is transmitted asynchronously to the lowpass signal. It should be sent at least every 10 ms; more frequent transmissions slightly improve the fidelity of the speech signal. Because the ACL packets have a variable payload that has to be an integer number of bytes, we propose to encode the side information with three bytes (24 bits). The power correction factor can be encoded with 6 bits, leaving 18 bits for the LPC coefficient. With one update every 10 ms, this results in a data rate of 2.4 kbps for the side information.

4 Decoding

The basic decoder structure is depicted in Fig. 8. The lowpass signal x′low(k) is received from the SCO link. This signal is upsampled to obtain the wideband signal component xlow(k). The highpass signal hhigh(k) is now estimated from x′low(k) and the side information received via the ACL link. The first stage of this estimation is a linear predictor that estimates the excitation signal e(k) of the speech signal. The order of this filter is a design parameter of the receiver, where a first order filter already achieves reasonable results. The second stage is a reconstruction filter

$$B(z) = \frac{1}{1 + \beta z^{-1}}, \qquad (6)$$

where β is the LPC coefficient received via the ACL link. The third stage corrects the signal power. Here α is the transmitted correction term. The term δ is a constant that depends on the filter order of the receiver's linear predictor. This value is determined such that the signal power of the estimated signal ĥhigh(k) equals the power of the signal hhigh(k) on average. Finally, the signal is modulated and upsampled to obtain the estimate ĥhigh(k). The estimated speech signal results from the sum of the two signal components.
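The second and third decoder stages, followed by the final modulation and upsampling, can be sketched as follows. This is an illustration under stated assumptions: B(z) is taken as the causal first-order recursion 1/(1 + β z^-1) with zero initial state, the gain argument stands for the product α·δ, upsampling is done by zero insertion, and the function names are hypothetical. The linear predictor that produces e(k) and the highpass filter of Fig. 8 are not shown.

```python
def reconstruct_highband(excitation, beta, gain):
    """Filter e(k) with B(z) = 1 / (1 + beta z^-1), then scale by gain.

    The recursion is y(k) = e(k) - beta * y(k - 1); gain stands for the
    product alpha * delta described in the text.
    """
    out = []
    y_prev = 0.0  # assumed zero initial filter state
    for e in excitation:
        y = e - beta * y_prev
        out.append(gain * y)
        y_prev = y
    return out


def remodulate_and_upsample(y):
    """Multiply by (-1)^m and insert a zero after every sample
    (zero-insertion upsampling by two, an assumption of this sketch)."""
    up = []
    for m, value in enumerate(y):
        up.append((-1) ** m * value)
        up.append(0.0)
    return up
```

With β = 0.5 and unit gain, a unit impulse yields 1, −0.5, 0.25, −0.125, which is the expected impulse response (−β)^k of a first-order all-pole filter.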
5 Complexity
The complexity of the encoding and decoding is mainly determined by the IIR filters. For each sample the IIR filtering of order 6 results in 13 multiplications and 12 additions/subtractions. The QMF structure consists of two such filters operating at 16 kHz. The LPC and power analysis running at 8 kHz require only three multiplications and one addition per sample. This results in the number of operations according to Table 1. The total number of operations per second is approximately 700,000, for encoding as well as for decoding.
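The operation counts can be reproduced with a few lines of arithmetic. The sketch below assumes a direct-form IIR realization with one multiplication per coefficient, which gives the 13 multiplications and 12 additions quoted for order 6; the entries of `TABLE_1` are copied from Table 1 as printed, and all names are illustrative.

```python
def direct_form_ops(order):
    """Multiplications and additions per sample of a direct-form IIR
    filter with order + 1 feedforward and order feedback coefficients."""
    mults = (order + 1) + order
    adds = order + order
    return mults, adds


# (add./sub., mult.) per second, copied from Table 1 as printed.
TABLE_1 = {
    "power/LPC analysis": (24_000, 72_000),
    "LPC filtering": (16_000, 24_000),
    "IIR filtering": (192_000, 416_000),
}


def table_totals(table):
    """Return total additions, total multiplications and their sum."""
    adds = sum(a for a, _ in table.values())
    mults = sum(m for _, m in table.values())
    return adds, mults, adds + mults


# Two order-6 filters at 16 kHz: multiplications per second.
iir_mults_per_second = 2 * 16_000 * direct_form_ops(6)[0]
```

The multiplication count of the two filters comes out as 416,000 per second, as in Table 1, and the table entries sum to 744,000 operations per second, in line with the rounded figure of approximately 700,000.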
Table 1: Number of operations per second.

function              add./sub.    mult.
power/LPC analysis       24,000   72,000
LPC filtering            16,000   24,000
IIR filtering           192,000  416,000

Figure 9: Spectrogram of a speech sample. [Two spectrogram panels, "input signal" (upper) and "reconstructed signal" (lower); frequency in Hz (0 to 8000 Hz) versus time in s (0 to 7 s).]

6 Results

Fig. 9 presents the spectrogram of a speech sample. The upper panel is the spectrogram of the original 16 kHz signal. The lower panel corresponds to the reconstructed signal. The figure demonstrates that the envelope of the highpass signal components is well approximated. Due to the transfer functions of the IIR filters, there is a stopband around 4 kHz that cannot be reconstructed. First user tests with 12 persons (German language) revealed a significant improvement of the mean opinion score (MOS) of 0.6 (from 3.2 for narrowband speech to 3.8 with bandwidth extension). This result is comparable to the performance of the AMR-WB codec (at 12.65 kbps).

7 Conclusion

The presented speech codec approach was implemented using a standard Bluetooth stack on a standard PC, utilizing different Bluetooth dongles. Similarly, the speech codec could be implemented with a standard Bluetooth chip where the side information is calculated in the hands-free car kit. The speech quality is mainly determined by the time jitter between the ACL and SCO link. The jitter strongly depends on the Bluetooth equipment used. This problem might be overcome by using the mixed ACL and SCO packet type discussed in section 2. Unfortunately, this packet type was not supported by the employed Bluetooth stack. In summary, the presented speech coding approach has low complexity and low latency. It achieves a significant improvement of the speech fidelity compared to narrowband speech.

Acknowledgements: The author would like to thank Markus Schmid for the implementation of the Bluetooth wideband speech transmission. This research was supported by HTWG Konstanz, University of Applied Sciences, Institut für Angewandte Forschung (IAF).

References:

[1] 7 kHz audio-coding within 64 kbit/s. ITU-T Recommendation G.722, 1988.
[2] G.729-based embedded variable bit-rate coder: An 8-32 kbit/s scalable wideband coder bitstream interoperable with G.729. ITU-T Recommendation G.729.1, 2006.
[3] Adaptive Multi-Rate - Wideband (AMR-WB) speech codec, Transcoding functions. 3rd Generation Partnership Project, 3GPP TS 26.190, 2008.
[4] B. Geiser, P. Jax, P. Vary, H. Taddei, S. Schandl, M. Gartner, C. Guillaume, S. Ragot. Bandwidth extension for hierarchical speech and audio coding in ITU-T Rec. G.729.1. IEEE Transactions on Audio, Speech, and Language Processing, 15:2496-2509, 2007.
[5] Erik Larsen, Ronald M. Aarts. Audio Bandwidth Extension: Application of Psychoacoustics, Signal Processing and Loudspeaker Design. Wiley, 2004.
[6] Focus Group on From/In/To Cars Communication II. Requirements on Wideband Bluetooth Connection from FG CarCom Point of View. ITU-T, 2008.
[7] P. Jax, P. Vary. Bandwidth extension of speech signals: A catalyst for the introduction of wideband speech coding? IEEE Communications Magazine, 1:106-111, 2006.