Computer Communications 33 (2010) 1916–1927
Media coding for the next generation mobile system LTE

Kari Järvinen *, Imed Bouazizi, Lasse Laaksonen, Pasi Ojala **, Anssi Rämö

Nokia Research Center, Visiokatu 1, 33720 Tampere, Finland

* Corresponding author. ** Corresponding author. E-mail addresses: [email protected] (K. Järvinen), [email protected] (I. Bouazizi), [email protected] (L. Laaksonen), [email protected] (P. Ojala), [email protected] (A. Rämö).

doi:10.1016/j.comcom.2010.04.019
Article history: Available online 18 April 2010

Keywords: Multimedia coding; Voice coding; Audio coding; Video coding

Abstract

The introduction of LTE (Long Term Evolution) brings enhanced quality to 3GPP multimedia services. The high throughput and low latency of LTE enable higher quality media coding than what is possible in UMTS. LTE-specific codecs have not yet been defined, but work on them is ongoing in 3GPP. The LTE codecs are expected not only to improve the basic signal quality, but also to offer new capabilities such as extended audio bandwidth, stereo and multi-channel coding for voice, and higher temporal and spatial resolutions for video. Due to the wide range of functionalities in media coding, LTE gives more flexibility for service provision to cope with heterogeneous terminal capabilities and transmission over heterogeneous network conditions. By adjusting the bit-rate, the computational complexity, and the spatial and temporal resolution of audio and video, transport and rendering can be optimised throughout the media path, hence guaranteeing the best possible quality of service.

© 2010 Elsevier B.V. All rights reserved.
1. Introduction

3GPP LTE provides higher data rates and lower latency than what is possible with UMTS. This enables improved Quality of Service (QoS), e.g. via the introduction of higher quality media coding. No specific LTE media codecs have yet been defined, but work is ongoing in 3GPP to develop a voice codec for the Enhanced Voice Service (EVS) of LTE and to update the video codec support to suit the new capabilities of LTE. These are expected to provide substantial quality enhancements and potentially also improvements in transmission efficiency. The LTE voice codec work is focused on conversational services, while the video codec work focuses on streaming and broadcasting services.

In this paper, we discuss media coding for LTE, with emphasis on voice and video, covering both the technological improvements and the benefits for user experience. The potential new enablers and functionalities are reviewed with respect to improvements in user experience. Section 2 gives a brief overview of the 3GPP services, ranging from real-time conversational telephony to streaming, broadcasting/multicasting and messaging. It reviews the voice, audio and video codecs specified for LTE and UMTS in the latest 3GPP release (Release 9) and summarises their performance. Sections 3 and 4 discuss in detail the drivers for quality enhancement in the next LTE releases, the enablers and algorithms, and
the resulting improved user experience for voice and video. Concluding remarks are given in Section 5.

The 3GPP work on LTE-specific media coding is still ongoing for the next 3GPP release (Release 10), and hence this discussion is based on the current status in 3GPP, complemented by the views of the authors of this paper. The authors believe that the most essential new feature to emerge within the enhancements is the scalability of voice and video coding which, when realised with embedded coding, enables bit-rate adjustment (reduction) within the same encoded bit-stream in servers, networks and terminals. This will enable the quality of experience to stay optimal regardless of varying network conditions, available bit-rates and terminal capabilities.

2. Codecs and services in 3GPP Release 9

2.1. 3GPP services

3GPP defines a set of multimedia services together with their related media codecs and transport protocols. These services can be classified as follows:

Conversational (telephony): 3GPP has defined a generic SIP-based telephony service for packet-switched conversational (PSC) multimedia applications [1,2]. It provides low end-to-end delay, enabling real-time audiovisual communication. A specific Multimedia Telephony Service for IMS (MTSI) has also been defined [3]. It exploits the rich capabilities of IMS (IP Multimedia Subsystem). In particular, multiple media components (voice, video and real-time text) may be present in a session
and they can be dynamically added and dropped during a session. MTSI contains advanced features such as jitter management, measurement of quality of experience, and adaptation of bit-rate, packet-rate and error resilience. Both MTSI and PSC reuse and extend the existing IETF Real-time Transport Protocol (RTP) and RTP Control Protocol (RTCP).

Streaming: the 3GPP Packet-switched Streaming Service (PSS) defines a system for the streaming of live or pre-stored content to users in unicast mode [4]. PSS reuses and extends IETF protocols such as the Real-Time Streaming Protocol (RTSP) and the Real-time Transport Protocol (RTP/RTCP) for the control of the session and the delivery of media data, respectively. Recently, 3GPP has also specified a solution for streaming media using the HTTP protocol. HTTP streaming enables low-configuration and low-cost media streaming to mobile terminals.

Broadcasting/multicasting: 3GPP provides a broadcast/multicast bearer as well as service building blocks for streaming and file download in the Multimedia Broadcast/Multicast Service (MBMS) [5]. Applications such as mobile TV or ring tone distribution may use the appropriate MBMS building blocks to deliver content to users. PSS and MBMS may also be provided over IMS [6].

Messaging: the Multimedia Messaging Service (MMS) [7] defines a system for sending multimedia messages where all media are combined into a single composite message. Its IMS-based counterpart is the IMS Messaging Service [8].

3GPP carefully selects and specifies the appropriate media codecs for its target services. In this process, it takes into account the advanced needs of the users as well as the scarce network resources that are shared by a large number of users. The latter consideration revolves around the scalability of 3GPP services and is critical for their usability. The former ensures the best possible user experience and high user satisfaction and is, by consequence, critical for the success of the services. Media codecs are defined, e.g. for voice, audio, video, images, graphics and text.

In the following sections, we review the set of media codecs for voice, audio and video currently specified for use in 3GPP services in the latest 3GPP release (Release 9). Section 2.2 covers the voice and audio codecs; Section 2.3 gives an overview of the video codecs.
2.2. Voice and audio codecs
When LTE was introduced in 2008 in 3GPP Release 8, the codecs were simply inherited from UMTS, since their functionality and performance were found suitable also for LTE. For example, the packet-based transmission features, such as jitter management and packet-based error concealment, had already been developed, evaluated and specified for 3GPP codecs in the Multimedia Telephony Service for IMS (MTSI) [3]. The UMTS voice codecs have proved to give good performance over packet-based transmission [9]. Release 9 in 2009 brought new functionalities for the use of the codecs in LTE, such as network-congestion-based media adaptation, but the codecs themselves remained the same.

The Adaptive Multi-Rate (AMR) codec [10,11], also known as AMR Narrowband (AMR-NB), is the default 3GPP voice codec. It must be supported in all voice-capable terminals across UMTS and LTE. The AMR codec is a narrowband voice codec with the conventional telephony audio bandwidth of 200–3400 Hz. Wider audio bandwidth is enabled by the AMR Wideband (AMR-WB) codec [12,13]. It has a bandwidth of 50–7000 Hz and is supported in all UMTS and LTE terminals that provide wideband voice, i.e. use a 16 kHz sampling frequency. The introduction of wideband gives substantially improved voice quality and naturalness. The inclusion of low frequencies from 50 to 200 Hz contributes to increased naturalness, presence and comfort, while the high-frequency extension from 3400 to 7000 Hz provides better fricative differentiation and therefore higher intelligibility. The wider audio bandwidth of AMR-WB adds a feeling of transparent communication and eases speaker recognition.

Both AMR and AMR-WB have been designed to have low encoding delay, making them suitable for real-time conversational services such as MTSI. Both codecs offer a multitude of bit-rates enabling rate adaptation. AMR contains eight bit-rates between 4.75 and 12.2 kbit/s and AMR-WB nine bit-rates between 6.6 and 23.85 kbit/s. The coding mode can be chosen dynamically on a frame-by-frame basis (every 20 ms), e.g. adapting to changes in transport characteristics or network congestion. The lowest bit-rates enable efficient transmission, while the highest bit-rates give the full voice quality potential. The very lowest coding modes are intended to be used only temporarily during severe radio conditions or congestion. Both codecs also contain a source-controlled low bit-rate mode (less than 2 kbit/s) for coding background noise only.
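As a rough illustration of this frame-wise rate adaptation, the following Python sketch selects, for every 20 ms frame, the highest codec mode that does not exceed the rate currently allowed by the network. The mode tables are the standard AMR and AMR-WB bit-rates, but the selection policy, function names and feedback values are illustrative assumptions, not 3GPP-specified behaviour.

# Hypothetical sketch of frame-by-frame AMR/AMR-WB mode adaptation.
AMR_NB_MODES = [4750, 5150, 5900, 6700, 7400, 7950, 10200, 12200]           # bit/s
AMR_WB_MODES = [6600, 8850, 12650, 14250, 15850, 18250, 19850, 23050, 23850]

def select_mode(modes, allowed_rate):
    """Pick the highest codec mode not exceeding the rate currently
    allowed by the network (e.g. signalled via a codec mode request)."""
    candidates = [m for m in modes if m <= allowed_rate]
    return candidates[-1] if candidates else modes[0]

# One 20 ms frame at a time: the sender may change mode every frame.
for frame_no, allowed_rate in enumerate([23850, 23850, 15850, 6600, 12650]):
    mode = select_mode(AMR_WB_MODES, allowed_rate)
    print(f"frame {frame_no}: encode at {mode} bit/s")

In practice the allowed rate would be driven by receiver feedback or congestion signals; the point is simply that the mode decision is local to each 20 ms frame.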
In the non-conversational messaging, streaming and broadcasting/multicasting services, the audio content is often music or mixed voice/music in addition to voice-only. To handle all of these with high quality, specific audio codecs are needed. Two such codecs are recommended in 3GPP: Extended AMR-WB [14,15] and Enhanced aacPlus [16]. These provide good quality audio (for music, speech and mixed content) at 24 kbit/s or below, but for high quality stereophonic music, bit-rates up to 48 kbit/s may be needed. These codecs enable high quality but impose high algorithmic delay, making them unsuitable for telephony use.

Fig. 1 illustrates the performance of 3GPP voice codecs based on subjective tests conducted by Nokia with 64 naïve listeners. The following codecs were tested: the AMR codec, the AMR-WB codec, the GSM Full-Rate (FR) codec and the GSM Half-Rate (HR) codec. All of these are narrowband codecs except AMR-WB. A novel extended-range Absolute Category Rating (ACR) methodology with a 9-point Mean Opinion Score (MOS) scale was used instead of the more conventional 5-point scale to better distinguish the quality levels of several high quality codecs. Grade "1" corresponds to 'Very bad' and "9" to 'Excellent'. The listeners were instructed to evaluate the overall subjective quality and naturalness of the perceived voice. There were no instructions to focus on the audio bandwidth.
Fig. 1. Comparison of the performance of 3GPP voice codecs. The 95% confidence intervals are also shown. (A 9-point ACR scale was used instead of the more conventional 5-point ACR to better distinguish the quality levels of several high quality codecs. Grade "1" corresponds to 'Very bad' and "9" to 'Excellent'.)
Table 1. Summary of video codec support in 3GPP Release 9.

Service | H.264/AVC baseline profile | H.264/AVC high profile | H.263 profile 0 | H.263 profile 3 | MPEG-4 visual simple profile
MTSI | Level 1.1, QCIF@30 Hz | – | Level 45, QCIF@15 Hz, 128 kbps | Level 45, QCIF@15 Hz, 128 kbps | Level 3, CIF@30 Hz, 384 kbps
PSS | Level 1.3, CIF@30 Hz, 768 kbps | Level 3.0, 625SD@25 Hz, 10 Mbps | Level 45, QCIF@15 Hz, 128 kbps | Level 45, QCIF@15 Hz, 128 kbps | Level 3, CIF@30 Hz, 384 kbps
MBMS | Level 1.3, CIF@30 Hz, 768 kbps | – | Level 45, QCIF@15 Hz, 128 kbps | – | –
MMS | Level 1.2, CIF@15 Hz, 384 kbps | – | Level 45, QCIF@15 Hz, 128 kbps | Level 45, QCIF@15 Hz, 128 kbps | Level 3, CIF@30 Hz, 384 kbps
The voice sequences in the tests consisted of samples from male and female speakers obtained from real-life audio capture done indoors (office), inside a moving car (with music from the radio) and outdoors (street). All voice sequences contained varying levels of environmental noise and reverberation.

As seen in Fig. 1, AMR-WB provides a substantial leap in voice quality over the narrowband codecs. Wide deployment of wideband voice using this codec is now taking off in UMTS. In LTE, AMR-WB will be commonly deployed from the start. Further quality enhancements and new functionalities in LTE are expected with the introduction of the Enhanced Voice Service (EVS) codec.

2.3. Video codecs

All multimedia services in 3GPP Release 9 specify support for the baseline profile of H.264/AVC (Advanced Video Coding). H.264/AVC [20] represents the most recent developments in video compression and has, since its definition, enjoyed wide adoption and deployment. Compared to earlier video coding standards such as MPEG-2, H.264/AVC achieves the same objective and subjective video quality at half the bit-rate needed by the MPEG-2 video codec.

H.264/AVC Baseline Profile is recommended for all 3GPP multimedia services. The minimal complexity required from terminals is defined by the level requirement, which has been upgraded in Release 9 to level 1.3 for PSS and MBMS and to level 1.1 for MTSI. In addition to the Baseline Profile, higher quality and resolution video may be offered to mobile terminals with advanced capabilities by deploying the High Profile of H.264/AVC. When the High Profile is used, mobile terminals are expected to support level 3.0, which allows standard definition video, i.e. 720 × 576 at 25 Hz.

H.263 Baseline Profile is the mandatory video codec to be supported in terminals for all 3GPP multimedia services (except MBMS). The continued support for H.263 is mainly due to legacy issues. Level 45 is set as the minimal mandatory requirement for all terminals. Finally, the Simple Profile of MPEG-4 Visual and Profile 3 of H.263 have also been recommended for use by 3GPP multimedia services (except MBMS), as they provide some additional coding tools compared to the H.263 Baseline Profile. Due to the multicast/broadcast nature of the service and the lack of capability negotiation, only support of the H.264/AVC baseline profile has been defined for MBMS. Table 1 summarizes the video codec support as well as the resulting video characteristics in LTE Release 9.

3. Voice coding for LTE

3.1. Background and drivers

A feasibility study on Enhanced Voice Service (EVS) for LTE has recently been finalised in 3GPP, with the results given in Technical
Report 22.813, "Study of Use Cases and Requirements for Enhanced Voice Codecs in the Evolved Packet System (EPS)" [17]. EVS is intended to provide substantially enhanced voice quality for conversational use, i.e. telephony. Improved transmission efficiency and optimised behaviour in IP environments are further targets. EVS also has potential for quality enhancement for non-voice signals such as music.

The EVS study, conducted jointly by the 3GPP SA4 (Codec) and SA1 (Services) working groups, identifies recommendations for the key characteristics of EVS (system and service requirements, and high-level technical requirements on codecs). The study further proposes that the development and standardization of a new EVS codec for LTE be started. The codec is targeted to be developed by March 2011, in time for 3GPP Release 10.

Fig. 2 illustrates the concept of EVS. The EVS codec will not replace the existing 3GPP narrowband and wideband codecs AMR and AMR-WB, but will provide a complementary high quality codec via the introduction of higher audio bandwidths, in particular super wideband (SWB: 50–14,000 Hz). It will also support narrowband (NB: 200–3400 Hz) and wideband (WB: 50–7000 Hz) and may support fullband audio (FB: 20–20,000 Hz). The following sub-sections discuss the most important enhancements that the EVS codec is expected to bring. The discussion is based on the developments in 3GPP, but it also reflects the views of the authors of this paper.¹

3.1.1. Quality of user experience

The main driver for EVS is to significantly improve the user experience (naturalness of communication) for both two-party and conference calls. High voice quality is especially important in conference calls for better speaker recognition and to avoid listener fatigue during long calls. In addition, new functionalities such as stereo and spatial localisation of the participants could greatly enhance the user experience. Achieving consistently high quality in multiparty calls is, however, challenging, since the transmission may take place over heterogeneous networks (allowing different transmission bandwidths and using different codecs), participants may be using terminals with different capabilities (e.g. with regard to audio bandwidth), and the acoustic and background noise environments from which the calls are made may vary considerably.

Widening the audio bandwidth is a key vehicle to achieve significant improvement in voice quality. The impact of audio bandwidth on voice quality is illustrated in Fig. 3. The figure shows results from a 9-point ACR listening test, carried out by Nokia, where the listeners were presented with the same voice samples at different audio bandwidths. The test configuration and voice

¹ At the time of finalising this paper (March 2010), the EVS study had just been completed in 3GPP and the EVS codec development phase had started. The codec development phase will set the detailed performance requirements and design constraints for the EVS codec.
Fig. 2. Illustration of the introduction of EVS to LTE. (a) The voice codecs defined for UMTS and LTE in 3GPP Release 8 and 9. (b) The voice codecs for LTE in the future releases. (Support of AMR and AMR-WB codecs is required in terminals across the 3GPP multimedia services. AMR-WB needs to be supported only in terminals offering wideband voice communication at 16 kHz sampling frequency. The use of other codecs is not excluded.)
Fig. 3. Impact of different audio bandwidths on voice quality.
samples were the same as those used for testing the 3GPP codecs and are explained in Section 2.2. There is a drastic increase in the user experience when going from narrowband (NB) to wideband (WB) audio representation, and again a significant improvement when going from wideband to super wideband (SWB). Widening the audio bandwidth further from super wideband to fullband (FB) seems to have less impact, and the perceived voice quality gradually saturates.

In addition to widening the audio bandwidth, improved user experience can be obtained through the introduction of new functionalities such as stereo and multi-channel coding (e.g. surround sound). Multi-channel audio capture and reproduction can convey a significant amount of information about the caller's environment. For example, the audio scene can include position information of the teleconference participants. The positioning can be either naturally captured or artificially rendered at the teleconference bridge. Spatial audio can also be used to convey the real audio environment of the caller to the listener. For example, grandparents can listen to their grandchildren playing in their home environment. This creates a much more intimate feeling than current telephony. Overall, people are becoming more aware of quality of experience. For example, high definition television (HDTV) is gaining ground, so it would be quite natural to follow the same trend in the telephony service. The target should be nothing less than the equivalent of face-to-face conversational quality.

The 3GPP EVS study gives explicit performance recommendations for the EVS codec [17]. The EVS codec should achieve significantly better voice quality than what is possible in 3GPP Release 9. It should obtain enhanced quality and coding efficiency for narrowband and wideband voice, and enhanced quality through the introduction of super wideband voice. In addition, stereo/multi-channel coding capability is seen as a way to realize significantly improved quality of experience. The choice between dedicated stereo/multi-channel coding and multiple monophonic codings depends on a trade-off between achievable quality, available bit-rates, available delay, complexity and other implementation factors.
3.1.2. Mixed content and music

It is not uncommon that the caller needs to listen to music on hold, or to messages (informative or advertising) with mixed voice and music content. High quality coding for these, comparable to that obtained by the 3GPP audio codecs, would therefore be desirable also for telephony. The ability to reproduce high quality music as a background signal would also convey the ambiance better. Hence, the EVS study recommends that the EVS codec bring significant improvement not only for voice but also for music and mixed content in conversational services [17]. The quality enhancement should not be limited to voice only.

3.1.3. Backward service interoperability

The introduction of EVS brings the challenge of how to guarantee interoperability with 3GPP networks and devices of earlier releases (that are based on legacy voice codecs such as AMR-WB), and mobility when moving between cells with and without LTE support. User experience should not be compromised by the use of transcoding, since it reduces voice quality due to the extra coding and increases transmission delay. Finding a common codec end-to-end, and hence entirely avoiding transcoding, cannot always be guaranteed. Therefore, the EVS codec should be interoperable with the existing 3GPP codecs. Considering the target of high voice
quality in LTE, the most obvious approach is to make EVS interoperable with the AMR-WB codec. This ensures high voice quality with seamless backwards compatibility and enables straightforward migration from the existing voice services towards new functionalities and service provision with LTE. The EVS study [17] recommends that the EVS codec achieve backward interoperability with the existing 3GPP AMR-WB codec by supporting the AMR-WB codec modes used in 3GPP conversational speech telephony services. These AMR-WB interoperable operation modes of the EVS codec may be either identical to those in the AMR-WB codec or different but bit-stream interoperable with them. The interoperable modes are foreseen to become an alternative implementation of the AMR-WB codec (i.e. allowed to be used instead of the AMR-WB codec), provided that the enhancements over AMR-WB are consistently significant.

3.1.4. Efficiency

Although LTE provides improved transmission efficiency and also allows higher data rates, it is still important that enhanced voice codecs enable efficient use of transmission resources in access and transport networks [17]. EVS should exceed the transmission efficiency of the existing wideband voice service. New codecs should also allow implementation on low-cost devices and network equipment with limited computational resources [17]. Any increase in computational complexity and memory usage must be justified by improved user experience or increased transmission efficiency. The main objectives of the EVS codec development are summarised in Table 2.

3.2. Algorithms

This section discusses some prominent algorithms for the EVS codec, as seen by the authors of this paper.

3.2.1. Super wideband and fullband

Widening the audio bandwidth to super wideband should be at the core of the quality improvements in EVS, as explained in Section 3.1.1. Spectral replication techniques provide efficient means to extend from wideband to super wideband, and also from super wideband to fullband. They produce the high-frequency content by replicating and suitably modifying the content of the lower frequencies. Carrying out spectrum replication in the Modified Discrete Cosine Transform (MDCT) domain forms an efficient bandwidth extension algorithm, as described in [18]. Using the MDCT makes it relatively easy to detect similar shapes and patterns in the spectrum for high quality replication. The high-frequency half of the signal (to be produced by spectrum replication) is split into several non-overlapping subbands, and for each of them the best match is searched from the quantized and synthesized signal of the low-frequency half. With suitable scaling applied to each selected subband, the high-frequency half can be reproduced from the low-frequency half without the need to transmit the actual high-frequency half of the signal. In [18], the search for the best match is done in two steps to facilitate an optimal match: first in the linear domain to match the spectral amplitude peaks, and then in the logarithmic domain to provide a perceptually better match to the finer details of the spectral shape.

Highly periodic tonal signals that have clear energy peaks would need an unrealistically high number of subbands for accurate spectrum replication. Therefore, a different approach is employed for them: instead of replicating subbands, their high-frequency half is represented by suitably scaled sinusoids. This sinusoidal representation of the high-frequency half can also be employed as an enhancement on top of subband replication. The MDCT method is easily scalable, and it is thus quite straightforward to utilize for successive extensions from wideband to super wideband and further from super wideband to fullband. Extending wideband to super wideband requires about 4–8 kbit/s, and reaching fullband from wideband about 12–16 kbit/s. Fig. 4a illustrates the general concept of spectrum replication, which creates the extended bandwidth by using the lower frequency components of the spectrum. Fig. 4b shows the original spectrum and the one produced by spectrum replication ("processed") using the MDCT method described above.
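To make the subband matching idea concrete, the following Python sketch shows a heavily simplified, single-step version of such an encoder and decoder. It is not the method of [18], which uses the two-step linear/logarithmic search described above; the function names, subband count and least-squares gain rule here are illustrative assumptions.

import numpy as np

def encode_high_band(low_half, high_half, n_subbands=8):
    """Encoder side: for each subband of the high-frequency MDCT half,
    find the best-matching segment in the quantized low-frequency half
    plus a gain.  Only (lag, gain) pairs need to be transmitted,
    not the high-band coefficients themselves."""
    sb = len(high_half) // n_subbands
    params = []
    for k in range(n_subbands):
        target = high_half[k * sb:(k + 1) * sb]
        # One-step correlation search over all candidate lags.
        best = max(range(len(low_half) - sb + 1),
                   key=lambda lag: abs(np.dot(low_half[lag:lag + sb], target)))
        src = low_half[best:best + sb]
        gain = np.dot(src, target) / (np.dot(src, src) + 1e-12)
        params.append((best, float(gain)))
    return params

def decode_high_band(low_half, params, sb):
    """Decoder side: replicate and scale the signalled low-band segments."""
    return np.concatenate([g * low_half[lag:lag + sb] for lag, g in params])

In a real codec, the low half is the decoded wideband MDCT spectrum, and the subband size, search range and gain quantization are carefully tuned; tonal frames would instead be signalled as scaled sinusoids, as described above.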
3.2.2. Stereo and multi-channel

Real-time (conversational) multi-channel content capture may be performed with an arbitrary multi-microphone set-up, or the content may be mixed together at the teleconference server. The former case implies multi-channel encoding in the terminal, while in the latter case a network-based multi-channel, multi-user server handles the mixing and encoding of the content. Both should be handled well in EVS. The rendering scheme for the multi-channel content needs to be flexible enough to allow stereo, multi-loudspeaker and binaural headphone representation.

Parametric representation, such as binaural cue coding (BCC) [19], is an efficient approach for stereo and multi-channel audio coding. It also allows flexible rendering for both arbitrary multi-loudspeaker and binaural headphone listening. Parametric coding of stereo and multi-channel audio is inherently scalable, since the bit-stream typically consists of monophonic audio (encoded by an audio or voice codec) and of spatial side information (that carries the representation of the stereo and multi-channel audio image). Therefore, when rendering the audio to a single loudspeaker using only the mono content, the decoder or even the network may ignore the side information and hence save computational resources and channel bandwidth. The parameterization of the spatial audio image is usually independent of the input and output channel configuration. Since the parameterisation represents the audio image, i.e. the location and coherence of the audio sources around the caller, the content can be rendered into any output configuration and any number of output channels.

The parametric stereo and multi-channel audio representation is typically created by decomposing the signal into the time–frequency domain, in which the inter-channel level and time differences as
Table 2. Summary of the main objectives for the EVS codec.

EVS codec development objectives:
– Enhanced quality and coding efficiency for narrowband (NB: 200–3400 Hz) and wideband (WB: 50–7000 Hz) speech.
– Enhanced quality by the introduction of super wideband (SWB: 50–14,000 Hz) speech.
– Enhanced quality for mixed content and music in conversational applications (e.g. in-call music).
– Robustness to packet loss and delay jitter, leading to optimised behaviour in IP environments.
– Backward interoperability with AMR-WB through EVS modes supporting the AMR-WB codec format. The AMR-WB interoperable operation modes of the EVS codec may be either identical to those in the AMR-WB codec or different but bit-stream interoperable with them. They may become an alternative implementation of the AMR-WB codec, provided that the enhancements are consistently significant.
well as correlation cues are determined. Such a parameterisation enables very low bit-rate scalable multi-channel coding. The granularity of the time–frequency transform and the number and resolution of the parameters affect the overall bit-rate as well as the quality of experience. With a bit-rate of 4–20 kbit/s, parametric stereo and multi-channel coding can support a variety of enhancements, ranging from wideband stereo voice to fullband spatial audio. Parametric coding is the most prominent approach for EVS to support multi-channel audio and bit-rate scalability.
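The following Python sketch illustrates the flavour of such a parametric analysis for one frame: a mono downmix plus per-band inter-channel level differences. It is a toy example, not the BCC algorithm of [19]; real systems also extract inter-channel time differences and coherence cues, and the band layout and downmix rule here are illustrative assumptions.

import numpy as np

def parametric_stereo_analyze(left, right, n_bands=20, fft_size=1024):
    """Toy analysis of one stereo frame: a mono downmix plus per-band
    inter-channel level differences (ILD, in dB)."""
    L = np.fft.rfft(left, fft_size)
    R = np.fft.rfft(right, fft_size)
    edges = np.linspace(0, len(L), n_bands + 1, dtype=int)
    ild_db = []
    for b in range(n_bands):
        el = np.sum(np.abs(L[edges[b]:edges[b + 1]]) ** 2) + 1e-12
        er = np.sum(np.abs(R[edges[b]:edges[b + 1]]) ** 2) + 1e-12
        ild_db.append(10 * np.log10(el / er))
    downmix = 0.5 * (left + right)     # encoded with the mono core codec
    return downmix, ild_db             # side info: a few bits per band

A mono-only receiver simply decodes the downmix and ignores the side information, which is exactly the scalability property discussed above.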
It should be noted that the use of stereo and multi-channel signals for conversational purposes implies significant changes in audio capture and reproduction compared to existing devices and services. First, a multi-microphone set-up is needed in the terminal. Second, since the intention is to capture the spatial audio around the user, the noise suppression algorithms need to be tuned to pass through the real audio environment of the caller. The most advanced noise suppression algorithms used in today's terminals are so efficient that it is often difficult to hear whether the caller is in a moving car or in an office. Fewer changes are needed on the rendering side, since terminals usually support stereo playback. For surround sound, multi-channel rendering needs to be incorporated for multi-loudspeaker arrangements and for binaural listening.

3.2.3. Interoperability and scalability

Interoperability can best be achieved via bit-stream compatibility with AMR-WB, as explained in Section 3.1.3. To enable use over heterogeneous networks and with different terminal capabilities, the authors believe that the EVS codec should also provide scalability by using an embedded structure. This means that the codec will be based on a core layer and several enhancement layers built on top of it, with the core layer being bit-stream interoperable with the AMR-WB codec (with all nine AMR-WB codec modes). The enhancement layers will bring new functionalities such as bandwidth extension (super wideband and fullband) and spatial extension (stereo and multi-channel) and the associated enhanced
Fig. 4. (a) Illustration of spectral replication. (b) The original spectrum and the one produced by spectrum replication ("processed") using the MDCT method with eight subbands.
quality. An example of such an embedded codec structure is shown in Fig. 5.

The bandwidth extension layers and the spatial extension layers are independent of each other. That is, the received bit-stream may contain, in addition to the core layer, any combination of bandwidth extension and spatial extension layers, and the decoder may discard any number of such extension layers from the bit-stream and still be able to fully decode voice based on the selected lower layers. The decoder or the network may discard certain functionalities to adapt to the terminal or transmission capabilities or to increase battery life-time; for example, the stereo and multi-channel extension layers can be discarded when mono audio rendering is selected (a minimal sketch of this layer dropping is given at the end of this subsection). The scalability works similarly on the encoding and capture side: the encoder may choose to avoid high sampling rates (i.e. bandwidth extension algorithms) or multi-channel capture due to terminal or network constraints, or based on the targeted quality of service.

The bandwidth extension can be understood as the basic quality-of-experience scalability. Stereo and multi-channel/binaural extensions represent an additional dimension of spatial scalability. Computational complexity typically increases when enhancement layers are included; hence, complexity scalability is a further aspect of the presented codec structure. The quality-of-experience, spatial and computational complexity scalabilities should be elementary functionalities of the EVS codec. Fig. 6 shows examples of layer configurations for EVS from the lowest to the highest quality when using the AMR-WB core codec and bandwidth and spatial extensions. These examples have been chosen to allow the best quality and functionality for each bit-rate.
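The following Python sketch illustrates the layer-dropping idea under the assumption of a hypothetical embedded frame format; the layer names and payload layout are illustrative only and do not correspond to any specified EVS bit-stream.

# Hypothetical embedded frame layout; the layer names are illustrative.
LAYER_ORDER = ["amr_wb_core", "swb_extension", "fb_extension",
               "stereo_extension", "multichannel_extension"]

def truncate(frame_layers, keep):
    """Keep the core plus only the named extension layers.  A server,
    gateway or decoder may do this without any re-encoding."""
    return {name: data for name, data in frame_layers.items()
            if name == "amr_wb_core" or name in keep}

frame = {name: b"..." for name in LAYER_ORDER}       # placeholder payloads
mono_swb = truncate(frame, keep={"swb_extension"})   # drop spatial layers
print(sorted(mono_swb))  # ['amr_wb_core', 'swb_extension']

Because the core layer remains AMR-WB interoperable, the stream stays decodable by legacy endpoints no matter how many extension layers are stripped along the path.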
Fig. 6. Examples of possible layer configurations from the lowest to the highest quality, from left to right.
3.3. User experience

The performance that can be expected from EVS was assessed based on Nokia's proprietary EVS codec research. Several codec versions, employing various configurations and different functionalities, were tested, and the results are summarised in Fig. 7. Note that the expected EVS performance curve shown does not represent specific measurement points for a single codec version; rather, it collectively illustrates the minimum performance that can be expected from the future EVS codec based on the tests. All tested codec versions and configurations use the embedded structure around the AMR-WB core. They provide the essential enhancements identified in Section 3.1 and use the enablers and algorithms explained in Section 3.2.

The conditions under test were the audio bandwidth (WB, SWB and FB), the number of channels (mono, stereo) and a variety of bit-rates (up to 48 kbit/s). ACR with a 9-point MOS scale was used. The test configuration and voice samples were the same as those used for testing the existing 3GPP codecs and are explained in Section 2.2. The listeners were instructed to evaluate the overall subjective quality and naturalness of the perceived voice. There were no instructions to focus on the audio bandwidth or the spatial image. The experiments were carried out using a binaural headset to accommodate the stereo content.

As seen in Fig. 7, EVS can provide a substantial improvement when super wideband is introduced. Applying bandwidth extension, high quality monophonic super wideband audio can be provided at bit-rates of about 20 kbit/s and above. EVS with SWB mono outperforms EVS with WB stereo, indicating the significant role that widening the
Fig. 5. Embedded codec based on AMR-WB core with integrated bandwidth and spatial extension.
Fig. 7. The expected performance of the EVS codec. Horizontal lines represent the quality measured for direct narrowband, wideband, super wideband and fullband reference signals (16-bit PCM).
bandwidth has for voice quality. However, care should be taken when drawing conclusions, since the benefit of stereo depends greatly on the type of test material. With a larger proportion of content where the stereo image matters more (e.g. a teleconference with several speakers), WB stereo could have been rated higher. EVS with super wideband stereo and fullband stereo gives further enhancement, but these are obtainable only at bit-rates of around 40 kbit/s and upwards.
4. Video coding for LTE

4.1. Drivers for improved video in LTE

The capabilities of 3GPP devices are evolving steadily as a result of the technological developments affecting the different components of those devices. With the development of LTE, a significant boost is given to one of the main 3GPP device components: data network connectivity. Data rates of up to 300 Mbit/s on the downlink and 75 Mbit/s on the uplink become feasible, although reaching these maximum rates requires usage of all high-complexity LTE system tools and exclusive cell access by a single receiver. These high data rates are attributable to the newly possible system tools and configurations, briefly described below:

Scalable channel bandwidth: the channel bandwidth may be selected from a wide range of possible values, from 1.4 MHz up to 20 MHz.

Modulation: 64QAM modulation is possible on both downlink and uplink.

Multiple Input Multiple Output (MIMO): spatial multiplexing is enabled with MIMO antenna technology and may be used in different modes in LTE, up to the 4 × 4 MIMO mode.

New multiple access schemes: Orthogonal Frequency Division Multiple Access (OFDMA) is used in the downlink and Single Carrier FDMA in the uplink.

Starting from Release 9, LTE supports the deployment of broadcast bearers for the delivery of multimedia services to large audiences at high bit-rates. This enhanced MBMS (eMBMS) is currently functionally limited to the broadcast mode over Single Frequency Networks (SFN), where all transmission cells are tightly
synchronized. Support for the multicast mode over multiple-frequency networks will follow in later releases.

As a prime target service of LTE eMBMS, mobile TV will benefit from significantly higher bit-rates and will then be able to provide high quality content for an improved user experience. This development is also supported by the continuously evolving processing and display capabilities of mobile devices. Release 9 mobile devices will show a wide variety of display resolutions, ranging from QVGA (320 × 240) to VGA (640 × 480) and WVGA (800 × 480). Their media decoding capabilities will vary accordingly. The resulting heterogeneity poses significant challenges for 3GPP multimedia service providers: trying to cater for low capability device users and at the same time satisfy owners of high capability mobile devices with a single content representation is bound to fail.

However, with LTE a significantly higher downlink channel capacity will be available. For example, a typical eMBMS SFN network may be able to make a downlink bandwidth of up to 12 Mbit/s available at a spectral efficiency of 1 bit/s/Hz and a channel bandwidth of 20 MHz (given that only 6 out of 10 sub-slots may be allocated to eMBMS traffic). This may provide sufficient bandwidth for about 12 VGA (at 1000 kbit/s) or 30 QVGA (at 384 kbit/s) mobile TV channels; the arithmetic is spelled out below. Service providers may then offer multiple versions of the same content simultaneously, thus providing nearly personalized services that are tailored to a wide variety of device capabilities and user needs. This comes at the cost of increased bandwidth consumption.
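The channel-count estimate above can be reproduced with a few lines of arithmetic; the Python snippet below simply restates the numbers from the text.

channel_bw_hz  = 20e6      # 20 MHz channel
spectral_eff   = 1.0       # bit/s/Hz
embms_fraction = 6 / 10    # 6 of 10 sub-slots allocated to eMBMS

embms_rate = channel_bw_hz * spectral_eff * embms_fraction
print(embms_rate / 1e6)              # 12.0 Mbit/s of eMBMS capacity
print(int(embms_rate // 1_000_000))  # 12 VGA channels at 1000 kbit/s
print(int(embms_rate // 384_000))    # 31, i.e. about 30 QVGA channels at 384 kbit/s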
Recognizing this trend, 3GPP has started studying the use cases for advanced video support and is evaluating different video codecs against the selected use cases. The goal of the work is to cater for advanced terminals with higher device capabilities in a bandwidth-efficient manner. As a result of the Release 9 work, a technical report [21] that collects the identified use cases and solutions has been produced. The evaluation of the solutions for the use cases is expected to be performed as part of Release 10. The remainder of Section 4 discusses the potential improvements for LTE in Release 10 that are achievable through the introduction of scalable video coding, which emerges as a suitable candidate technology to address device capability heterogeneity by providing video encoding tools that generate bit-streams consumable in different configurations.

4.2. Scalable video coding²

Scalable video coding has been widely investigated in academia and industry during the past 20 years. However, before the standardization of the H.264/SVC (Scalable Video Coding) extension [20], scalable video coding was always linked to increased complexity and a drop in coding efficiency compared to non-scalable video coding. Hence, scalable video coding was rarely used, and alternative techniques were preferred, such as simulcast, i.e. the simultaneous transmission of two or more independent single-layer streams, which provides functionality similar to scalable video coding. Though simulcast causes a significant increase in the resulting total bit-rate, it carries no complexity penalty. The H.264/SVC extension has already proven to be a commercially attractive solution and is being adopted by several system and application standards (e.g. DVB). This is on the one hand due to a real demand for scalability, incurred by the increasingly heterogeneous consumer landscape.

² Section 4.2 reuses parts of TR 26.903 [21]. © 2009. 3GPP™ TSs and TRs are the property of ARIB, ATIS, CCSA, ETSI, TTA and TTC, who jointly own the copyright in them. They are subject to further modifications and are therefore provided to you "as is" for information purposes only. Further use is strictly prohibited.
On the other hand, the success of H.264/SVC is certainly also attributable to the technical merits of the solution. H.264/SVC was designed to address the earlier shortcomings of scalable video coding solutions, namely increased complexity and a significant compression efficiency penalty (compared to similar single-layer compression). The former is addressed by imposing base layer compatibility with H.264/AVC streams as well as single-loop decoding; the latter by a sophisticated set of inter-layer prediction tools, described in detail below. H.264/SVC supports three different types of scalability: spatial scalability, temporal scalability, and quality scalability.

Temporal scalability is realized using the reference picture selection flexibility already present in H.264/AVC as well as bi-directionally predicted pictures (B-pictures). The prediction dependencies of B-pictures are arranged in a hierarchical structure. Furthermore, appropriate rate control is used to adjust the bit budget of each picture to be proportional to its temporal importance, in a procedure called quantization parameter cascading. The slightly and gradually reduced picture quality of the hierarchical B-pictures has proven not to significantly impact the subjective quality and the watching experience, while yielding high compression efficiency. Fig. 8 shows an example of the realization of temporal scalability using hierarchical B-pictures. The example shows four different temporal levels, resulting in one base layer, which consists of only Intra pictures (I) and unidirectionally Predicted pictures (P), and three temporal enhancement layers constructed by the B-pictures. This allows the frame rate to be scaled down by a factor of up to 8 (e.g.
from 60 Hz to 7.5 Hz). Unfortunately, this approach has the drawback of a relatively high decoding delay that grows exponentially with the number of temporal layers, since the pictures have to be decoded in a different order than the display order. As the coding gain also diminishes with an increasing number of hierarchy levels, it is not appropriate to generate a high number of temporal layers. An alternative to the above-mentioned approach for temporal scalability is the use of low-delay uni-directional prediction structures, which avoid out-of-display-order decoding at the cost of reduced coding efficiency. A minimal sketch of the dyadic layer structure is given below.
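The following Python sketch illustrates the dyadic temporal hierarchy: each picture is assigned a temporal layer, and dropping all layers above a chosen one scales the frame rate down by a power of two. The layer-assignment rule shown is a common textbook formulation of such hierarchies, not code from the SVC specification.

def temporal_layer(pic_index, n_layers=4):
    """Temporal layer of a picture in a dyadic hierarchy with GOP size
    2**(n_layers - 1): layer 0 holds the I/P key pictures, and the
    higher layers hold the hierarchical B-pictures."""
    gop = 2 ** (n_layers - 1)
    pos = pic_index % gop
    if pos == 0:
        return 0                    # key picture (I or P)
    layer = n_layers - 1
    while pos % 2 == 0:             # even positions sit lower in the hierarchy
        pos //= 2
        layer -= 1
    return layer

# Dropping all pictures above a chosen layer scales the frame rate:
# layer {0}: 60/8 = 7.5 Hz; {0,1}: 15 Hz; {0,1,2}: 30 Hz; all: 60 Hz.
kept = [i for i in range(16) if temporal_layer(i) <= 1]
print(kept)   # [0, 4, 8, 12] -> every 4th picture, i.e. 15 Hz from 60 Hz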
Fig. 8. Temporal scalability with hierarchical B-picture structure in H.264/SVC.
Spatial scalability is the most important scalability type in H.264/SVC. It enables encoding a video sequence into a video bit-stream that contains one or more subset bit-streams, each of which provides the video at a different spatial resolution. The spatially scalable video caters for the needs of different consumer devices with different display capabilities and processing power. Fig. 9 depicts an example of a prediction structure for spatial scalability (QCIF to CIF resolution) that shows the cross-layer prediction relationships within one Group of Pictures (GOP), which starts with a key picture that is independently decodable at all layers. The spatial scalability layer is enhanced with an additional temporal scalability layer that doubles the frame rate at the CIF resolution.

Fig. 9. Example prediction structure for spatial scalability.
Fig. 10. Inter-layer prediction modes.

H.264/SVC defines three different inter-layer prediction modes (see Fig. 10) that are designed to enable single-loop, low-complexity decoding: motion compensation is performed only once, at the target layer, in the decoder. The
inter-layer prediction tools are inter-layer INTRA (texture) prediction, inter-layer motion prediction, and inter-layer residual prediction. Inter-layer INTRA prediction enables texture prediction from the base layer at co-located macro-blocks (after up-sampling). It is restricted to INTRA-coded macro-blocks in the lower layer. The up-sampling of the macro-block texture is performed using well-specified up-sampling filters (a 4-tap filter for Luma samples and a bi-linear filter for Chroma samples). Inter-layer motion prediction predicts the motion information of the target layer from the co-located INTER-coded macro-block of the lower layer (after up-sampling). The prediction involves all components of the motion information: the macro-block partitioning structures, the reference picture indices, and the x and y components of the motion vectors. Finally, inter-layer residual prediction allows prediction from the residual remaining after INTER-prediction at the lower layer. At the decoder side, the residual information of the target layer is built up by summing the appropriately up-scaled residuals of the lower dependent layers.

The third scalability type in H.264/SVC is quality scalability, which enables different operation points, each yielding a different video quality. Coarse Grain Scalability (CGS) is a form of quality scalability that uses the same tools as spatial scalability, hence operating in the spatial domain. Alternatively, Medium Grain Scalability (MGS) may be used to achieve quality scalability by performing the inter-layer prediction in the transform domain. Two techniques are advocated for MGS: splitting the transform coefficients over several slices, and encoding the difference between transform coefficients quantized with different quantization parameters. MGS significantly reduces the complexity at the encoder and decoder. The coefficient-splitting idea is sketched below.
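As a toy illustration of the first MGS technique, the following Python sketch splits the scan-ordered transform coefficients of one 4 × 4 block into slices carried in different MGS layers; the split points are an example configuration, not normative values.

def split_mgs(coeffs, split_points=(4, 8)):
    """Split one block's scan-ordered coefficients into MGS slices:
    [0:4) -> base quality, [4:8) -> first enhancement, [8:16) -> second."""
    bounds = (0, *split_points, len(coeffs))
    return [coeffs[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]

block = [35, -7, 4, 3, 2, -2, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0]   # zigzag order
layers = split_mgs(block)
# Dropping the last slice coarsens quality, but the block still decodes:
reconstructed = layers[0] + layers[1] + [0] * len(layers[2])

Because the dropped coefficients are the high-frequency ones, discarding an MGS slice trades fine detail for bit-rate, which is what makes this form of quality scalability attractive for rate adaptation.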
CGS may be seen as a variant of spatial scalability where the spatial scaling factor is set to one. Quality scalability may be used to address different use cases, such as rate adaptation or offering a high quality pay service. Table 3 summarizes the tools that SVC offers to enable the different modes of scalability.

4.3. Performance evaluation

In this section, we evaluate the performance of H.264/SVC and show how it matches the selected use cases of 3GPP multimedia services over LTE [21]. The following figures depict evaluation results comparing the performance of simulcast to SVC. The results are extracted from evaluations at different bit-rates, ranging from 150 to 300 kbit/s for the low resolution (QVGA) and from 400 to 1500 kbit/s for the high resolution (VGA). Temporal scalability between the low resolution and the high resolution video is also employed, resulting in a frame rate enhancement from 12.5 to 25 Hz (frames per second).

Considering the broadcast use case, the frequency of random access points (RAP), which are used to start correct decoding, has been configured to meet fast tune-in and channel switching requirements. For these purposes, SVC video receivers may benefit from fast random access at the low resolution video while improving the compression efficiency by reducing the RAP frequency at the higher layer. In this evaluation, the RAP frequency for the low resolution video is set to 8 (every eighth picture is encoded as a RAP) and to 16 or 80 for the high resolution video. The evaluation also covers the equivalent baseline profile (BP) and high profile (HP) variants of H.264/AVC. H.264/SVC is used at the scalable baseline profile (SB), which requires the base layer
Table 3. Summary of tools that SVC offers to enable the different modes of scalability.

 | Temporal scalability | Spatial scalability | Quality scalability
Tools | Hierarchical B-pictures | Inter-layer prediction: inter-layer INTRA, inter-layer motion and inter-layer residual predictions | Coding of the residual (CGS); coding and/or splitting of transform coefficients (MGS)
Use cases | Different frame rates resulting in increasingly smoother video | Appropriate video resolution and complexity for the target device | Fine tuning of the rate to statically or dynamically adapt the video to the available bit-rate
to be compliant with the H.264/AVC baseline profile. This is a crucial requirement, since it guarantees backwards compatibility with existing devices that support the H.264/AVC baseline profile. The video sequences are initially at VGA@25 Hz resolution. The target is to achieve temporal and spatial scalability resulting in a low quality video of resolution
QVGA@12.5 Hz that addresses the needs of legacy and low complexity terminals, while at the same time providing advanced terminals with high quality video of resolution VGA@25 Hz. SVC is compared against the H.264/AVC Baseline Profile (BP) for the QVGA video, and against the H.264/AVC Baseline Profile (BP) and High Profile (HP) for the high quality video. The simulcast performance is obtained by summing the bit-rates required for the QVGA and the VGA video at the specific quality level. Consequently, two result sets are extracted for simulcast: BP-BP, which uses H.264/AVC BP for both the QVGA and the VGA video; and BP-HP, which uses H.264/AVC BP for the QVGA video and H.264/AVC HP for the VGA video.
The results show significant gains for SVC compared to simulcast of the low and high quality video. The simulcast results are calculated as the operation points representing the achieved video quality at the sum of the bit-rates of the low and high quality video. Fig. 11 compares the different video compression modes for the City MPEG test sequence. Fig. 11a shows performance results for the low quality video of resolution
QVGA@12.5 Hz and Fig. 11b shows performance results for the high quality video of resolution VGA@25 Hz and for simulcast. Fig. 11a compares the baseline profile to the SVC base layer (with RAP frequencies of 16 and 80 at the enhancement layer). Slight objective quality gains (on the Luminance component Y) are seen for the H.264/AVC baseline profile (BP) compared to SVC. This may be explained by the slight signaling overhead introduced by SVC. The simulcast implementation assumes separate encoding of the QVGA and VGA videos. The two tested SVC configurations have very similar performance at the
Fig. 11. Objective video quality of SVC and simulcast for the City video sequence. (a) Performance results for the low quality video of resolution QVGA@12.5 Hz. (b) Performance results for the high quality video of resolution VGA@25 Hz and for simulcast.
Fig. 12. Objective video quality of SVC and simulcast for the Soccer video sequence. (a) Performance results for the low quality video of resolution QVGA@12.5 Hz. (b) Performance results for the high quality video of resolution VGA@25 Hz and for simulcast.
base layer. Fig. 11b shows a significant right shift of the rate-distortion curves of the simulcast implementations (BP-BP and BP-HP) compared to the SVC implementations (SVC with RAP periods 8/16 and 8/80). By quantifying this right shift, the bit-rate saving potential of SVC can be assessed. The savings range from approximately 20% (between SVC 8/16 and simulcast BP-HP) up to 40% (between SVC 8/80 and simulcast BP-BP) for the City sequence.

Fig. 12 depicts similar results for the Soccer MPEG test sequence. A similar small compression penalty is observed for the QVGA video. The curves in Fig. 12b show slightly lower bit-rate savings for SVC than those in Fig. 11b for the VGA video. Because of the high-motion nature and the frequent scene changes of the Soccer sequence, no significant differences can be observed for SVC between the different RAP frequencies of the VGA video. Significant bit-rate savings compared to the simulcast implementations can be observed for both test sequences. The results reflect the bit-rate saving potential brought by SVC in multicast and broadcast scenarios, where terminals are expected to have mixed requirements (some requesting the low quality video and others requesting the high quality video). This makes H.264/SVC well suited to address the new needs of 3GPP services that are designed to serve basic and advanced terminals simultaneously.
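The quoted savings follow directly from comparing the rates at equal quality. The Python snippet below shows the computation, using illustrative rate values chosen only to reproduce a 20% saving, not values read off Fig. 11.

def svc_saving(rate_svc, rate_low, rate_high):
    """Simulcast must carry both representations; SVC carries one
    scalable stream that provides both."""
    return 1.0 - rate_svc / (rate_low + rate_high)

# Illustrative rates in kbit/s (QVGA + VGA simulcast vs. one SVC stream):
print(f"{svc_saving(1200, 250, 1250):.0%}")   # 20%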
5. Conclusion

The evolution of 3GPP to LTE provides new possibilities to introduce flexible and scalable multimedia services with substantial improvements in voice and video quality. The increased bit-rate and low latency of LTE transport enable new functionalities and improved quality of service. Scalable coding technology may be exploited to provide consistently high quality over heterogeneous network conditions, to enable quality enhancements for a variety of terminals with different capabilities, to improve transmission efficiency, and to enable backwards compatibility with legacy services and codecs.

Through the use of wider audio bandwidths, the voice quality in 3GPP will improve significantly. Combined with the use of stereo and multi-channel audio, the naturalness of communication can take a major step towards a face-to-face user experience. Interoperability with legacy codecs provides service continuity, and scalability enables the same codec to be dynamically adapted to the available transmission conditions and resources, hence guaranteeing the best possible quality. Scalability of video poses challenges due to the strong heterogeneity in terminal processing and display capabilities. Scalable video coding would offer an efficient solution to achieve video scalability
and to satisfy the needs of a wide range of receivers with minimal bandwidth overhead.

Acknowledgements

The authors thank the members of the 3GPP SA1 and SA4 working groups for their efforts towards new media codecs for LTE.

References

[1] 3GPP TS 26.235, Packet switched conversational multimedia applications; default codecs.
[2] 3GPP TS 26.236, Packet switched conversational multimedia applications; transport protocols.
[3] 3GPP TS 26.114, IP Multimedia Subsystem (IMS); multimedia telephony; media handling and interaction.
[4] 3GPP TS 26.234, Transparent end-to-end packet-switched streaming service (PSS); protocols and codecs.
[5] 3GPP TS 26.346, Multimedia broadcast/multicast service (MBMS); protocols and codecs.
[6] 3GPP TS 26.237, IP multimedia subsystem (IMS) based packet switched streaming (PSS) and multimedia broadcast/multicast service (MBMS) user service; protocols.
[7] 3GPP TS 26.140, Multimedia messaging service (MMS); media formats and codecs.
[8] 3GPP TS 26.141, IP multimedia system (IMS) messaging and presence; media formats and codecs.
[9] 3GPP TR 26.935, Packet switched (PS) conversational multimedia applications; performance characterization of default codecs.
[10] 3GPP TS 26.071, Mandatory speech codec speech processing functions, AMR speech codec; general description.
[11] K. Järvinen, Standardisation of the adaptive multi-rate codec, in: Proceedings of the X European Signal Processing Conference (EUSIPCO), Tampere, Finland, September 2000.
[12] 3GPP TS 26.171, Speech codec speech processing functions; adaptive multi-rate wideband (AMR-WB) speech codec; general description.
[13] B. Bessette, R. Salami, R. Lefebvre, M. Jelínek, J. Rotola-Pukkila, J. Vainio, H. Mikkola, K. Järvinen, The adaptive multirate wideband speech codec (AMR-WB), IEEE Transactions on Speech and Audio Processing 10 (8) (2002) 620–636.
[14] 3GPP TS 26.290, Audio codec processing functions; extended adaptive multi-rate wideband (AMR-WB+) codec; transcoding functions.
[15] R. Salami, R. Lefebvre, A. Lakaniemi, K. Kontola, S. Bruhn, A. Taleb, Extended AMR-WB for high-quality audio on mobile devices, IEEE Communications Magazine 44 (5) (2006) 90–97.
[16] 3GPP TS 26.401, General audio codec audio processing functions; enhanced aacPlus general audio codec; general description.
[17] 3GPP TR 22.813, Study of use cases and requirements for enhanced voice codecs in the evolved packet system (EPS).
[18] M. Tammi, L. Laaksonen, A. Rämö, H. Toukomaa, Scalable superwideband extension for wideband coding, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, April 19–24, 2009, pp. 161–169.
[19] C. Faller, F. Baumgarte, Binaural cue coding: a novel and efficient representation of spatial audio, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002, vol. 2, pp. 1841–1844.
[20] ISO/IEC Moving Picture Experts Group, ISO/IEC 14496-10, MPEG-4 Part 10: Advanced video coding, July 2007.
[21] 3GPP TR 26.903, Improved video support for packet switched streaming (PSS) and multimedia broadcast/multicast service (MBMS) services.