Scalable Video Coding for Multimedia Applications and Robust Transmission over Wireless Channels
Uwe Horn*, Bernd Girod*, and Ben Belzer**
*Telecommunications Institute, University of Erlangen-Nuremberg, Germany
**Electrical Engineering Department, University of California, Los Angeles, United States
ABSTRACT:
We present a scalable video codec with low decoder complexity suitable for real-time multimedia video applications. For two prototype applications we show how scalability can be utilized for a software-only Ethernet video server as well as for robust transmission over wireless channels. The quality of the received video degrades gracefully as the resources available for complete and error-free reception of the data decrease.
1. INTRODUCTION
Data loss is a problem common to the transmission of real-time video data over packet networks as well as over wireless channels. Data packets have to be received error-free within a maximum allowed delay between 100 ms and 2 sec, depending on the application. On networks which can only offer a best-effort quality of service, packet loss is very likely to occur. Especially within multimedia applications, data loss may also be caused by limited system resources such as CPU time and memory [1].

For digital TV broadcasting the situation is similar. Bit error rates up to 10^-2 may arise over periods of several seconds, especially within mobile environments. Data loss can be avoided if channel coding methods with an appropriate error protection scheme are used for digital transmission. The increase in bandwidth can be kept low if channel conditions are known a priori and remain stable over reasonably long periods of time. This does not apply to broadcast transmission over wireless channels, since many receivers exist and each has different channel statistics.

Today's video coding standards are very sensitive to data loss. Typically, a hybrid coding approach with motion-compensated prediction is used to achieve high compression gains [2] [3]. Due to motion compensation, incorrectly decoded parts of a frame tend to degrade the quality of the whole picture for several seconds, as long as no intra-coded information is received.

To overcome these problems we propose a scalable video codec that we have used for video transmission over Ethernet as well as for simulations of digital TV broadcasting in environments with high bit error rates. The coder generates a bitstream which can be decoded at several bit rates, starting from 64 kbit/s up to 4 Mbit/s. For digital broadcasting we propose to combine scalability with a forward error correction scheme based on unequal error protection. This allows graceful degradation of picture quality as the available bandwidth for error-free transmission decreases, instead of a sudden breakdown. The partial decodability feature may also be extremely useful for low-complexity transcoding of multicast transmissions within hybrid networks [4].
2. DESCRIPTION OF THE SCALABLE CODEC
Our approach is based on a spatio-temporal resolution pyramid [5]. Fig. 1 shows the spatio-temporal resolution pyramid we used as the basis for the applications described in the following sections [6, 7]. As can be seen, layers 2 and 1 roughly correspond to QCIF and CIF resolution, respectively. The highest resolution layer contains an interlaced video signal in its full spatial and temporal resolution according to the ITU-R 601-4 standard [8]. The spatial and temporal resolutions of all layers are listed on the right. The advantage of the chosen pyramid layout is that progressive image formats, as needed on computers, as well as interlaced image formats, as used for the digital representation of TV signals, can be decoded from the same bitstream. Note that the number of samples to code increases by about 53%.

[Figure 1: Spatio-temporal pyramid setup. Layer 3: 88x60, 3.33 Hz; Layer 2: 176x120, 10 Hz; Layer 1: 352x240, 60 Hz; Layer 0: 704x240, 60 Hz. The group of pictures structure (I-, P- and B-pictures) is shown at the bottom.]

For efficient encoding of this oversampled signal representation, predictive pyramid coders as shown in Fig. 2(a) and (b) can be used. Without loss of generality, we initially consider only two-layer pyramids. Fig. 2(a) shows an open-loop coder, whereas Fig. 2(b) shows a closed-loop coder, which results from Fig. 2(a) by including noise feedback. Common to both approaches are filters for downsampling and interpolation. The decoder for both coder structures is identical (Fig. 2(c)).

[Figure 2: Two different pyramid encoders (a), (b) with corresponding decoder (c). (a) Open-loop coder; (b) closed-loop coder; (c) common decoder. Blocks: lowpass filter and downsampling, upsampling and interpolation filter, quantization (Q). Signals: image, coarse resolution, fine resolution.]

Critically sampled subband coding schemes [9, 10, 11] need carefully designed filters to achieve perfect reconstruction and aliasing cancellation. In pyramid decompositions, filters can be designed more freely according to aspects like subjective image quality and complexity. Another advantage is that multiscale motion compensation can be easily incorporated, as described in [7]. The filters we use in our simulations have impulse responses [1/2 1/2] prior to downsampling and [1/4 3/4 3/4 1/4] after upsampling. For two-dimensional filtering they are applied separately in the horizontal and vertical directions. Both filters can be implemented without multiplications. In particular, interpolation at the decoder is very inexpensive.

For encoding, the input signal is first filtered and then downsampled. The coarse resolution layer is quantized and transmitted. The open-loop coder (Fig. 2(a)) uses the coarse resolution signal prior to quantization to generate a prediction for the fine resolution layer. With noise feedback (Fig. 2(b)), the signal after quantization is interpolated and used as the prediction. In both cases the resulting prediction error is quantized and transmitted to the decoder, where it is used to reconstruct the fine resolution layer. The bitrate in each layer can be controlled independently by the corresponding quantizer. In the given example two bitstreams are transmitted. At a low bitrate only the coarse resolution layer is decoded. Decoding of the fine resolution layer needs a higher bitrate for the additional transmission of the interpolation error.

Using more than one quantizer for encoding one frame leads to the problem of optimum bit allocation: given a certain number of bits to be spent on one frame, how should these bits be allocated among the various layers? By using a Lagrangian approach, it can be shown that the allocation is optimal if it meets the equal distortion-rate slope condition [12, 13]. As shown by Ramchandran et al. [13], and also verified by our own experiments, the closed-loop approach is less sensitive to suboptimal solutions. Additional constraints, like comparable picture qualities in all layers, can be added without noticeable penalty. Therefore, we decided to use a closed-loop coder for all experiments described in the following. Automatic bitrate control with low complexity is still a matter of ongoing research. In our simulations we compute an optimum bit allocation for a given rate only for the first frame and then apply the result to all following frames.

The predictive coders described above efficiently exploit redundancy in the spatial domain. Further improvement can be gained by applying motion compensation (MC), which also exploits redundancy in the temporal domain. For partial bitstream decodability and efficient compression at all bitrates it is essential that MC is used within all spatial resolution layers, which is commonly referred to as multiscale motion compensation (MSMC) [14, 7]. The group of pictures structure we have chosen for our experiments can be seen at the bottom of Fig. 1.

Signals within the pyramid are quantized by fast E8-lattice vector quantization [15]. The most probable vectors are kept in a small codebook [16]. They can be decoded by a simple table lookup.
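The two-layer coders of Fig. 2 can be illustrated with a minimal one-dimensional sketch. It is not the implementation used in our experiments: a uniform scalar quantizer stands in for the E8-lattice vector quantizer, border samples are simply replicated, and all function names and step sizes are assumptions chosen for illustration. The sketch uses the [1/2 1/2] and [1/4 3/4 3/4 1/4] filters and shows that, with noise feedback, the fine-layer reconstruction error is bounded by half the fine quantizer step regardless of the coarse quantizer.

```python
def downsample(x):
    """[1/2 1/2] lowpass filter followed by 2:1 downsampling."""
    return [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]


def interpolate(c):
    """1:2 upsampling followed by the [1/4 3/4 3/4 1/4] filter.

    Border samples are replicated (an assumption of this sketch).
    """
    fine = []
    for i, v in enumerate(c):
        left = c[max(i - 1, 0)]
        right = c[min(i + 1, len(c) - 1)]
        fine.append(0.75 * v + 0.25 * left)
        fine.append(0.75 * v + 0.25 * right)
    return fine


def quantize(x, step):
    """Uniform scalar quantizer (stand-in for lattice VQ)."""
    return [step * round(v / step) for v in x]


def encode(x, coarse_step, fine_step, closed_loop=True):
    """Return the quantized coarse layer and the quantized prediction error."""
    coarse = downsample(x)
    coarse_q = quantize(coarse, coarse_step)
    # Closed loop: predict from the quantized coarse layer (noise feedback).
    # Open loop: predict from the coarse layer before quantization.
    reference = coarse_q if closed_loop else coarse
    prediction = interpolate(reference)
    residual = [a - p for a, p in zip(x, prediction)]
    return coarse_q, quantize(residual, fine_step)


def decode(coarse_q, residual_q=None):
    """Common decoder: coarse layer only, or coarse layer plus refinement."""
    prediction = interpolate(coarse_q)
    if residual_q is None:
        return prediction
    return [p + r for p, r in zip(prediction, residual_q)]


signal = [10, 12, 15, 19, 24, 30, 37, 45]
coarse_q, residual_q = encode(signal, coarse_step=8, fine_step=2)
low_quality = decode(coarse_q)                # coarse layer only
full_quality = decode(coarse_q, residual_q)   # both layers
```

In the open-loop variant (`closed_loop=False`) the decoder's prediction differs from the one used at the encoder, so the coarse-layer quantization error adds to the fine-layer error.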
Synchronization marks are placed within the bitstream right before each row of blocks. Block sizes are 4x4 for layer 3, 8x8 for layers 2 and 1, and 16x8 for fields in the highest resolution layer.
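The equal-slope condition for bit allocation can be illustrated by a small sketch of Lagrangian allocation in the style of [12]. It is a simplification: it assumes independent quantizers, each described by a hypothetical list of (rate, distortion) operating points, whereas the quantizers in a closed-loop pyramid are dependent [13]. The function names, the bisection range and the example numbers are assumptions made for illustration.

```python
def pick(points, lam):
    """Operating point minimizing the Lagrangian cost D + lam * R."""
    return min(points, key=lambda p: p[1] + lam * p[0])


def allocate(layers, budget, iterations=50):
    """Bisect on the multiplier lam until the rate budget is met.

    layers: one list of (rate, distortion) points per layer.
    Assumes the lowest-rate points together fit into the budget.
    """
    lo, hi = 0.0, 1e6
    best = [pick(points, hi) for points in layers]
    for _ in range(iterations):
        lam = (lo + hi) / 2
        choice = [pick(points, lam) for points in layers]
        if sum(rate for rate, _ in choice) <= budget:
            best, hi = choice, lam   # feasible: try to spend more bits
        else:
            lo = lam                 # over budget: penalize rate more
    return best


# Hypothetical operating points (rate, distortion) for two layers.
layers = [
    [(1, 10.0), (2, 4.0), (3, 2.0)],
    [(1, 9.0), (2, 6.0), (3, 5.5)],
]
allocation = allocate(layers, budget=4)
```

Because every layer is evaluated with the same multiplier, all selected operating points lie at the same distortion-rate slope, which is the optimality condition stated above.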
2.1. Coding performance and software-only decoding
Fig. 3 shows the performance of the closed-loop codec for four test sequences. For each sequence, PSNR values for decoding at four different bitrates, corresponding to the four resolution layers, are shown. PSNR for the original image resolution was measured after interpolating the decoded layer to the finest resolution layer with the [1/4 3/4 3/4 1/4] filter.

[Figure 3: Coding performance. PSNR [dB] versus bitrate (luminance) [Mbit/s] for the test sequences Football, Flowergarden, Tabletennis, and Mobile & Calendar.]

Typical bit rates with increasing spatio-temporal layer resolution are 64 kbit/s, 330 kbit/s, 2.0 Mbit/s and 4.7 Mbit/s. The not fully optimized software-only implementation of the decoder is able to decode and display layers 3 and 2 in real-time on a Sun SparcStation5 as well as on an SGI Indigo2 (Fig. 4). Decoding of the full resolution layer would need a CPU that is ten times faster. It can be seen that scalability allows computation-limited decoding, a desired feature for multimedia video applications [1].

[Figure 4: Real-time performance of the decoder. Decoded frame rate relative to the real-time frame rate (axis ticks 0.1, 1 = real-time, 10) on a Sparc 5 and an Indigo2 for layers L3 (base), L2 (QCIF), L1 (CIF), and L0 (CCIR 601).]

We note that much of the decoding time is spent on bitstream parsing. The decoder needs 16.6 sec just to parse 5 sec of compressed video in ITU-R 601-4 resolution. After optimizing variable length code (VLC) decoding this could be reduced to 12.1 sec, which is still more than a factor of two slower than real-time.

3. A SOFTWARE-ONLY ETHERNET VIDEO SERVER
Based on the source coder described above, we implemented a software-only video server application called NetClip. As a testbed we used our existing Ethernet with about 70 attached workstations and PCs. We also tested transmission over a wireless Ethernet link built from two MeshNet2 repeaters with a maximum achievable transmission rate of 1 Mbit/s. The server machine (a SparcStation20) has access to several compressed video clips on its local disk. For transmission, a packet assembler parses the compressed video data, puts it into packets and sends the packets to the remote site. The receiving process disassembles the received packets, decodes the contained data and displays the result on the workstation monitor.

[Figure 5: NetClip application]
A screenshot of the running client application can be seen in Fig. 5. The control window on the right lets the user choose among the available clips by clicking the topmost button. Two image qualities, corresponding to the base and QCIF resolution layers, can be selected. Choosing 'Large' as the display size doubles the size of the displayed picture by further interpolating the decoded picture. The control window also displays the current bitrate as well as the packet loss rate.
We use the RTP protocol [17] for session control and data transmission. For Ethernet transmission the packet size was chosen as 1024 bytes. The packet assembler tries to start each packet at a synchronization mark within the bitstream. This allows faster resynchronization at the receiver when packet loss occurs [18]. Packets are aligned with synchronization marks only if the overhead introduced by the alignment stays below 25% of the packet size. The packet assembler also puts priority information into the RTP packet header; this information is derived from the layer the packet belongs to. The video server is set up to introduce a minimum inter-packet delay of 8 ms, which corresponds to a peak transmission rate of 1 Mbit/s.
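The packetization rule can be sketched as follows. This is one possible reading of the rule, not the NetClip code: the sketch closes a packet early at the last synchronization mark it contains whenever the bytes left unused by doing so stay below 25% of the packet size, and otherwise fills the packet completely. The function names and the byte-offset representation of synchronization marks are assumptions. The last function reproduces the peak rate implied by 1024-byte packets sent every 8 ms.

```python
PACKET_SIZE = 1024            # bytes, as chosen for Ethernet transmission
MAX_OVERHEAD = 0.25           # fraction of the packet size
INTER_PACKET_DELAY_MS = 8     # minimum delay between packets


def packetize(stream_length, sync_marks):
    """Split a bitstream of stream_length bytes into (start, end) packets.

    sync_marks: ascending byte offsets of synchronization marks.
    """
    packets = []
    pos = 0
    while pos < stream_length:
        limit = min(pos + PACKET_SIZE, stream_length)
        end = limit
        if limit < stream_length:
            # Sync marks at which this packet could end early.
            candidates = [s for s in sync_marks if pos < s <= limit]
            if candidates and limit - candidates[-1] < MAX_OVERHEAD * PACKET_SIZE:
                end = candidates[-1]   # next packet starts at a sync mark
        packets.append((pos, end))
        pos = end
    return packets


def peak_rate_bits_per_second():
    """Peak rate for full packets sent at the minimum inter-packet delay."""
    return PACKET_SIZE * 8 * 1000 / INTER_PACKET_DELAY_MS


packets = packetize(3000, [0, 900, 1500, 2600])
```

In this example only the first packet is shortened (to 900 bytes); for the later marks the unused space would exceed 256 bytes, so those packets are filled completely. The peak rate evaluates to 1.024 Mbit/s, consistent with the stated 1 Mbit/s.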
Header Layer 3 Layer 2 Layer 1 Layer 0
I-Picture P-Picture B-Picture 1/2 2/3 2/3 1/2 4/7 2/3 2/3 4/5 2/3 1/1 1/1 1/1
Table 1: RCPC code rates used for unequal error protection

Packet loss can be easily detected from the packet's sequence number. Error concealment uses the already decoded information from the next lower resolution layer in place of the missing packet. If a packet from the lowest resolution layer is lost, information for the missing packet is taken from the preceding frame.
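Loss detection and the choice of concealment source can be sketched in a few lines. The sketch assumes 16-bit sequence numbers that wrap around and packets that arrive in order; the function names and the string return values are hypothetical. Layer numbering follows Fig. 1, where layer 3 has the lowest and layer 0 the highest resolution.

```python
def lost_sequence_numbers(received, modulus=1 << 16):
    """Sequence numbers missing between consecutively received packets."""
    lost = []
    for prev, cur in zip(received, received[1:]):
        gap = (cur - prev) % modulus
        lost.extend((prev + k) % modulus for k in range(1, gap))
    return lost


def concealment_source(layer, base_layer=3):
    """Where replacement data for a lost packet of the given layer comes from."""
    if layer == base_layer:
        return "previous frame"
    return f"layer {layer + 1}"   # next lower resolution layer


missing = lost_sequence_numbers([65534, 65535, 1, 2, 5])
```

Here the packets with sequence numbers 0, 3 and 4 are reported as lost, including the one lost across the wrap-around.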
[Figure 6: Packet loss due to decoder buffer overflow caused by CPU load. Lost packets versus packet number (0 to 800) for (a) low CPU load and (b) higher CPU load.]

Within our local network the maximum achievable picture quality is mainly limited by the real-time performance of the decoder. Up to bitrates of 1 Mbit/s there is no significant delay or packet loss caused by Ethernet transmission. Nevertheless, packet loss may occur due to buffer overflow at the receiver when CPU speed is limited. This type of packet loss is comparable to congestion loss in gateways. In our environment, it occurs whenever the transmission rate is above 300 kbit/s and double display size is selected, such that the decoder has to perform a two-dimensional interpolation right after a picture is decoded. Typical packet loss statistics caused by buffer overflow are shown in Fig. 6 for low and high CPU load. Each vertical line stands for the loss of one packet. The total number of packets sent was 670, corresponding to nearly 20 seconds of video. The loss of consecutive packets at the start of the transmission, common to both cases, is caused by time-consuming user interface operations, like the creation of a display window right after reception of the sequence parameters. The same results are achievable with a wireless 1 Mbit/s Ethernet link (MeshNet2) as part of the network connection.

4. DIGITAL TV BROADCASTING
To achieve graceful degradation in high bit error rate environments like digital TV broadcasting, we can combine the scalable source coder with an unequal error protection scheme by using rate-compatible punctured convolutional codes (RCPC codes) [19] of different code rates. The packet size in this simulation is 360 bits. According to the importance of the data contained in the encoded and packetized bitstream, we apply code rates as shown in Table 1. The entry K/N means that K input bits are coded with N output bits. As more output bits are generated per input bit, error protection increases. We compare this unequal error protection scheme with an equal error protection scheme in which all packets are protected with a code rate of 4/5. In both cases the increase in bit rate due to error protection is 26%.

Simulation results for five different bit error rates on a binary symmetric channel (BSC) can be seen in Fig. 7. The average PSNR over the whole sequence is plotted against the bit error rate. As expected, unequal error protection performs better as the bit error rate increases. At a bit error rate of 10^-2, decoding of the equal error protected bitstream became impossible, while unequal error protection still gives a PSNR of 20 dB.

The bitstream at the output of the channel decoder may still contain corrupted bits which cannot easily be detected. Residual bit errors within a slice are detected by illegal code words or at synchronization marks. Error concealment replaces the whole slice by decoded information taken either from the lower resolution layer or, if in the base layer, from the corresponding slice of the preceding frame.
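The relation between code rate and transmitted bits can be illustrated by a minimal puncturing sketch. Everything specific in it is an assumption made for illustration: the rate-1/2 mother code with generators 7 and 5 (octal), the puncturing period of 4, the nested puncturing tables (which are not taken from [19]), the omission of tail bits, and the reading of 360 bits as the number of information bits per packet. What the sketch does show is the rate-compatibility property: every bit sent at a higher code rate is also sent at all lower code rates.

```python
from fractions import Fraction

# Hypothetical mother code: rate 1/2, constraint length 3.
GENERATORS = (0b111, 0b101)
PERIOD = 4

# Illustrative nested puncturing tables (1 = transmit the bit).
# Row i applies to the output of generator i.
PUNCTURE = {
    Fraction(1, 2): ((1, 1, 1, 1), (1, 1, 1, 1)),
    Fraction(4, 7): ((1, 1, 1, 1), (1, 1, 1, 0)),
    Fraction(2, 3): ((1, 1, 1, 1), (1, 0, 1, 0)),
    Fraction(4, 5): ((1, 1, 1, 1), (1, 0, 0, 0)),
    Fraction(1, 1): ((1, 1, 1, 1), (0, 0, 0, 0)),
}


def mother_encode(bits):
    """Rate-1/2 convolutional encoding without tail bits."""
    state = 0
    pairs = []
    for b in bits:
        state = ((state << 1) | b) & 0b111
        pairs.append(tuple(bin(state & g).count("1") % 2 for g in GENERATORS))
    return pairs


def rcpc_encode(bits, rate):
    """Encode with the mother code, then drop the punctured bits."""
    table = PUNCTURE[rate]
    coded = []
    for t, pair in enumerate(mother_encode(bits)):
        for row, bit in zip(table, pair):
            if row[t % PERIOD]:
                coded.append(bit)
    return coded


packet = [0, 1] * 180                                # 360 information bits
coded_lengths = {rate: len(rcpc_encode(packet, rate)) for rate in PUNCTURE}
```

Under these assumptions a 360-bit packet grows to 450 bits at code rate 4/5, an increase of 25%; the sketch ignores tail bits and any other overhead, so it does not reproduce the 26% figure exactly.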
[Figure 7: Comparison between UEP and EEP for the Flowergarden sequence. PSNR (dB) versus bit error rate (10^-4, 10^-3, 5*10^-3, 10^-2) for unequal and equal error protection; with equal error protection, decoding becomes impossible at the highest bit error rate.]

5. CONCLUSION
We presented a scalable video codec based on a spatio-temporal pyramid with low decoder complexity. For two packet video applications, we have shown that scalability allows robust transmission of compressed video data over Ethernet as well as over wireless channels. In our implemented application of an Ethernet video server we observed that the most probable source of packet loss was buffer overflow in case of limited CPU time for real-time decoding. In the digital TV broadcast scenario, scalability can be efficiently combined with unequal error protection and can also be utilized for error concealment. Future work will include simulations for different loss models concerning transmission over ATM (asynchronous transfer mode) networks.

6. ACKNOWLEDGEMENT
The work of Niko Färber in implementing the RCPC encoding and decoding routines is gratefully acknowledged.

References
[1] B. Girod. Scalable video for multimedia systems. Computers & Graphics, 17(3):269-276, 1993.
[2] CCITT. Recommendation H.261, Video Codec for Audiovisual Services at p x 64 kBit/s, 1990.
[3] ISO/IEC. Document 13818-2, Generic Coding of Moving Pictures and Associated Audio, Part 2: Video, Recommendation H.262, Draft International Standard, Mar. 1994.
[4] T. Turletti and J.-C. Bolot. Issues with multicast video distribution in heterogeneous packet networks. In Proc. 6th International Workshop on Packet Video, Portland, Oregon, Sep. 1994.
[5] M.K. Uz, M. Vetterli, and D.J. LeGall. Interpolative multiresolution coding of advanced television with compatible subchannels. IEEE Trans. on Circuits and Systems for Video Technology, 1(1):86-99, Mar. 1991.
[6] U. Horn and B. Girod. Pyramid coding using lattice vector quantization for scalable video applications. In Proc. PCS '94, Sacramento, CA, 1994.
[7] U. Horn, B. Girod, and B. Belzer. Scalable video coding with multiscale motion compensation and unequal error protection. In Proc. International Symposium on Multimedia Communications and Video Coding, New York, NY, Oct. 1995.
[8] ITU-R. Encoding parameters of digital television for studios. Recommendation 601-4, 1982.
[9] A.N. Akansu and R.A. Haddad. Multiresolution Signal Decomposition. Academic Press, San Diego, 1992.
[10] J.W. Woods (ed.). Subband Image Coding. Kluwer Academic Publishers, Boston, 1991.
[11] J. Katto and Y. Yasuda. Performance evaluation of subband coding and optimization of its filter coefficients. Journal of Visual Communication and Image Representation, 2(4):303-313, Dec. 1991.
[12] Y. Shoham and A. Gersho. Efficient bit allocation for an arbitrary set of quantizers. IEEE Trans. on Acoustics, Speech, and Signal Processing, 36(9):1445-1453, Sep. 1988.
[13] K. Ramchandran, A. Ortega, and M. Vetterli. Bit allocation for dependent quantization with applications to multiresolution and MPEG video coders. IEEE Trans. on Image Processing, 3(5):533-545, Sep. 1994.
[14] T. Hanamura, W. Kameyama, and H. Tominaga. Hierarchical coding scheme of video signal with scalability and compatibility. Signal Processing: Image Communication, 5(1-2):159-184, Feb. 1993.
[15] J.H. Conway and N.J.A. Sloane. A fast encoding method for lattice codes and quantizers. IEEE Trans. on Information Theory, IT-29(6):820-824, Nov. 1983.
[16] B. Girod, F. Hartung, and U. Horn. Subband image coding. In A.N. Akansu and M.J.T. Smith, editors, Subband and Wavelet Transforms, chapter 7. Kluwer Academic Publishers, Norwell, MA, 1995.
[17] H. Schulzrinne and S. Casner. RTP: a transport protocol for real-time applications. INTERNET-DRAFT, Jul. 1994.
[18] M. Leditschke and M. Biggar. Slice based approaches to improving the error robustness of MPEG-like compressed video bitstreams. In Proc. 6th International Workshop on Packet Video, Portland, Oregon, Sep. 1994.
[19] J. Hagenauer. Rate-compatible punctured convolutional codes (RCPC codes) and their applications. IEEE Trans. on Communications, COM-36:389-400, Apr. 1988.