Scalable Codec Architectures for Internet Video-on-Demand

Bernd Girod, Niko Färber, and Uwe Horn
Telecommunications Laboratory, University of Erlangen-Nuremberg
Cauerstr. 7, 91058 Erlangen, Germany
{girod,faerber,uhorn}@nt.e-technik.uni-erlangen.de

Abstract

The heterogeneous structure of the Internet is a great obstacle to establishing real-time video services. Scalable video codecs, which generate bit-streams decodable at different rates, have been proposed to address this heterogeneity problem. In this paper, we review standard-compliant and non-compliant codec architectures for Internet video-on-demand. For compression based on the H.263 standard, we have developed a compatible architecture that allows switching between pre-encoded bit-streams of different bit-rates. This architecture provides excellent streaming performance for point-to-point communication scenarios that offer a low-delay feedback channel. We then present a non-compliant, fully scalable video codec based on a spatio-temporal resolution pyramid. This approach can encode embedded lower bit-rate layers at the same overall bit-rate as needed by H.263 single-layer coding and can also support multicasting. The complexity of both schemes is sufficiently low to allow software-only implementations of Internet video services. This is demonstrated by means of an implemented World Wide Web video server application.

1. Introduction

Information systems and distributed applications for the Internet show a growing demand for real-time audiovisual services. Real-time audio services can already be found on several World Wide Web (WWW) servers as well as within the Internet MBone tools. For video services, the heterogeneous structure of the Internet is still a great obstacle. Reservation protocols alone, e.g. RSVP (resource reservation protocol [14]), cannot solve the heterogeneity problem. Consider an Internet video server with sequences encoded at a predetermined bit-rate. If the bit-rate is sufficiently low, nearly everybody connected to the Internet can view the videos in real-time, at the cost of very poor quality. But those who can connect to the server at higher speeds do not benefit from their higher bandwidth, because the sequences have been encoded at a much lower bit-rate. On the other hand, if the sequences were encoded at higher bit-rates, those connecting to the server at lower speeds could no longer view the videos in real-time. Storing bit-streams for different bit-rates is possible but requires certain precautions to allow switching between bit-rates during a transmission.

Scalable video coding has been proposed to address the above problem. A scalable video coder produces a bit-stream decodable at different bit-rates. It allows computation-time- and memory-limited decoding on less powerful hardware platforms [5], and it can substantially improve the quality and acceptance of video services on the Internet. For video servers, sequences need only be encoded once. Depending on the available bit-rate, the server selects more or fewer bits from the bit-stream to be sent to the receiver. For broadcast applications, a layered multicast scheme can be used: the different bit-rate layers are distributed into several multicast groups, and picture quality increases with the number of received multicast groups. For videoconferences with several participants, the RSVP protocol can be used for bandwidth allocation over each intermediate link. If necessary, the scalable bit-stream can easily be pruned at routers so that the allocated bandwidth is utilized in the most efficient way.

In this paper we propose two different approaches for Internet video-on-demand and compare their performance with H.263 simulcast. The comparison is based on a simulation scenario with two quality levels, which is described in the following section. After comparing the resulting bit-rates for a given target PSNR, we finally provide simulation results for transmission over a constant bit-rate channel.
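To make the layered-multicast idea concrete, the following minimal sketch shows how a receiver (or a pruning router) could decide how many cumulative layers fit into its available bandwidth. The layer rates and function name are illustrative assumptions, not taken from the paper:

```python
def layers_for_bandwidth(layer_rates_kbps, available_kbps):
    """Return how many cumulative layers fit into the available bandwidth.

    layer_rates_kbps: incremental rate of each layer, base layer first.
    A receiver always subscribes to a prefix of the layer list.
    """
    used = 0.0
    count = 0
    for rate in layer_rates_kbps:
        if used + rate > available_kbps:
            break
        used += rate
        count += 1
    return count

# Hypothetical example: a 70 kbps base layer (QCIF) plus a 160 kbps
# enhancement layer (CIF); the numbers are illustrative only.
layers = [70.0, 160.0]
print(layers_for_bandwidth(layers, 100.0))   # slow receiver: base layer only
print(layers_for_bandwidth(layers, 300.0))   # fast receiver: both layers
```

Because each receiver takes a prefix of the layers, a router can drop the highest multicast groups on a congested link without breaking decodability of the remaining ones.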

2. Simulation scenario

The simulation results in this paper were obtained under the assumption of an Internet video-on-demand situation. We require that the encoder can generate two different picture qualities, such that in case of network congestion the higher quality bit-stream can be discarded. We further assume that the bandwidth needed for the transmission of the low bit-rate layer can be guaranteed during the whole transmission. The low quality layer is encoded in QCIF resolution at 15 fps, while the high quality layer is encoded in CIF resolution at 30 fps. For both layers the target PSNR is 34 dB, which is achieved with only little variation for all investigated approaches. The sequences are encoded with fixed quantizers and no rate control. If the decoder wants to switch from lower to higher bandwidth during the transmission, the delay until pictures of better quality are displayed is an important design parameter. We require a maximum delay of 320 msec (i.e. every 8 frames) for the simulations in this paper. All results are based on the MPEG-4 test sequence SEAN, frames 1-96.

3. H.263 simulcast

We compare the proposed architectures with simulcasting of two H.263 encoded bit-streams. Since we desire a maximum delay of 320 msec for switching from lower to higher quality, an I-frame has to be inserted in the CIF layer every 8 frames. I-frames in the QCIF layer are not needed, as long as the QCIF H.263 decoder keeps decoding the low quality layer even while the higher quality layer is used for display. Note that this requires two running decoders at the receiver. H.263 coding simulations were carried out with mode decision according to TMN-5. No special coding options were used. Fig. 1 illustrates the simulation model used for simulcast and the frame dependencies caused by prediction. The frame types are 'I' (intra) and 'Pt' (motion-compensated, temporal prediction).

Figure 1. Simulation model for simulcast. (Two independently encoded streams: CIF at 30 fps with an I-frame every 8 frames, QCIF at 15 fps; all other frames are Pt-frames.)

Fig. 2 shows the transmitted bits during a switch from low to high quality. It can be seen that I-frames in the CIF layer require about 8 times as many bits as Pt-frames and therefore contribute significantly to the overall bit-rate. Because both bit-streams are encoded independently, no advantage is taken of the fact that the QCIF layer could be used to improve the coding efficiency of the CIF layer. This could be achieved by using interpolated frames from the QCIF layer to spatially predict frames of the CIF layer, or by encoding side information more effectively. This approach results in a fully scalable codec, as proposed in the next section.

Figure 2. Simulcast: Transmitted bits during a switch from low to high quality. (Bits per frame over time in units of 1/30 s; a QCIF-only period is followed by the QCIF + CIF period.)

4. Scalable coding

Our approach to scalable video coding is based on a spatio-temporal resolution pyramid. This kind of multiresolution decomposition was first proposed by Uz and Vetterli in 1991 [12, 13]. The simulation model used is shown in Fig. 3 and is very similar to simulcast, except for the two additional frame types 'Pst' and 'Ps', which are described below.

Figure 3. Simulation model for the scalable codec. (CIF layer at 30 fps using frame types I, Ps, Pt, and Pst; QCIF layer at 15 fps using frame types I and Pt.)
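To make the pyramid decomposition concrete, here is a minimal two-level spatial pyramid in plain Python. It is an illustrative sketch only: the frame is downsampled by 2x2 averaging, the coarse level is interpolated back up by simple replication, and only the coarse level plus the interpolation residual need to be encoded. The paper's actual filters and quantizers are described in the text below:

```python
def downsample(frame):
    """Halve resolution by averaging non-overlapping 2x2 blocks."""
    h, w = len(frame), len(frame[0])
    return [[(frame[y][x] + frame[y][x + 1] +
              frame[y + 1][x] + frame[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def upsample(frame):
    """Double resolution by pixel replication (a very simple interpolator)."""
    out = []
    for row in frame:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

def decompose(frame):
    """Split a frame into a coarse layer and a detail residual."""
    coarse = downsample(frame)
    predicted = upsample(coarse)
    residual = [[frame[y][x] - predicted[y][x]
                 for x in range(len(frame[0]))]
                for y in range(len(frame))]
    return coarse, residual

def reconstruct(coarse, residual):
    """Invert decompose(): spatial prediction plus residual."""
    predicted = upsample(coarse)
    return [[predicted[y][x] + residual[y][x]
             for x in range(len(residual[0]))]
            for y in range(len(residual))]

# A low-bandwidth receiver decodes only `coarse`; a high-bandwidth
# receiver additionally decodes `residual` for full resolution.
frame = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
coarse, residual = decompose(frame)
assert reconstruct(coarse, residual) == frame
```

In the codec itself the residual is of course quantized rather than stored losslessly, and temporal prediction is applied within each level, but the layered structure is the same.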

A scalable codec superior to simulcasting has to exploit the spatio-temporal redundancies of the pyramid decomposition with an efficient compression technique. In our approach we have combined low-complexity downsampling and interpolation filters with highly efficient lattice vector quantization. Decoder complexity is sufficiently low to allow software-only implementations on today's PCs and workstations. Encoder complexity is mainly determined by motion estimation.

For I-frames, the original frame is successively filtered and downsampled by a simple averaging filter, applied separately in the horizontal and vertical directions. The lowest resolution layer is encoded by a DPCM technique [11]. For all other layers, a spatial prediction is formed by interpolating the lower resolution layer with an interpolation filter, again applied horizontally and vertically. Spatially predicted frames are denoted 'Ps'; spatial prediction can also be used for any other type of lower resolution frame, e.g. Pt-frames. In the above simulation model, Ps-frames are used to allow switching from the QCIF to the CIF layer and are inserted every 8 frames in the CIF layer.

The residual prediction error quantizer uses an 8-dimensional lattice vector quantizer (LTVQ), as shown in Fig. 4 [6]. We selected the E8 lattice, since it is the best known lattice quantizer in eight dimensions [3]. Furthermore, fast algorithms for nearest-lattice-point rounding are given in [1, 2]. For encoding, a block of neighboring samples is mapped into an 8-dimensional vector. This vector is scaled by a factor 1/Δ, where Δ corresponds to the quantizer step size in one-dimensional quantization. By varying Δ, the bit-rate of the quantizer can be controlled. The scaled vector, a point in R^8, is rounded to its nearest E8 lattice point using the algorithm given in [1]. From the obtained lattice point, an index is computed, which is then transmitted to the decoder. The decoder can reconstruct the lattice point from the received index, either by computation or by a simple table lookup. By rescaling the reconstructed lattice point with Δ, the finally reconstructed input block is obtained.

Figure 4. Encoding and decoding with lattice vector quantization (LTVQ).

Besides I-frames, we also use predicted frames in our coding scheme. Motion-compensated prediction is block-based and works within each layer similarly to motion-compensated hybrid coders [7, 8]. Motion vectors are estimated and coded hierarchically: motion vectors found for a lower resolution layer are used to predict motion vectors in the next higher layer. Within predicted frames, we distinguish between temporal prediction (frame type 'Pt') and spatio-temporal prediction (frame type 'Pst'). Spatio-temporal prediction is only possible when a lower resolution frame with the same temporal reference is available. Then each block can be predicted either spatially, from the interpolated lower resolution frame with the same temporal reference, or temporally, from the previous frame of the same resolution.

Fig. 5 shows the transmitted bits during a switch from low to high quality. The performance of P-frames in the QCIF and CIF layers is very similar to H.263. Interestingly, the sizes of Pst-frames and Pt-frames in the CIF layer are almost identical, indicating that no advantage can be taken of the presence of the QCIF layer. However, a significant gain is obtained by replacing the I-frames in the CIF layer with spatially predicted Ps-frames: the size of those frames is reduced from approximately 80 kbit to 60 kbit.

Figure 5. Scalable codec: Transmitted bits during a switch from low to high quality. (Bits per frame over time in units of 1/30 s; a QCIF-only period is followed by the QCIF + CIF period.)

Based on the scalable coder described above, we have implemented a software-only video server application called NetClip [9] and integrated it into the World Wide Web (WWW). The decoder is available as a Netscape plugin for various Unix hardware platforms [10]. The HTTP server machine holds on its local disk several video clips that have been compressed with our scalable codec.

5. S-frame architecture

In both of the above approaches, the periodic transmission of I-frames (or Ps-frames) significantly affects the overall bit-rate of the high quality bit-stream. Also note that I-frames (Ps-frames) do not provide any additional functionality during normal transmission of the CIF layer. In [4] we have therefore proposed a new architecture for the standard-compatible storage and transmission of video that avoids the overhead of periodic I-frames (Ps-frames). In this architecture, both layers are encoded independently using only the most efficient, temporally predicted Pt-frames. A switch to the CIF layer is made possible by dynamically assembling the bit-stream at the video server

whenever a switch is requested by the client. Between the last frame of the QCIF layer and the first frame of the CIF layer, a specially encoded "switch-frame" (S-frame) is inserted. S-frames are also pre-encoded and stored for each position at which a switch shall be supported, i.e., every 8 frames. Because this approach is feedback-driven, it is only feasible in point-to-point communication scenarios such as Internet video-on-demand. In contrast to the above two approaches, it is therefore not suitable for multicasting or broadcasting. Furthermore, it requires a reasonably low delay on the feedback channel to allow quick response times and interactivity. Under those conditions, however, it provides excellent streaming performance, as shown below. Fig. 6 shows the assumed simulation model, where 'S' indicates the switch-frame.

Figure 6. Simulation model for the S-frame architecture. (CIF layer at 30 fps and QCIF layer at 15 fps, each using only I- and Pt-frames; an S-frame connects the QCIF layer to the CIF layer at each supported switch position.)

The purpose of S-frames is to resynchronize the frame buffers at the encoder and decoder before any following Pt-frames of the CIF layer are transmitted. Therefore, instead of encoding the original frames, the reconstructed frames of the CIF layer are used for encoding the S-frames, with the last reconstructed frame of the QCIF layer serving as the reference frame. With this encoding strategy, the content of the frame buffer in the CIF layer can be approximated with increasing accuracy by decreasing the quantization parameter Qp. For all simulations in this paper we use Qp=3, which results in a remaining PSNR loss of less than 0.2 dB compared to decoding the CIF layer without any switch. After the switch, this loss of picture quality decreases over time.

Note that an S-frame does not require a new frame type, but can be encoded as an H.263-compatible Pt-frame. However, the change in spatial resolution requires the optional coding mode 'Reference Picture Resampling', which is part of a series of new options for H.263 that have recently been adopted by the ITU-T under the working term 'H.263+'.

Because an S-frame has to approximate the reconstructed frame of the CIF layer very closely, it requires a higher bit-rate than an I-frame. On the other hand, S-frames are only transmitted during a switch from the QCIF to the CIF layer; their influence on the average transmitted bit-rate is therefore very low. For storage, however, an S-frame has to be provided every 8 frames, so we distinguish between the effective bit-rates for storage and for transmission. Fig. 7 illustrates that a high peak in bit-rate is encountered during a switch from low to high quality. After the switch, however, the S-frame architecture provides excellent streaming performance.

Figure 7. S-frame architecture: Transmitted bits during a switch from low to high quality. (Bits per frame over time in units of 1/30 s; a QCIF-only period is followed by the QCIF + CIF period.)
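The server-side assembly logic can be sketched as follows. This is an illustrative model with toy string stand-ins for encoded frames; the actual server splices pre-encoded H.263 bit-stream segments, and the function and variable names are assumptions:

```python
def assemble_stream(qcif_frames, cif_frames, s_frames, switch_at, period=8):
    """Assemble the transmitted frame sequence for one switch request.

    qcif_frames / cif_frames: per-frame encoded units of each layer.
    s_frames: pre-encoded S-frames, one stored every `period` frames.
    switch_at: frame index at which the client requests the switch; the
    actual switch is delayed to the next multiple of `period`.
    """
    # Delay the switch to the next stored S-frame position.
    boundary = ((switch_at + period - 1) // period) * period
    out = list(qcif_frames[:boundary])          # low quality up to the boundary
    out.append(s_frames[boundary // period])    # S-frame resynchronizes buffers
    out.extend(cif_frames[boundary + 1:])       # then CIF Pt-frames only
    return out

# Toy encoded units (strings) for a 16-frame clip.
qcif = [f"qcif-Pt{i}" for i in range(16)]
cif = [f"cif-Pt{i}" for i in range(16)]
s = [f"S{i}" for i in range(3)]   # S-frames stored at frames 0, 8, 16
stream = assemble_stream(qcif, cif, s, switch_at=5)
print(stream[7:10])   # ['qcif-Pt7', 'S1', 'cif-Pt9']
```

The delay to the next boundary is what bounds the switching latency to one S-frame period, matching the 8-frame spacing assumed in the simulations.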

6. Comparison of coding performance

Table 1 summarizes the coding performance of the three approaches. Note that the scalable codec can encode both layers at a lower rate than H.263 requires for the CIF layer alone. For the S-frame architecture, the given rate does not include any S-frames, since those are only transmitted during a switch; the rate needed to store the bit-streams including all S-frames is given in brackets. The results indicate that the S-frame architecture is advantageous if the network bit-rate, rather than storage capacity, is the limiting factor. The peak in bit-rate may be a problem for packet networks and may require buffering, as described below.

              rate QCIF    rate QCIF+CIF    PSNR QCIF   PSNR CIF
              [kbps]       [kbps]           [dB]        [dB]
  Simulcast   70.1         509.8            34.2        34.3
  Scalable    66.0         418.8            34.3        34.3
  S-frame     70.1         231.3 (635.9)    34.2        33.9

Table 1. Summary of coding performance.
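The gap between the S-frame transmission and storage rates follows directly from one S-frame being stored every 8 frames but transmitted only on a switch. A back-of-the-envelope check on the Table 1 numbers (the helper function is illustrative, not from the paper):

```python
def implied_s_frame_kbit(storage_kbps, transmission_kbps, period_frames, fps):
    """Average stored S-frame size implied by the storage/transmission gap."""
    s_frames_per_second = fps / period_frames
    return (storage_kbps - transmission_kbps) / s_frames_per_second

# S-frame row of Table 1: 635.9 kbps storage vs. 231.3 kbps transmission,
# one S-frame stored every 8 frames at 30 fps.
size = implied_s_frame_kbit(635.9, 231.3, 8, 30)
print(round(size, 1))   # average stored S-frame size in kbit
```

The implied size of roughly 108 kbit per S-frame is consistent with the earlier statement that an S-frame requires a higher bit-rate than an I-frame (about 80 kbit in the simulcast CIF layer).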

Figure 8. Comparison of buffer levels for the transmission over a constant bit-rate channel. (Three panels, one per approach: Simulcast, Scalable, and S-Frame; each shows the buffer level [bit] over time [1/30 s], with QCIF-only and QCIF + CIF transmission periods marked.)

In addition to the above results, we have simulated transmission over a constant bit-rate channel. The channel bit-rate was chosen between the low and high video bit-rates, such that switching between the two layers is necessary to avoid buffer overflow and underflow. The simulations are based on a channel bit-rate of 200 kbps and a buffer size of 300 kbit. To simulate the transmission of a longer sequence, the encoded frames were transmitted repeatedly; I-frames are therefore transmitted every 96 frames for all approaches. Fig. 8 shows the simulation results. Because the performance of all approaches in the QCIF layer is very similar, the time needed to empty the buffer after a buffer overflow is about the same in all three simulations (white areas). After a buffer underflow, the high rate is transmitted, and CIF frames can be displayed as long as the increased rate does not cause a buffer overflow. This time period (gray areas) is longest for the S-frame architecture and shortest for simulcasting.
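The channel simulation can be sketched as a simple leaky-bucket model: the buffer fills with the bits of each transmitted frame and drains at the channel rate, and the server falls back to the low quality layer on (near) overflow and returns to the high quality layer on underflow. The constant per-frame sizes below are illustrative assumptions, not the paper's measured data:

```python
def simulate(frame_bits_low, frame_bits_high, channel_kbps, buffer_kbit,
             fps=30.0):
    """Leaky-bucket sketch: switch layers to avoid buffer over-/underflow.

    frame_bits_low / frame_bits_high: per-frame sizes (kbit) per layer.
    Returns the fraction of frames sent from the high quality layer.
    """
    drain_per_frame = channel_kbps / fps      # kbit removed per frame slot
    level = 0.0
    high = False                              # start with the low quality layer
    high_frames = 0
    n = min(len(frame_bits_low), len(frame_bits_high))
    for i in range(n):
        if high:
            high_frames += 1
            level += frame_bits_high[i]
        else:
            level += frame_bits_low[i]
        level = max(0.0, level - drain_per_frame)
        if level >= buffer_kbit:              # buffer full: fall back to low
            high = False
        elif level == 0.0:                    # buffer drained: go to high
            high = True
    return high_frames / n

# Hypothetical constant frame sizes: low layer 2.3 kbit/frame (~70 kbps),
# high layer 14 kbit/frame (~420 kbps), 200 kbps channel, 300 kbit buffer.
frac = simulate([2.3] * 900, [14.0] * 900, 200.0, 300.0)
print(round(frac, 2))
```

With these numbers the channel rate lies between the two layer rates, so the simulated server oscillates between the layers just as in Fig. 8, spending roughly a third of the time in the high quality layer.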

7. Conclusions

The heterogeneity of the Internet remains a great obstacle for providing reliable video services. Scalable video codecs can substantially improve the quality of video services for many applications. The three approaches to scalable video coding compared in this paper show very similar performance in the QCIF layer, where only temporal prediction is used. In the CIF layer, the proposed scalable codec performs significantly better than H.263 simulcast: both layers can be encoded at a lower rate than H.263 requires for the CIF layer alone. This gain is mainly achieved by using spatially predicted frames instead of I-frames to allow switching from low to high quality. The third approach is limited to point-to-point communication scenarios that offer a low-delay feedback channel. A switch from low to high quality is accomplished by inserting specially encoded switch-frames, which cause a high peak in bit-rate. After the switch, however, the approach offers excellent streaming performance. The S-frame architecture is fully standard-compatible but requires increased storage capacity.

References

[1] J. H. Conway and N. J. A. Sloane. Fast quantizing and decoding algorithms for lattice quantizers and codes. IEEE Trans. on Information Theory, IT-28(2):227–232, Mar. 1982.
[2] J. H. Conway and N. J. A. Sloane. A fast encoding method for lattice codes and quantizers. IEEE Trans. on Information Theory, IT-29(6):820–824, Nov. 1983.
[3] J. H. Conway and N. J. A. Sloane. Sphere Packings, Lattices and Groups. Springer, 1988.
[4] N. Färber and B. Girod. Robust H.263 compatible video transmission for mobile access to video servers. In Proc. ICIP'97, Santa Barbara, USA, Oct. 1997.
[5] B. Girod. Scalable video for multimedia systems. Computers & Graphics, 17(3):269–276, 1993.
[6] B. Girod, F. Hartung, and U. Horn. Subband image coding. In A. N. Akansu and M. J. Smith, editors, Subband and Wavelet Transforms, chapter 7, pages 213–250. Kluwer Academic Publishers, Norwell, MA, 1995.
[7] B. Girod, U. Horn, and B. Belzer. Scalable video coding with multiscale motion compensation and unequal error protection. In Y. Wang, S. Panwar, S.-P. Kim, and H. L. Bertoni, editors, Multimedia Communications and Video Coding, pages 475–482. Plenum Press, New York, Oct. 1996.
[8] U. Horn and B. Girod. Performance analysis of multiscale motion compensation techniques in pyramid coders. In Proc. ICIP'96, volume III, pages 255–258, 1996.
[9] U. Horn, B. Girod, and B. Belzer. Scalable video coding for multimedia applications and robust transmission over wireless channels. In 7th International Workshop on Packet Video, pages 43–48, Brisbane, Mar. 1996.
[10] U. Horn and J. Weigert. Netscape pyramid decoder plugin. www4.informatik.uni-erlangen.de/Projects/netclip/, Mar. 1997.
[11] A. N. Netravali and B. G. Haskell. Digital Pictures. Plenum Press, New York, 1988.
[12] M. Uz, M. Vetterli, and D. LeGall. Interpolative multiresolution coding of advanced television with compatible subchannels. IEEE Trans. on Circuits and Systems for Video Technology, 1(1):86–99, Mar. 1991.
[13] M. Vetterli and K. Uz. Multiresolution coding techniques for digital television: A review. Multidimensional Systems and Signal Processing, 3:161–187, 1992.
[14] L. Zhang, S. Deering, D. Estrin, S. Shenker, and D. Zappala. RSVP: A new resource reservation protocol. IEEE Network, 7(5):8–18, Sep. 1993.