TEMPORAL SCALABILITY USING P-PICTURES FOR LOW-LATENCY APPLICATIONS

Stephan Wenger
stewe@cs.tu-berlin.de
Technische Universität Berlin

Sekr. FR 6-3, Franklinstr. 28-29, D-10587 Berlin, Germany
Tel.: +1-415-297-2063, Fax: +49-30-314-25156

Abstract: Many of the newer video coding/compression standards, including MPEG-4 and H.263+, support some form of temporal scalability as part of their layered codec concept. This is generally realized by using bi-directionally predicted pictures (B-pictures), which use both an earlier and a subsequent P-picture as reference (anchor) pictures. This paper introduces a new form of temporal scalability employing only P-pictures. This mechanism results in improved real-time behavior (particularly lower latency) and a more flexible layering structure, at the cost of less efficient coding. The P-picture based temporal scalability mechanism is particularly useful for interactive multimedia communication on networks that offer several independent transport streams (possibly with different quality of service) but have sub-optimal real-time characteristics. Typical applications include both the Internet and some forms of mobile communication. The 1998 version of H.263 (known as H.263+ in both academia and industry) offers a mechanism to support P-picture scalability within the bit stream through the Reference Picture Selection mode [1] [2]. Other video coding standards, including those of the MPEG family, require slight modifications to the decoder in order to support the proposed mechanism.

INTRODUCTION

The concept of a layered codec is well known [3] [4], and recent video coding standards have incorporated support for such mechanisms. A layered video codec produces a representation of the video sequence at various resolutions and qualities. This allows for compatible and robust representations. A particular representation can be obtained by decoding a particular subset of the coded data. Typically, each representation is referred to as a layer. In a hierarchical scalable coding scheme, each layer is coded with respect to lower layer representations. These layers can then be transmitted in separate data streams. The base layer is the only self-contained representation of the sequence and is usually of low quality, low resolution, and low bit rate. Enhancement layers code the difference information, or error, between the base layer and a higher quality or resolution representation of the video sequence. Three forms of scalability are widely used:

• Temporal enhancement layers increase the frame rate of the reproduced video by coding additional pictures.
• Spatial enhancement layers raise the spatial resolution of the coded picture by coding difference information between an up-scaled reference layer picture and a higher resolution representation of the original picture. Decoding the base layer produces a low spatial resolution representation, while decoding both the base and spatial enhancement layers produces a higher resolution representation.
• Signal-to-noise-ratio enhancement layers (SNR scalability) allow the transmission of additional precision of DCT coefficients (and/or motion vectors), leading to a higher quality reproduced image and consequently to an increase of the SNR of that reproduced image.

All of these enhancement layer mechanisms can be used in combination, although some restrictions exist in the current standards. In H.263+, for example, it is possible to use a spatial enhancement layer to increase the spatial resolution of the representation. Using this higher resolution representation, temporal scalability can be employed to produce a higher frame rate. However, it is not permitted to use the pictures of a temporal enhancement layer as a reference for subsequent SNR or spatial enhancement layers. This paper focuses on temporal scalability only. It introduces a new form of temporal scalability employing P-pictures. The following section describes this P-picture scalability and distinguishes it from the usual B-picture scalability, using H.263+ as an example. Simulation results for both are presented next. The final section describes the applicability of P-picture scalability to other video coding standards implementing inter-picture prediction.
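As a rough illustration of these layering rules, the sketch below models each layer together with the layer it predicts from and checks the H.263+ restriction that a temporal (B-picture) enhancement layer may not serve as the reference for a subsequent SNR or spatial enhancement layer. The Layer class and the rule encoding are hypothetical simplifications, not part of any standard or codec API.

```python
# Minimal sketch (hypothetical data model) of hierarchical layering rules.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Layer:
    kind: str                             # "base", "temporal", "spatial", or "snr"
    reference: Optional["Layer"] = None   # the lower layer this layer predicts from

def is_valid_h263plus_stack(layer: Layer) -> bool:
    """Walk down the layer hierarchy and reject stacks where an SNR or spatial
    enhancement layer uses a temporal (B-picture) layer as its reference."""
    while layer.reference is not None:
        if layer.kind in ("snr", "spatial") and layer.reference.kind == "temporal":
            return False
        layer = layer.reference
    return True

base = Layer("base")
spatial = Layer("spatial", base)        # higher resolution on top of the base layer
temporal = Layer("temporal", spatial)   # higher frame rate on top of that
print(is_valid_h263plus_stack(temporal))                # True: allowed combination
print(is_valid_h263plus_stack(Layer("snr", temporal)))  # False: B-pictures may not be referenced
```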

THE CONCEPT OF P-PICTURE BASED TEMPORAL SCALABILITY

When implementing a layered codec based on H.263+, the common solution is to employ H.263+'s Temporal, Spatial, and SNR Scalability optional mode (Annex O). In this mode, temporal scalability is realized through bi-directionally predicted pictures (B-pictures). B-pictures use the previous and the subsequent P-picture (or I-picture) as their anchor pictures. Figure 1 illustrates a base layer that is decodable at 10 fps and a B-picture enhancement layer that is decodable at 30 fps.

Figure 1: Base and temporal enhancement layer (base layer at 10 fps, B-picture enhancement layer at 30 fps).
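The following sketch (a hypothetical helper, not part of any codec API) enumerates the prediction structure of Figure 1 for a 10 fps base layer and a 30 fps B-picture enhancement layer: each B-picture references the base-layer pictures immediately before and after it, so it cannot be decoded until the later anchor has been received.

```python
# Sketch of the B-picture anchor structure of Figure 1
# (base layer at 10 fps, B-picture enhancement layer at 30 fps; hypothetical helper).
def b_picture_anchors(display_index: int, ratio: int = 3):
    """Return the pair of anchor indices for a B-picture, or None for a base-layer picture.
    Base-layer pictures occupy every `ratio`-th display position (every third picture here)."""
    if display_index % ratio == 0:
        return None  # base-layer P-picture: predicted from the previous base-layer picture
    prev_anchor = (display_index // ratio) * ratio
    return (prev_anchor, prev_anchor + ratio)

for idx in range(7):
    anchors = b_picture_anchors(idx)
    print(idx, "P (base)" if anchors is None else f"B, anchors {anchors}")
# The B-picture at index 1 cannot be decoded before base-layer picture 3 has been
# received and decoded, which is the source of the added latency discussed in the text.
```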

As stated above, H.263+ does not permit the B-pictures of the enhancement layer to be used as reference pictures for subsequent layers. Furthermore, with this mechanism it is necessary to decode two base layer pictures before it is possible to decode the B-pictures. Therefore, the overall latency of the base/enhancement layer combination and of the base layer alone are similar, although bits were spent in the enhancement layer to increase temporal resolution. The advantage of temporal scalability employing B-pictures is the comparably small size of B-pictures, leading to a high coding efficiency in the enhancement layer. This is in part due to the enhanced (bi-directional) prediction mechanisms for B-pictures. Additionally, a higher quantization step size is often used for the enhancement layer B-pictures compared to the base-layer P-pictures. This is possible because of the restriction that B-pictures not be used as references whatsoever. The result is high coding efficiency for the enhancement layer B-pictures and little noticeable impact on the reproduced video quality.

The proposed P-picture based temporal scalability mechanism is illustrated in Figure 2. The base layer is a usual sequence of P- or I-pictures, as in the example above. The enhancement layer, however, consists only of P-pictures. In addition, the enhancement layer now uses previous pictures of the enhancement layer, as well as base layer pictures, as reference pictures. Two scenarios are shown:

• a base layer that is decodable at 15 fps with an enhancement layer that is decodable at 30 fps, and
• a base layer that is decodable at 10 fps with an enhancement layer that is decodable at 30 fps.

Note that the proposed mechanism can also employ a higher quantizer step size in the enhancement layer P-pictures to increase coding efficiency with little noticeable impact. This is a particularly useful technique for the last picture of each enhancement layer sub-sequence.


Figure 2: Principle of P-picture temporal scalability.

Employing P-pictures for temporal scalability is not an explicitly defined optional mode in H.263+, or in any of the other popular video coding standards. However, in H.263+ it is possible to use the Reference Picture Selection optional mode (Annex N) without a back channel to generate a standard-compliant bit stream containing scalable P-picture information. This optional mode allows, for each predicted picture, one of several recently transmitted pictures to be selected for use as the reference picture, as opposed to simply allowing the most recent picture to be used as the reference picture. This is signaled in the picture header by coding the temporal reference of the selected reference picture. An H.263+ decoder supporting this optional mode must store several previous reference pictures along with their associated temporal references.

The proposed P-picture based temporal scalability mechanism provides several advantages over the traditional B-picture based temporal scalability mechanism. First, the latency is reduced when spending bits on the enhancement layer. Since a P-picture can be decoded without backward prediction, the enhancement layer P-pictures can be decoded as soon as they are transmitted. At an enhancement layer frame rate of 30 fps and a base layer frame rate of 10 fps, the overall latency can be 66 milliseconds lower than with B-picture scalability (33 ms instead of 100 ms). Second, P-frames can be used as reference frames for subsequent enhancement layers. This provides more flexibility in the design of a layered codec. Third, error resilience is improved, as it is when using B-picture based temporal scalability: any damage to an enhancement layer P-picture will not affect prediction starting from the next reference layer P-picture. The main disadvantage is the inferior coding efficiency of P-pictures relative to B-pictures. This reduced coding efficiency in the enhancement layer results in a slight increase in bit rate.
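As a rough sketch of this reference structure (hypothetical helper function; H.263+ signals the actual choice by coding the temporal reference of the selected picture in the picture header), the function below assigns each picture of the 10/30 fps structure of Figure 2 its reference: base-layer pictures reference only the previous base-layer picture, while enhancement-layer P-pictures reference the most recently coded picture of either layer.

```python
# Sketch of reference selection for P-picture temporal scalability (Figure 2, second
# scenario: base layer at 10 fps, enhancement layer at 30 fps; hypothetical helper).
def reference_index(display_index: int, ratio: int = 3):
    """Return the display index of the reference picture, or None for the first (intra) picture."""
    if display_index == 0:
        return None                      # first picture of the sequence is intra coded
    if display_index % ratio == 0:
        return display_index - ratio     # base layer: previous base-layer P-picture
    return display_index - 1             # enhancement layer: most recently coded picture

for idx in range(7):
    layer = "base" if idx % 3 == 0 else "enh"
    print(f"picture {idx:2d} ({layer}) -> reference {reference_index(idx)}")
# All references lie in the past, so an enhancement-layer P-picture can be decoded as
# soon as it arrives (roughly one 30 fps picture interval, about 33 ms, instead of
# waiting for a future base-layer anchor, about 100 ms at a 10 fps base rate).
```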

SIMULATION RESULTS

Using the University of British Columbia's H.263+ reference codec and our implementation of the Reference Picture Selection optional mode (Annex N), both B-picture based temporal scalability (using Annex O) and P-picture based temporal scalability (using Annex N) are simulated. A base layer decodable at 10 fps and an enhancement layer decodable at 30 fps are generated. Fixed quantizer step sizes of 13 and 16 are used for the base and enhancement layers, respectively. The input sequences Foreman, Coastguard, and Paris, at QCIF resolution, are employed. Several coding-efficiency oriented optional modes are used: Advanced Prediction mode (Annex F), Advanced Intra Coding mode (Annex I), Deblocking Filter mode (Annex J), and Modified Quantization mode (Annex T). This configuration produces representations at bit rates typical for Internet scenarios employing layered codecs. The resulting picture quality is far beyond what is achievable using current Internet applications (e.g., the popular Mbone tools).

The picture quality for B-picture based scalability and the proposed P-picture based scalability is comparable, with a PSNR difference of less than 0.1 dB. This result is expected, as the same quantizer values are used for both sets of simulations. Therefore, we compare the different layering mechanisms based on coding efficiency, using the data in Table 1. The first column shows the bit rate for a non-layered representation decodable at 10 fps. The second column shows the bit rate for a non-layered representation decodable at 30 fps; this bit rate can be considered the optimum for full temporal resolution (i.e., 30 fps) P-picture based coding. The third and fourth columns show the bit rate of the enhancement layers for layered representations decodable at 30 fps using B-picture based temporal scalability and P-picture based temporal scalability, respectively. The fifth and sixth columns show the total bit rate of the base and enhancement layers combined for layered representations decodable at 30 fps using B-picture based temporal scalability and P-picture based temporal scalability, respectively. This data provides an indication of the network load imposed by the different mechanisms. Of particular interest are columns 4 and 6, which illustrate the bit rate overhead of P-picture based temporal scalability as compared to B-picture based temporal scalability.

[Table 1: per-sequence bit rates in kbit/s; only fragments of the entries are legible: 83.9, 119.1, 73.7, 79.2, 86.4, 89.5 kbit/s and relative overheads of 9%, 21%, and 42%.]

Picture types: p: base layer P, P: enhancement layer P, B: enhancement layer B.
Table 1: Performance of the P-picture temporal scalability scheme relative to B-picture scalability.

The data shows a sequence-dependent total bit rate increase of 9% to 42% when P-picture based temporal scalability is employed, as compared to B-picture based temporal scalability. The associated enhancement layer bit rate increase is 38% to 139%. For many applications, this increase in bit rate is tolerable, given that the benefits of a scalable bit stream are provided with significantly reduced latency using the proposed P-picture based temporal scalability mechanism.
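For reference, the relative overheads quoted above follow directly from the bit rates in the corresponding columns of Table 1; the sketch below shows the computation with placeholder bit rates, not the measured values of the paper.

```python
# How the relative overheads of Table 1 (columns 4 and 6) are derived from the bit rates.
# The kbit/s values below are illustrative placeholders, not the measured data of the paper.
b_enh, p_enh = 30.0, 45.0        # enhancement layer bit rate: B-picture vs. P-picture scheme
b_total, p_total = 90.0, 105.0   # base + enhancement layer bit rate: B- vs. P-picture scheme

enh_overhead = (p_enh - b_enh) / b_enh * 100.0          # enhancement layer overhead in percent
total_overhead = (p_total - b_total) / b_total * 100.0  # total bit rate overhead in percent
print(f"enhancement layer overhead: {enh_overhead:.0f}%")   # 50% with these placeholder values
print(f"total bit rate overhead:    {total_overhead:.0f}%")  # 17% with these placeholder values
```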

USING P-PICTURE BASED TEMPORAL SCALABILITY OUTSIDE OF H.263+

H.263+ offers a mechanism for P-picture based temporal scalability through the Reference Picture Selection optional mode. The transport layer can then create a single P-picture based scalable bit stream from multiple transport streams (each of them conveying one layer) by using the temporal reference as the ordering criterion and concatenating the coded pictures. Unfortunately, other current video coding standards lack such a mechanism. This section provides a brief overview of how to circumvent such limitations and implement a P-picture based temporal scalability mechanism in a non-H.263+ framework. In particular, the MPEG family of ISO/IEC standards and ITU-T H.261 and H.263 version 1 should be mentioned. Using these standards, it is impossible to generate a standard-compliant data stream that combines a base layer and a P-picture based temporal scalability layer without performing a complete decoding and re-encoding of the video sequence. However, it is possible to modify a decoder such that not only the most recently decoded P-picture is stored, but several recently decoded P-pictures are stored in a queue. The decoder can then use these stored P-pictures in the following manner (a sketch of such a decoder loop is given at the end of this section):

• If the current picture belongs to the base layer, i.e. if it was conveyed over a base layer virtual transport channel, then use the most recently decoded base layer P-picture as a reference. This picture can easily be flagged in the decoded P-picture queue.
• If the current incoming picture belongs to the P-picture enhancement layer, i.e. if it was conveyed over an enhancement layer virtual transport channel, then use the most recently decoded P-picture (from any available channel) as the reference.

The corresponding encoder operates in a similar manner:

• The base layer is coded in the usual manner, performing inter-picture prediction based on reference pictures from the low temporal resolution representation only.
• The enhancement layer is coded using the most recently coded P-picture as a reference.

We have implemented such a mechanism for a real-time commercial H.261 codec. The results are comparable to those presented above for the H.263+ case.
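A minimal sketch of such a modified decoder loop is given below, assuming the transport layer tells the decoder which virtual channel (base or enhancement) each coded picture arrived on; the class and method names are hypothetical, and the actual bitstream parsing and motion-compensated decoding are omitted.

```python
# Sketch of a decoder-side reference queue for P-picture temporal scalability
# outside of H.263+ (hypothetical structure; real decoding steps omitted).
from collections import deque

class LayeredPDecoder:
    def __init__(self, queue_len: int = 4):
        # Recently decoded pictures, newest last; each entry is (picture, is_base_layer).
        self.queue = deque(maxlen=queue_len)

    def select_reference(self, is_base_layer: bool):
        if is_base_layer:
            # Base layer: most recently decoded *base-layer* picture (flagged in the queue).
            for picture, from_base in reversed(self.queue):
                if from_base:
                    return picture
            return None  # first picture of the sequence: intra coded, no reference
        # Enhancement layer: most recently decoded picture from any channel.
        return self.queue[-1][0] if self.queue else None

    def decode(self, coded_picture, is_base_layer: bool):
        reference = self.select_reference(is_base_layer)
        # Placeholder for the actual inter-picture decoding against `reference`.
        decoded = {"data": coded_picture, "predicted_from": reference}
        self.queue.append((decoded, is_base_layer))
        return decoded
```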

REFERENCES

[1] ITU-T Recommendation H.263 (1998): Video Coding for Low Bit Rate Communication, ITU-T, 1998.

[2] G. Cote, B. Erol, M. Gallant, F. Kossentini, "H.263+: Video Coding for Low Bit Rates", IEEE Transactions on Circuits and Systems for Video Technology, 9/1998.

[3] M. K. Uz, M. Vetterli, and D. J. LeGall, "Interpolative Multiresolution Coding of Advanced Television with Compatible Subchannels", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 1, pp. 86-99, March 1991.

[4] K. Ramchandran, A. Ortega, M. K. Uz, M. Vetterli, "Multiresolution Broadcast for Digital HDTV using Joint Source-Channel Coding", IEEE Journal on Selected Areas in Communications, Special issue on HDTV and Digital Video Communications, Vol. 11, pp. 6-23, 1993.
