A RATE CONTROL METHOD FOR H.263 TEMPORAL SCALABILITY

Faisal Ishtiaq and Aggelos K. Katsaggelos
Department of Electrical and Computer Engineering
Northwestern University, Evanston, IL 60208, USA
ffaisal,[email protected]

This work was supported in part by a grant from the Motorola Center for Communications at Northwestern University.

ABSTRACT

In this paper a novel methodology for temporal scalability within the H.263 video coding standard is presented. Temporal scalability is defined as the transmission of previously dropped frames in the form of a scalable enhancement layer to increase the overall encoded frame rate. The proposed methodology extends the base layer rate control to the enhancement layer and incorporates an adaptive technique for enhancement layer frame selection. Three criteria important in the selection of enhancement frames are identified, allowing us to adaptively choose frames such that the overall temporal resolution of the encoded video sequence is enhanced. Experimental results are provided and compared to a non-adaptive technique where enhancement frames are selected based solely on the decay of the enhancement layer buffer.

1. INTRODUCTION

In Low and Very Low Bitrate (VLBR) video coding a crucial step towards achieving the desired bitrate is frame dropping. This is important for an effective bit allocation strategy where, by encoding only a subset of the source frames, acceptable visual quality can be achieved and maintained. In VLBR coded video this typically leads to an original video source at 30 frames per second (fps) being encoded at a frame rate of 5-10 fps. This decline in the encoded frame rate results in a loss of temporal resolution, making the encoded sequence appear less smooth, jerky, and visibly annoying. Furthermore, some frames containing objects not present in others may be among the dropped frames, resulting in the total loss of those objects in the encoded sequence. To overcome this loss in temporal resolution, video coding standards such as H.263 [1] and MPEG-4 [2] provide for an enhancement layer consisting of those frames previously dropped.

This is referred to as temporal scalability. In this paper we detail an approach to adaptively select frames for the enhancement layer.

Scalability is a mechanism for providing varying levels of video quality within a single bitstream. In doing so, it eliminates the need to re-encode and store compressed video sequences at different quality levels. By using scalability, a single encoded bitstream can provide several levels of quality that can be extracted. Scalability works by adding enhancement layers to a lower rate base layer; as more enhancement layers are combined, the video quality improves. Furthermore, because there is no need to re-encode the source for different rates or to store different versions of the same sequence, both computational resources and storage space are saved. The enhancement in quality can be in the form of increased signal-to-noise ratio (SNR), temporal continuity, and/or spatial resolution. Scalability used to enhance the SNR quality of a frame is referred to as SNR scalability. Temporal scalability refers to scalability designed to increase the temporal resolution of a video sequence by increasing the encoded frame rate. Finally, the term spatial scalability addresses scalability that enhances the spatial resolution, or dimensions, of a frame.

Within the scope of the video coding standards, MPEG-2 [3], MPEG-4, and H.263 all support one or more of the above forms of scalability. The two most recent standards, H.263 and MPEG-4, support all three forms of scalability and define the syntax such that combinations of the three can be used. In a scalable bitstream, different enhancement layers can offer different types of scalability, or different types of scalability can be merged into a single layer. Investigations into SNR, temporal, and a hybrid form of SNR-temporal scalability can be found in [4]. In this work we concentrate on temporal scalability within the H.263 video coding standard. Investigation into temporal scalability within the MPEG-2 standard at high bitrates can be found in [5].

Figure 1: Overview of temporal scalability. B frames in the enhancement layer are temporally located between P/I and P frames in the base layer.

2. TEMPORAL SCALABILITY

Temporal scalability is defined within the H.263 standard as the encoding of bidirectionally predicted (B) frames in an enhancement layer, temporally located between two previously encoded base layer frames, as shown in Figure 1. By increasing the frame rate using a scalable enhancement layer rather than encoding the sequence in a non-scalable fashion at a higher frame rate, a substantial amount of functionality arising from scalability is gained. Besides being well suited for video delivery over heterogeneous networks, scalability provides a mechanism for delivering video at a quality level matching the bandwidth available to the user. It avoids the need for encoding and storing multiple copies of a sequence at different rates, while also providing a natural platform for unequal error protection (UEP), where a baseline level of quality may be guaranteed in error-prone environments.

3. METHODOLOGY

The two essential areas of research in temporal scalability are the rate control issues associated with an enhancement layer and the challenging problem of selecting the B frames to be encoded. In this work we focus primarily on the latter issue, incorporating the decision process into a modified TMN7 [4] rate control. In developing the enhancement layer, our goal is to place the B frames adaptively, in the locations where they will contribute most to the temporal continuity. Thus, we have defined three fundamental decision criteria that are pertinent in making an intelligent choice for the B frame. They are, in order of importance:

- Bitrate Adherence: Foremost, the enhancement layer must be regulated such that the bit budget for the layer is met. This is an important criterion since the number of encoded B frames also dictates the number of bits spent.

- Motion: The location where a B frame will serve to enhance the temporal continuity the most is between two base layer frames with high motion. Conversely, if there is little motion between two base layer frames, the propensity to place a B frame there is tempered. For these reasons, motion is a criterion that subjectively enhances the overall temporal flow of the sequence and is given high preference.

- Frame Separation: Since the temporal distance between two encoded frames is not constant over the encoded sequence, temporal separation must be taken into account when deciding on the location of a B frame. If the temporal distance between two base layer frames is high, then the likelihood of placing a B frame to smooth the temporal flow should also be high.

Because each of the factors above has a different bearing on the placement of the B frame, they must be weighted according to their importance. Furthermore, models for the nature of their individual contributions towards the decision have to be developed. Let P = [p_0, p_1, ..., p_{N-1}] be the enumerated set of I and P frames that have been encoded in the base layer. By default p_0 is the anchor, or Intra, frame. Then, given frame i, define by p_p(i) and p_f(i) the encoded I or P base layer frames immediately preceding and succeeding frame i. By definition, p_p(i), p_f(i) ∈ [p_0, p_1, ..., p_{N-1}]. Furthermore, define b_p(i) as the previously encoded enhancement layer B frame given frame i. Based on the above considerations we introduce the following frame selection function F(i):

    F(i) = α F_B(i) + β F_M(p_p(i), p_f(i)) + γ F_S(p_p(i), p_f(i)) − T_B(b_p(i)).    (1)

F(i) can be interpreted as the probability that frame i will be encoded as a B frame, while F_B(i), F_M(p_p(i), p_f(i)), and F_S(p_p(i), p_f(i)) are the individual probability functions that i will be encoded as a B frame based on the bitrate, the motion vectors, and the frame separation, respectively. T_B(b_p(i)) serves as a penalty term used in conjunction with the buffer to ensure that the bitrate is regulated. Finally, α, β, and γ are scalar weights, since not all of the factors contribute equally to the selection process. Thus the decision to code the current frame, i, as a B frame is made if F(i) exceeds a given threshold value F_TH, that is,

    if F(i) ≥ F_TH, encode frame i as a B frame.    (2)

In this methodology each of the functions contributes to the decision process based upon its characteristics. These characteristics are defined by the service function models for motion, frame separation, and bitrate regulation.
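To make the decision rule concrete, the following minimal sketch shows how Eqs. (1) and (2) might be evaluated for a candidate frame. The function and parameter names are our own illustration, not part of H.263 or the TMN7 software, and the service functions are assumed to return values in [0, 1] as described above.

```python
def frame_selection_score(f_b, f_m, f_s, t_b, alpha, beta, gamma):
    """Weighted frame selection function F(i) of Eq. (1).

    f_b, f_m, f_s : service function values for bitrate, motion, and frame
                    separation, each assumed to lie in [0, 1].
    t_b           : buffer penalty term T_B(b_p(i)).
    """
    return alpha * f_b + beta * f_m + gamma * f_s - t_b


def encode_as_b_frame(f_b, f_m, f_s, t_b, alpha, beta, gamma, f_th):
    """Decision rule of Eq. (2): code frame i as a B frame if F(i) >= F_TH."""
    return frame_selection_score(f_b, f_m, f_s, t_b, alpha, beta, gamma) >= f_th
```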

3.1. Motion

Let mv_{x,j} and mv_{y,j} be the horizontal and vertical motion vector components of macroblock j of frame p_n. The average magnitude of the motion between frames p_{n−1} and p_n is then defined as

    mv̄_{p_n} = (1/J) Σ_{j=1}^{J} √(mv_{x,j}² + mv_{y,j}²).    (3)

If the average magnitude is small, the probability of the current frame being coded should be low, since it can be inferred that the motion between the encompassing base layer P frames is small. On the other hand, if mv̄_{p_n} is large, then significant motion has taken place and the likelihood that a B frame is to be placed here should be high. In areas where the average motion is moderate, a graduated scale for the probability of a B frame placement can be used. Using this information, a model for the probability of a B frame being encoded given a certain mv̄_{p_n} is constructed. In the initial development of a probability model, F_M(p_p(i), p_f(i)) is a linearly increasing function bounded by minimum and maximum motion values, as shown in Figure 2. This model, although simple, reflects the general nature of the relationship between the average motion and B frame position.

Figure 2: Motion service function. The probability F_M(p_p(i), p_f(i)) increases linearly from 0 to 1.0 as the motion vector magnitude mv̄_{p_n} goes from a to b.
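A small sketch of the motion criterion, under the linear-ramp assumption of Figure 2: the average motion vector magnitude of Eq. (3) is computed from per-macroblock motion vectors and then mapped to a probability between the bounds a and b. The function names are illustrative only.

```python
import math

def average_mv_magnitude(motion_vectors):
    """Eq. (3): average magnitude of the macroblock motion vectors of frame p_n.

    motion_vectors: iterable of (mv_x, mv_y) pairs, one per macroblock.
    """
    mvs = list(motion_vectors)
    return sum(math.hypot(mv_x, mv_y) for mv_x, mv_y in mvs) / len(mvs)


def motion_service_function(avg_mv, a, b):
    """Linear model of Figure 2: 0 below a, 1.0 above b, linear ramp in between."""
    if avg_mv <= a:
        return 0.0
    if avg_mv >= b:
        return 1.0
    return (avg_mv - a) / (b - a)
```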

3.2. Frame Separation

A subtle factor considered in the frame selection mechanism is frame separation. While frame separation is not as important as motion or bitrate adherence, it does warrant input into the B frame selection process. The decoder in any video compression algorithm takes the encoded frames, decodes them, and displays them according to their timestamp values. However, since the display is updated at the source frame rate and the encoded frames are only available at irregularly spaced intervals, the decoder holds the previous frame until the next decoded frame is available. Any sizeable temporal difference between two encoded frames will cause the decoder to display frames in a "snapshot" manner, where a frame is displayed for a period of time before it is updated. The model used for F_S(p_p(i), p_f(i)) is similar to the motion vector model and is shown in Figure 3. It is a linear model that takes into account how long a frame can be held at the decoder before an update becomes necessary.
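Under the same linear-ramp assumption, the frame separation service function of Figure 3 can be sketched analogously, with c and d the frame-distance bounds; the function name and the use of a simple frame count as the separation measure are our assumptions.

```python
def frame_separation_service_function(separation, c, d):
    """Linear model of Figure 3: the probability of placing a B frame grows with
    the temporal distance (in source frame intervals) between p_p(i) and p_f(i)."""
    if separation <= c:
        return 0.0
    if separation >= d:
        return 1.0
    return (separation - c) / (d - c)
```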

3.3. Bitrate Adherence

An important criterion for judging any rate control algorithm is its ability to achieve the specified bitrate. This is an aspect that must be considered by the frame selection mechanism. Since there is always a specified bit budget for the enhancement layer, the bitrate of the layer must be met regardless of the pressures exerted by the motion and frame separation functions. This is incorporated into the selection mechanism by the functions F_B(i) and T_B(b_p(i)), used in conjunction with the enhancement layer buffer. Let b_p(i) be the previously encoded B frame. We define by NDBF (Natural Buffer Decay Frame) the frame that would be encoded as the buffer decays naturally below the Frame Drop Threshold. The relationship between the NDBF, the buffer, and b_p(i) is shown in Figure 4, in which the bitrate adherence function F_B(i) and the penalty term T_B(i) are also shown. F_B(i) is a linearly increasing function beginning at b_p(i) and reaching 1.0 at the NDBF, while T_B(i) is defined as

    T_B(i) = 1 − F_B(i).    (4)

As frames are encoded, the penalty term is allowed to accumulate so that there is a memory of whether the previous B frame was placed before or after its NDBF. This bitrate service function and the penalty term are the basis for regulating the bitrate.
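A sketch of the bitrate adherence terms, assuming the frame indices of the previously encoded B frame b_p(i) and of the NDBF are available from the enhancement layer buffer model; the accumulation of the penalty across frames described above is not shown, and the names are illustrative.

```python
def bitrate_service_function(i, b_p, ndbf):
    """Linear model of Figure 4: F_B rises from 0 at the previously encoded
    B frame b_p(i) to 1.0 at the natural buffer decay frame (NDBF)."""
    if i <= b_p:
        return 0.0
    if i >= ndbf:
        return 1.0
    return (i - b_p) / (ndbf - b_p)


def bitrate_penalty(f_b):
    """Eq. (4): T_B(i) = 1 - F_B(i)."""
    return 1.0 - f_b
```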

Figure 3: Frame separation service function. The probability F_S(p_p(i), p_f(i)) increases linearly from 0 to 1.0 as the frame distance between p_p(i) and p_f(i) goes from c to d.

Figure 4: Bitrate service function and penalty term. Plotted against the enhancement layer buffer content, F_B(i) rises linearly from 0 at b_p(i) to 1.0 at the NDBF, where the buffer decays below the Frame Drop Threshold; T_B(b_p(i)) is the complementary penalty term.

4. RESULTS

Results using the temporal scalability B frame selection mechanism are shown in Tables 1(a)-(c). The results are for 10 seconds, 300 frames, of the "Foreman", "Mother-Daughter", and "Akiyo" sequences. In the results shown, the base layer is encoded at 48 Kbps with an enhancement layer (Temporal Scalability (TS) Layer) of 16 Kbps for a total rate of 64 Kbps. A modified TMN7 rate control algorithm is used to regulate the bitrate of the base layer, while in the enhancement layer the buffer is used together with the frame selection mechanism. The weights of the individual service functions were experimentally selected based upon results obtained with the "Foreman" sequence. In the results presented, the following values were used: α = 50, β = 40, γ = 10, and F_TH = 50. In the motion service function, the values of a and b were equal to 5 and 10, respectively, while for the frame separation service function the values of c and d were equal to 2 and 8, respectively. Although in the results presented the weights are kept constant, they can also be adjusted in response to the source content.

In Figure 5 the average motion vector magnitude between the base layer frames is plotted as a function of the frame number for the "Foreman" sequence. The locations of the P frames (+) and the B frames (diamonds) in the enhancement layer are marked. It can be seen that B frames are inserted in locations where there is high motion. B frames are also encoded much more efficiently than frames in the base layer, as can be seen in Table 1: for the "Foreman" sequence they provide a comparable PSNR while using only 36% of the bits used for a P frame. Using scalability we are able to increase the frame rate from 6.1 fps to 11.5 fps, roughly doubling the frame rate, with only 36% more bits.

In Figure 6 the positions of the encoded B frames (diamonds) are shown with respect to the average motion magnitude between the base layer frames for the "Foreman" sequence, using temporal scalability with the B frames placed based solely on the decay of the enhancement layer buffer. Here, the NDBF is always coded regardless of the motion or frame separation. The figure shows that the B frames are dispersed throughout the sequence more uniformly. This is in contrast to the results from the B frame selection function given in Figure 5. We note that in the results using only the buffer decay frames, in some instances the base and enhancement layer frames fall next to each other, as can be seen near frames 75 and 170 in Figure 6, where B frames are positioned immediately preceding and following the base layer frames, respectively. This is a case where the enhancement layer is not being utilized effectively. Intuitively, if two base layer frames are 6 time steps apart, the "best place" to add an enhancement frame is most likely closer to the midpoint rather than one time step away from either of the two base layer frames. The new frame selection mechanism takes this into account: as can be seen in Figure 5 near frames 75 and 170, the B frames are placed at the midpoint between the two base layer frames.
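As an illustration of how the reported parameters interact, the following plugs hypothetical per-frame measurements into the selection rule with the values quoted above (α = 50, β = 40, γ = 10, F_TH = 50, a = 5, b = 10, c = 2, d = 8); the motion, separation, and buffer values are invented for the example.

```python
alpha, beta, gamma, f_th = 50, 40, 10, 50   # weights and threshold from the text
a, b, c, d = 5, 10, 2, 8                    # service function bounds from the text

avg_mv, separation = 8.0, 6                 # hypothetical frame measurements
f_m = min(max((avg_mv - a) / (b - a), 0.0), 1.0)      # motion ramp      -> 0.60
f_s = min(max((separation - c) / (d - c), 0.0), 1.0)  # separation ramp  -> 0.67
f_b, t_b = 0.6, 0.4                         # assumed buffer terms F_B and T_B

score = alpha * f_b + beta * f_m + gamma * f_s - t_b  # F(i) of Eq. (1)
print(score, score >= f_th)                 # approx. 60.3, True -> encode a B frame
```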

Table 1: Temporal scalability results at a base layer of 48 Kbps and an enhancement layer (Temporal Scalability (TS) Layer) of 16 Kbps for the (a) "Foreman", (b) "Mother-Daughter", and (c) "Akiyo" sequences.

(a) "Foreman"
                       Base Layer (48 Kbps)   TS Layer (64 Kbps)
Average PSNR (dB)             32.88                 32.29
Avg. Bits per Frame            8353                  3059
Frames Encoded                   60                    52
Total Frames Encoded            112
Bitrate (Kbps)                49.29                 65.20

(b) "Mother-Daughter"
                       Base Layer (48 Kbps)   TS Layer (64 Kbps)
Average PSNR (dB)             39.08                 39.03
Avg. Bits per Frame            6544                  2059
Frames Encoded                   76                    78
Total Frames Encoded            154
Bitrate (Kbps)                48.77                 64.72

(c) "Akiyo"
                       Base Layer (48 Kbps)   TS Layer (64 Kbps)
Average PSNR (dB)             41.50                 41.37
Avg. Bits per Frame            5844                  1930
Frames Encoded                   85                    82
Total Frames Encoded            167
Bitrate (Kbps)                48.94                 64.71

Figure 5: Temporal scalability results for the "Foreman" sequence at 48-64 Kbps, showing B frame positions (diamonds) with respect to the average motion vector magnitude between two base layer frames, plotted against frame number. The positions of the base layer frames are indicated by (+).

Figure 6: Temporal scalability results for the "Foreman" sequence at 48-64 Kbps, showing B frame positions (diamonds) with respect to the average motion vector magnitude in the base layer, where the selection of the B frames is based solely on the enhancement layer buffer. The positions of the base layer frames are indicated by (+).

5. CONCLUSIONS

The results show that using temporal scalability we are able to roughly double the frame rate with comparable frame quality using only a third of the bits. More importantly, the enhancement layer frames are selected adaptively based on the motion, the frame separation, and the bitrate. This frame selection mechanism utilizes service function models for the characteristics of each of these criteria, and has been shown to place B frames adaptively in an intelligent manner. This is a much more effective method than encoding enhancement frames based solely on the buffer decay. The service functions can be adapted for certain classes of sequences or in response to certain conditions within a sequence. Furthermore, this mechanism requires minimal computation since all required data is readily available at the encoder.

6. REFERENCES

"Foreman" Temporal Scalability 48-64 Kbps 25

0

"Foreman" Temporal Scalability 48-64 KBps 25

Average Frame Motion Vector Magnitude

\Foreman" Base Layer(48Kbps) TS Layer(64Kbps) Average PSNR 32.88 32.29 Avg. Bits per Frame 8353 3059 Frames Encoded 60 52 Total Frames Encoded 112 Bitrate 49.29 65.20

250

300

Frame

Figure 5: Temporal scalability results showing B frame positions (}) with respect to the average motion vector magnitudes between two base layer frames. The positions of the base layer frames are indicated by (+).

[1] International Telecommunication Union, Recommendation H.263: Video coding for low bitrate communications, January 1998.
[2] ISO/IEC 14496-2 (MPEG-4), Information technology - Coding of audio-visual objects: Visual, October 1997.
[3] ISO/IEC 13818-2 (MPEG-2), Information technology - Generic coding of moving pictures and associated audio information: Video, 1995.
[4] F. Ishtiaq, H.263 Scalable Video Coding and Transmission at Very Low Bitrates, PhD thesis, Northwestern University, June 1999.
[5] H. Sun and W. Kwok, "MPEG video coding with temporal scalability," in International Communications Conference, vol. 2952, pp. 1742-1746. IEEE, 1995.
