Modeling of Loss-Distortion in Hierarchical Prediction Codecs

Hassan Mansour, Panos Nasiopoulos, Vikram Krishnamurthy
Department of Electrical and Computer Engineering, University of British Columbia
2356 Main Mall, Vancouver, BC, Canada V6T 1Z4
Email: {hassanm, panos, vikramk}@ece.ubc.ca

Abstract— Existing video distortion models used for hybrid video codecs do not match the hierarchical prediction structure employed in SVC. In this paper, we derive a new video distortion model that captures the sensitivity to frame losses of a hierarchical prediction video decoder using the Picture Copy (PC) concealment strategy of SVC. Existing frameworks only cover non-scalable coders and therefore fall short of accurately modeling scalable coders due to the additional parameters that come into play. Performance evaluations show that our model closely approximates the loss-distortion of hierarchical coders for video sequences with different levels of motion, with exceptional performance in sequences with homogeneous levels of motion.


Keywords— Loss-distortion modeling, SVC

I. INTRODUCTION

The rising demand for video over wireless networks has called for advanced algorithms that can ensure the safe delivery of the video payload to its destination. These algorithms take into consideration the video statistics, its sensitivity to losses, and the prevalent channel conditions. Therefore, loss-distortion modeling plays a significant role in video transmission applications where QoS guarantees in terms of distortion, rate allocation, and power consumption need to be met [1], [2]. One example is video streaming applications, in which a streaming server should be able to estimate the expected video distortion at the receiver in order to control the rate at which the video stream is transmitted [1].

A loss-distortion model reflects the sensitivity of a decoded video stream to packet losses affecting the bitstream during transmission. These distortion models are defined by the coding structure of the video encoder and the concealment algorithm used at the video decoder.

Rigorous work has been done to find an accurate loss-distortion model for hybrid video codecs such as H.263 and H.264/AVC [3], [1]. This model captures the sensitivity of a hybrid video codec composed of single-level I, P, and B frames when faced with packet losses due to transmission errors. The resulting loss distortion arises from the concealment mismatch due to the imperfection of the error concealment algorithms used at the decoder, in addition to the propagation error resulting from the temporally dependent structure of motion-compensated video codecs. Although this model accurately represents the loss-distortion of non-scalable video codecs, the hierarchical prediction structure of SVC introduces additional parameters that are not considered in the existing model of [3].

In SVC, temporal scalability is achieved through a hierarchical coding structure for B-frames. The coded video frames are organized in groups of pictures (GOPs) such that one frame in every GOP is either a P-frame or an I-frame and the remaining frames are B-frames [4]. Figure 1 illustrates the difference in coding structures between non-scalable H.264/AVC and scalable SVC.

Fig. 1. Comparison of the coding structure between non-scalable H.264/AVC and scalable SVC. The displayed frames are predicted from the frames to which the arrows are pointing.

The SVC frame sequence shown in Figure 1 is limited to a base layer with a standard (non-adaptive) GOP structure of eight frames per GOP. In non-scalable H.264, the loss of a B-frame will only incur a loss-distortion equal to the decoder concealment mismatch, since no other frames depend on a B-frame. In the hierarchical prediction structure of SVC, the loss-distortion caused by the loss of a B-frame depends on the lost frame's hierarchical level. Hence, we have derived a theoretical framework that models the effects of base-layer frame losses on the decoded picture quality of SVC.

The remainder of this paper is organized as follows. Section II discusses the coding structure of a hierarchical prediction coder and the concealment strategy used for base-layer frame-loss concealment. In Section III we derive our new loss-distortion model for SVC. Finally, we present our simulation results and conclusion in Section IV and Section V, respectively.

Fig. 2. Dyadic hierarchical structure of two GOPs with 4 temporal scalability levels. The frame-loss dependency among B-frames can be seen in the groups of empty frames: the lost frames are those marked with an X underneath, and the remaining empty frames are affected during decoding due to their dependency on the lost frames.

II. ANALYSIS OF THE CODING STRUCTURE AND CONCEALMENT STRATEGY IN SVC

The use of a hierarchical prediction coding structure, such as the one used in SVC, enables temporal scalability of the coded video bitstream. By employing hierarchical B-frames, SVC maintains the fully predictive structure already provided in H.264/AVC [4]. In this paper, we focus our analysis on the sensitivity of an SVC decoder, using the PC concealment strategy, to packet losses in the base layer of a coded bitstream.

A. Hierarchical Prediction Structure

The hierarchical prediction structure imposes a coding dependency between the B-frames of a GOP. The benefits of this dependency are twofold: it increases the coding efficiency compared to non-hierarchical predictive AVC [5], and it enables temporal scalability, where B-frames at the lowest hierarchical level (level 3 in Figure 2) can be discarded during transmission to produce a bitstream with a lower temporal resolution. For instance, if the B-frames at level 3 in Figure 2 were discarded, the temporal resolution would be reduced to half of the original resolution. Although SVC offers varying decomposition structures with multiple hierarchical levels, we will limit our discussion to the case of 4 dyadic hierarchy stages with 8 frames per GOP in the remainder of this paper.

Figure 2 shows the dependency of lower-level hierarchical B-frames on the higher-level frames. In this figure, frames B2, B7, and B12 are marked as lost. As a result, the loss of frame B2 causes the degradation of frames B1 and B3. The loss of B7 will only affect frame 7, since no other frames depend on it. Similarly, the loss of B12 causes the degradation of all the B-frames in GOP 2. As for losses affecting I and P frames, the loss of an I-frame affects all dependent B-frames and all subsequent frames until an Intra-refresh (IDR) frame is received, and the loss of a P-frame likewise affects all dependent B-frames and all subsequent frames until the next Intra-refresh frame arrives. A sketch of this level structure and its loss dependencies is given below.
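As a concrete illustration (ours, not part of the original paper), the following Python sketch computes the temporal level of each frame position in a dyadic GOP and the set of frames degraded by a B-frame loss. The function names are hypothetical; the behavior is checked against the examples of Figure 2.

```python
import math

def temporal_level(pos, gop=8):
    """Temporal level of the frame at position `pos` in a dyadic GOP:
    position 0 is the key picture (level 0); for gop=8, position 4 is
    level 1, positions 2 and 6 are level 2, and odd positions are level 3."""
    p = pos % gop
    if p == 0:
        return 0
    level = int(math.log2(gop))
    while p % 2 == 0:        # every factor of 2 moves one level up the hierarchy
        p //= 2
        level -= 1
    return level

def affected_by_loss(pos, gop=8):
    """Frames degraded when the B-frame at `pos` is lost: the lost frame
    plus all finer-level frames predicted from it within the same GOP."""
    stride = gop >> temporal_level(pos, gop)   # half-width of the dependency subtree
    return list(range(pos - stride + 1, pos + stride))

# Reproduces the loss dependencies of Figure 2:
assert affected_by_loss(2) == [1, 2, 3]            # losing B2 degrades B1 and B3
assert affected_by_loss(7) == [7]                  # no frames depend on B7
assert affected_by_loss(12) == list(range(9, 16))  # B12 drags down all of GOP 2
```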

B. The Picture Copy (PC) Concealment Strategy in SVC

In the case when full frames are lost in the base layer of an SVC coded stream, the decoder uses a picture copy (PC) concealment algorithm in which each lost frame is replaced by the previous temporal picture in the higher hierarchical level [4]. For instance, if frame B2 in Figure 2 is lost, it is concealed using frame I0. Similarly, frames B7 and B12 are concealed using frames B6 and B8, respectively. The concealment of I and P frames is performed using only the last I or P frames.

III. LOSS-DISTORTION MODELING

The loss-distortion model derived in [3] offers an accurate analysis of the decoder sensitivity of non-hierarchical motion-compensated video coders. However, due to the difference in coding structure of hierarchical predictive coders such as SVC, as illustrated in the previous sections, a new framework is required to accurately model the packet loss sensitivity.

Consider a base layer SVC stream with a constant GOP size $G$, dyadic hierarchical decomposition, and an intra-refresh period of $T = M \times G$ frames. That is, the coded video stream starts with an I-frame, has a P-frame after every $G - 1$ B-frames, and has another I-frame every $T$ pictures. Figure 3 shows an example of a coded sequence satisfying the above-mentioned conditions; a sketch of this frame pattern and of the PC reference selection follows below.
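The following minimal sketch (ours; helper names hypothetical) generates the base-layer frame-type pattern for given $G$ and $M$, and picks the PC concealment reference for a lost frame as the nearest preceding frame of a coarser temporal level, matching the examples above.

```python
def frame_types(G=8, M=3, num_frames=None):
    """Frame-type pattern of the base layer: an I-frame every T = M*G pictures,
    a key picture (I or P) every G pictures, B-frames elsewhere."""
    T = M * G
    n = num_frames if num_frames is not None else T + 1
    return ['I' if i % T == 0 else 'P' if i % G == 0 else 'B' for i in range(n)]

def pc_reference(pos, G=8):
    """Index of the frame used to conceal a lost frame at `pos` under PC
    concealment: the previous temporal picture at the next-higher level.
    A lost key picture is concealed by the previous key picture."""
    if pos % G == 0:
        return pos - G                 # last I or P frame
    stride = pos & -pos                # largest power of 2 dividing pos
    return pos - stride                # e.g. B2 -> I0, B7 -> B6, B12 -> B8

print(''.join(frame_types(G=8, M=3)))  # IBBBBBBBPBBBBBBBPBBBBBBBI
assert pc_reference(2) == 0 and pc_reference(7) == 6 and pc_reference(12) == 8
```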

Fig. 3. Example of an SVC base layer coding sequence with a fixed GOP size G and an intra-refresh period T. The $\Delta$'s indicate the mean square error between the specified frames.

In this paper, we will use the mean square error (MSE) between the decoded pictures without packet loss and the same decoded pictures with loss and after error concealment as a measure of loss-distortion. For tractability, and without loss of generality, let $G = 8$ and let $\Delta_1$ be the average MSE between two consecutive frames, $\Delta_2$ the average MSE between two frames that are one picture apart, $\Delta_3$ the average MSE between two frames that are two pictures apart, and so on, as shown in Figure 3.
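To make the measurement concrete, here is a minimal sketch (ours, not the authors') of how the $\Delta_d$ parameters could be estimated from loss-free decoded frames; the function name and the use of numpy arrays (e.g., luma planes) are our assumptions.

```python
import numpy as np

def average_deltas(frames, max_distance=8):
    """Estimate Delta_d for d = 1..max_distance: the average MSE between
    pairs of loss-free decoded frames that are d pictures apart
    (Delta_1 = consecutive frames, Delta_2 = one picture between them, ...).
    `frames` is a list of equally sized numpy arrays."""
    deltas = {}
    for d in range(1, max_distance + 1):
        pairs = [(frames[i], frames[i + d]) for i in range(len(frames) - d)]
        deltas[d] = float(np.mean([np.mean((a.astype(np.float64) - b) ** 2)
                                   for a, b in pairs]))
    return deltas
```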

The distortion caused by the loss of a frame amounts to the mismatch of the concealment strategy used to replace the lost frame, in addition to the propagation error that arises from the prediction dependency between the coded frames. Note that the error propagation due to B-frame losses is limited to a single GOP, whereas the error propagation due to I/P-frame loss extends over multiple GOPs. Therefore, we will divide our discussion into an analysis of the distortion contributed by the loss of B-frames and of the distortion caused by the loss of I or P frames, also known as key pictures.

A. Distortion due to B-frame loss

Consider the coded video sequence described above with a constant GOP size of 8 frames. Using a dyadic decomposition structure, the 8 frames in the GOP are classified into 4 temporal levels, where the key picture is at level 0, one B-frame is at level 1 (B4 in Figure 2), two B-frames are at level 2 (B2, B6 in Figure 2), and four B-frames are at level 3 (B1, B3, B5, B7). Let $D_{B_1}$ correspond to the average distortion induced by the loss of a level 1 frame, $D_{B_2}$ the average distortion induced by the loss of a level 2 frame, and $D_{B_3}$ the average distortion induced by the loss of a level 3 frame. We will assume that the level of motion is more or less homogeneous throughout a video sequence; hence, $\Delta_1$, $\Delta_2$, and $\Delta_4$ can correspond to the concealment mismatch between two consecutive frames, two frames that are one picture apart, and two frames that are three pictures apart, respectively.

Figure 2 illustrates the loss dependency of B-frames. The loss of a level 3 frame will only cause a distortion equal to the concealment mismatch due to the PC concealment algorithm, therefore

$D_{B_3} = \Delta_1. \qquad (1)$

The loss of a level 2 frame will cause a distortion equal to the concealment mismatch of the lost frame in addition to the prediction error propagating to two level 3 frames. Experimental tests have shown that the propagation error is equivalent to half the concealment mismatch for each dependent frame adjacent to the lost frame. For instance, if frame B2 is lost, then the resulting loss-distortion is equal to the concealment mismatch, that is, the MSE between picture 2 and picture 0, plus the propagating prediction error affecting pictures 1 and 3. As a result, the average loss-distortion of a level 2 frame is expressed as

$D_{B_2} = \Delta_2 + 2 \times (\Delta_2 / 2). \qquad (2)$

Finally, the average loss-distortion of a level 1 frame is similarly equal to the concealment mismatch of the lost frame plus the prediction error propagating to all level 2 and level 3 frames in the GOP. Furthermore, we ran experiments on two different video sequences (Foreman and Bus), simulating random losses independently at each level, and observed that, on average, the propagation error is reduced by 3 dB (a division by 2 in MSE) as we move farther away from the lost frame. The average loss-distortion of a level 1 frame is therefore expressed as

$D_{B_1} = \Delta_4 + 2 \times (\Delta_4 / 2) + 2 \times (\Delta_4 / 4) + 2 \times (\Delta_4 / 8). \qquad (3)$

These per-level distortions are illustrated in the sketch below.
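Equations (1)–(3) follow a single rule: the concealment mismatch at the PC reference distance, plus propagation error that halves (3 dB) with each hop away from the lost frame. The sketch below is our reading of that rule, generalized to any dyadic $G$ under our own assumptions; the $\Delta_d$ values in the example are made up.

```python
import math

def b_frame_distortions(deltas, G=8):
    """Per-level B-frame loss distortions for a dyadic GOP, following
    equations (1)-(3): concealment mismatch of the lost frame plus
    propagation error halving with each step away from it.
    `deltas[d]` is the average MSE between frames d pictures apart."""
    levels = int(math.log2(G))           # 3 temporal B-frame levels for G = 8
    D = {}
    for k in range(1, levels + 1):
        dist = G >> k                    # PC reference distance for a level-k frame
        mismatch = deltas[dist]
        # two dependent frames at each hop, each contributing mismatch / 2**hop
        propagation = sum(2 * mismatch / 2 ** hop for hop in range(1, dist))
        D[k] = mismatch + propagation
    return D

# Example with assumed (made-up) Delta values:
deltas = {1: 10.0, 2: 16.0, 4: 24.0}
D = b_frame_distortions(deltas)
# D[3] = Delta_1; D[2] = Delta_2 + 2(Delta_2/2); D[1] = Delta_4 + 2(Delta_4/2 + Delta_4/4 + Delta_4/8)
assert D[3] == 10.0 and D[2] == 32.0 and abs(D[1] - 66.0) < 1e-9
```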

Assuming equal error protection for all frames in an SVC base layer, each B-frame faces the same loss probability irrespective of its temporal level; therefore, the expected loss-distortion due to hierarchical B-frame loss is written as

$E[D_B] = \frac{1}{G-1} D_{B_1} + \frac{2}{G-1} D_{B_2} + \frac{4}{G-1} D_{B_3}. \qquad (4)$

B. Distortion due to key-picture (I/P-frame) loss

The distortion caused by the loss of a key picture is not restricted to a single GOP; rather, it extends to affect multiple GOPs. Figure 4 shows the dependence of the affected frames on the location of the key picture within the intra-refresh period.

Fig. 4. Range of frames affected by the loss of key pictures. The affected frames vary according to the position of the lost key frame in the intra-refresh period.

We let I be the IDR frame in an intra-refresh period and $D_I$ be its corresponding loss-distortion. Also, let the P-frames inside an intra-refresh period be labeled $P_1, P_2, \dots, P_{M-1}$, and let $D_{P_1}, D_{P_2}, \dots, D_{P_{M-1}}$ be their associated loss-distortions. As stated earlier, every frame in the base layer has an equal probability of loss; therefore, given that a loss hits a key picture, the probability that any particular one of the key pictures is lost is equal to $1/M$. The distortion caused by the loss of I is equal to:

• the concealment mismatch, approximated by $\Delta_8$ if the previous key picture is correctly decoded and by $2\Delta_8$ if the previous key picture is also lost;
• the error propagating into the B-frames of the GOPs adjacent to I;
• the error propagating into the subsequent P-frames and their dependencies.

The loss-distortion of I is therefore expressed as

$D_I = \left[\Delta_G (1 - p) + 2\Delta_G\, p\right] \times G + \sum_{i=1}^{M-1} \Delta_G (1 - \gamma i) \times G, \qquad \gamma = \frac{1}{M},$

where $G$ is the number of frames per GOP, $p$ is the probability that the previous key picture is lost, and $\gamma$ is the discount in the error signal caused by INTRA-update macroblocks and loop filtering in the subsequent key pictures. We assume in the above formulation that the B-frames in each GOP suffer the same degradation as the key picture.

This analysis can be generalized to the remaining key pictures in an intra-refresh period.

Hence, it follows that the loss-distortions of $P_1, \dots, P_j, \dots, P_{M-1}$ are expressed as

$D_{P_j} = \left[\Delta_G (1 - p) + 2\Delta_G\, p\right] \times G + \sum_{i=1}^{M-j-1} \Delta_G (1 - \gamma i) \times G, \qquad \gamma = \frac{1}{M-j}, \qquad (5)$

where $j \in \{0, 1, 2, \dots, M-1\}$ and the index $j = 0$ indicates the I-frame. Consequently, the expected distortion due to the loss of a key picture, $E[D_{key}]$, formulates to

$E[D_{key}] = \sum_{j=0}^{M-1} \frac{1}{M} D_{P_j} = \left[\Delta_G (1 - p) + 2\Delta_G\, p\right] \times G + \frac{1}{M} \sum_{j=0}^{M-1} \sum_{i=1}^{M-j-1} \Delta_G (1 - \gamma i) \times G, \qquad \gamma = \frac{1}{M-j}. \qquad (6)$

A sketch of this computation is given below.
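The following sketch (ours; function names hypothetical) evaluates equations (5) and (6) directly, with the per-GOP discount $\gamma = 1/(M-j)$ and the uniform $1/M$ averaging over key pictures.

```python
def key_picture_distortion(delta_G, G, M, j, p):
    """Loss-distortion D_Pj of the j-th key picture (j = 0 is the I-frame),
    following equation (5): concealment mismatch (doubled with probability p
    that the previous key picture is also lost) over one GOP, plus the error
    propagating into the remaining GOPs of the intra-refresh period,
    discounted by gamma = 1/(M - j) per GOP due to INTRA updates."""
    gamma = 1.0 / (M - j)
    mismatch = (delta_G * (1 - p) + 2 * delta_G * p) * G
    propagation = sum(delta_G * (1 - gamma * i) * G for i in range(1, M - j))
    return mismatch + propagation

def expected_key_distortion(delta_G, G, M, p):
    """E[D_key] of equation (6): each of the M key pictures in an
    intra-refresh period is lost with equal probability 1/M."""
    return sum(key_picture_distortion(delta_G, G, M, j, p)
               for j in range(M)) / M

# Example with assumed values: G = 8, M = 3 GOPs per intra period,
# delta_G = 30.0 (MSE between frames G apart), p = 0.05.
print(expected_key_distortion(30.0, 8, 3, 0.05))
```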

Finally, the total loss-distortion $D_{loss}$ of a hierarchical prediction coder faced with a packet loss rate $P_{loss}$ combines the distortions of equations (4) and (6) and is expressed as

$D_{loss}(P_{loss}) = \left(\frac{1}{G} E[D_{key}] + \frac{G-1}{G} E[D_B]\right) \times P_{loss}. \qquad (7)$

A sketch combining the two expectations follows.
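Building on the two sketches above (it reuses `b_frame_distortions` and `expected_key_distortion`), this illustrative snippet evaluates the full model of equation (7); the $\Delta_d$ values are assumptions, not measurements.

```python
def total_loss_distortion(deltas, G, M, p, loss_rate):
    """Total model of equation (7): a mixture of key-picture losses
    (one key picture per G frames) and B-frame losses (the remaining
    G-1 frames), scaled by the packet loss rate."""
    E_D_key = expected_key_distortion(deltas[G], G, M, p)
    D_B = b_frame_distortions(deltas, G)
    # equations (4)/(10): temporal level k holds 2**(k-1) of the G-1 B-frames
    E_D_B = sum(2 ** (k - 1) * D_B[k] for k in D_B) / (G - 1)
    return (E_D_key / G + (G - 1) / G * E_D_B) * loss_rate

# Assumed (made-up) Delta values, for illustration only:
deltas = {1: 10.0, 2: 16.0, 4: 24.0, 8: 30.0}
for M in (3, 6, 9, 12):
    print(M, total_loss_distortion(deltas, G=8, M=M, p=0.05, loss_rate=0.03))
```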

Fig. 5. Loss-distortion of the Foreman sequence with packet losses of 3%, 5%, and 10% as a function of the Intra-refresh period.

IV. SIMULATION RESULTS

In order to test our model, we encoded 209 frames each of the sequences Bus and Foreman with a GOP size of 8 at three different intra-refresh periods: 24, 48, and 96 frames per intra period. We used the SVC software JSVM-3 available from [6] and encoded a QCIF base layer at 15 frames per second and QP = 32, and added two FGS enhancement layers, each with a delta QP of 6, to obtain a picture quality with a final QP of 20. We simulate packet losses by subjecting the base layer alone to 3%, 5%, and 10% losses using the erasure simulation patterns of ITU-T VCEG Q15-I-16r1 to model Internet and 3GPP/3GPP2 packet-loss environments [7]. We do not subject the FGS enhancements to errors, since we are not interested in modeling the effects of losses in FGS enhancements in this paper. Moreover, when a base layer frame is lost, its FGS enhancements are discarded at the decoder.

Figures 5 and 6 reveal how closely our model approximates the experimental data collected from the simulations. The continuous line represents the distortion predicted by our model as a function of the intra-refresh period M in GOPs, ranging between 3 and 12. The circles are the measurements gathered from the loss experiments on the sequences Foreman and Bus for intra periods of 3, 6, and 9 GOPs. Notice that the model performs better on the Bus sequence, where the motion is almost uniform throughout the sequence. Since the Foreman sequence has varying levels of motion, the performance of the model drops slightly. This is because the average concealment errors $\Delta_1, \Delta_2, \Delta_4, \dots$ are taken over the entire sequence; taking average $\Delta$'s amounts to assuming that the level of motion is fixed. To overcome this problem, the average $\Delta$'s can be taken over shorter stretches of a sequence where the motion is homogeneous, and the model parameters can then be updated accordingly for real-time streaming, as in the sketch below.
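One possible realization of that per-stretch update, reusing the `average_deltas` helper sketched earlier; the window length of 64 frames is our arbitrary assumption, not a value from the paper.

```python
def windowed_deltas(frames, window=64, max_distance=8):
    """Re-estimate the Delta_d parameters over short stretches of the
    sequence (a sliding window) so the homogeneous-motion assumption only
    has to hold locally; the model parameters can be refreshed per window."""
    out = []
    for start in range(0, len(frames) - max_distance, window):
        chunk = frames[start:start + window + max_distance]
        out.append((start, average_deltas(chunk, max_distance)))
    return out
```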

Fig. 6. Loss-distortion of the Bus sequence with packet losses of 3%, 5%, and 10% as a function of the Intra-refresh period.

V. CONCLUSION

In this paper, we derived a theoretical framework that closely models the loss-distortion of hierarchical prediction coders such as SVC. Existing frameworks only cover non-scalable coders and therefore fall short of accurately modeling scalable coders due to the additional parameters that come into play. Our model can be expressed as

$D_{loss}(P_{loss}) = \left(\frac{1}{G} E[D_{key}] + \frac{G-1}{G} E[D_B]\right) \times P_{loss}, \qquad (8)$

where

$E[D_{key}] = \sum_{j=0}^{M-1} \frac{1}{M} D_{P_j} = \left[\Delta_G (1 - p) + 2\Delta_G\, p\right] \times G + \frac{1}{M} \sum_{j=0}^{M-1} \sum_{i=1}^{M-j-1} \Delta_G (1 - \gamma i) \times G, \qquad \gamma = \frac{1}{M-j}, \qquad (9)$

and

$E[D_B] = \sum_{k=1}^{\log_2(G)} \frac{2^{k-1}}{G-1} D_{B_k}. \qquad (10)$

Performance evaluations show that our model closely approximates the loss-distortion of hierarchical coders for video sequences with different levels of motion. The model also demonstrates exceptional performance in sequences with homogeneous levels of motion.

REFERENCES

[1] X. Zhu, E. Setton, and B. Girod, "Congestion-distortion optimized video transmission over ad hoc networks," vol. 20, no. 8, pp. 773–783, September 2005.
[2] Y. Wang, X. Lu, and E. Erkip, "Power efficient H.263 video transmission over wireless channels," in International Conference on Image Processing, vol. 1, September 2002, pp. 533–536.
[3] K. Stuhlmüller, N. Färber, M. Link, and B. Girod, "Analysis of video transmission over lossy channels," IEEE Journal on Selected Areas in Communications, vol. 18, no. 6, pp. 1012–1032, June 2000.
[4] J. Reichel, H. Schwarz, and M. Wien, "Joint Scalable Video Model JSVM-5," ISO/IEC JTC1/SC29/WG11 N7796, January 2006, Bangkok, Thailand.
[5] H. Schwarz, D. Marpe, and T. Wiegand, "Hierarchical B pictures," ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6 JVT-P014, July 2005, Poznan, Poland.
[6] "JSVM-3 Software," ISO/IEC JTC1/SC29/WG11 N7312. [Online]. Available: http://mpeg.nist.gov/reg/listwg11 73.php
[7] S. Wenger, "Error patterns in Internet video experiments," ITU-T SG16 document Q15-I-16-R1, October 1999.
