Quality monitoring for compressed video subjected to packet loss

Amy R. Reibman and Vinay Vaishampayan
AT&T Labs – Research
[email protected]

Abstract—We consider monitoring the quality of compressed video transmitted over a packet network from the perspective of a network service provider. We estimate the Mean-Squared Error (MSE) caused by packet loss by examining only the received video bitstream. We describe our FullParse method, which extracts sequence-specific information including spatio-temporal activity and the effects of error propagation. We show that FullParse performs well with manageable complexity.
I. INTRODUCTION

We consider monitoring the quality of compressed video transmitted over a packet network from the perspective of a network service provider. Our motivation is to provide a tool that can quickly and efficiently monitor the quality of video being transported across a network. We are particularly interested in measuring the impact of packet loss on the perceived video quality. Since the original video is not available for comparison, we focus on quality monitoring tools that operate on the bitstream. In this paper, we use mean squared error (MSE) as a rough measure of video quality.

Our methodology can be used for any motion-compensated video compression algorithm, including MPEG-1, MPEG-2, and H.263, and can be used in any network environment (UDP, ATM, etc.). However, in this paper we consider the specific application of transporting MPEG-2 video when Transport Packets may be lost.

In Section II, we discuss different measurements that can be used to determine video quality for a network monitoring application. In Section III, we describe the impact of packet losses on the decoded video. In Section IV, we present a detailed description of our FullParse method, which estimates the MSE from the received video bitstream. FullParse predicts the quality of individual videos by taking into account sequence-specific factors, including spatio-temporal activity and the effects of error propagation, with low enough complexity that it can easily be applied to many different video streams being sent across the network. Section V presents its performance.

II. MONITORING NETWORKED VIDEO

We focus on measuring the quality of MPEG-2 video subject to Transport Packet losses. Because we are interested in estimating quality in a network on many streams simultaneously, we choose to use only measurements on the video bitstream itself, as shown by measurement C in Figure 1.
Methods that use measurement A, the original video pixels (including the Full-Reference (FR) and Reduced-Reference (RR) methods in [1], [2], [3]), are not feasible, because in our application the original video is not available. Methods that use measurement B, the decoded video pixels, require a video decoder for every stream being monitored, which may be prohibitive for our application. Because we rely only on the video bitstream, C, the quality monitor must rely on assumptions regarding how the actual decoder handles error concealment. Further, we cannot account for any additional losses that occur between our measurement point and the user's decoder. However, with this method we can extract explicitly the location of a packet loss, and we do not require a complete decoder for every stream.

[Fig. 1. Measurements for evaluating video quality: the original video pixels (A) at the encoder input, the video bitstream (C) in the network, and the decoded video pixels (B) at the decoder output.]

0-7803-7965-9/03/$17.00 ©2003 IEEE

We focus on measuring the quality of video when it is subject to packet losses. To also measure the video quality when there are no packet losses, we could incorporate the methods presented in [4] and [5], which estimate quantization MSE using only bitstream information. We use the term MSE throughout this paper, although the output of the system could include bounds on the maximum or minimum error. This may be particularly useful to characterize different expected errors depending on the decoder concealment strategy.

III. IMPACT OF PACKET LOSSES

A single packet loss affects an initial set of macroblocks
$M_0$, with each pixel in those macroblocks having an initial error $e_{in}$. Because of the motion-compensated prediction process, these errors propagate temporally and spatially. Each of these three factors affects the resulting average MSE for the entire sequence, $\sigma^2_{seq}$. Each of these factors, which we consider to be random variables, depends on the particular packet that is lost. For example, a single packet loss could affect an entire frame or just a slice. A single packet loss would incur a large $e_{in}$ if it falls in an area with high motion, or a small $e_{in}$ in a mostly still region. The location of the packet loss relative to the prediction process (which includes I-block updates) controls the amount of subsequent error propagation.

Information from both the encoder and the decoder is necessary to determine the actual value of $\sigma^2_{seq}$ for a particular packet loss. Methods which compute the estimated decoder distortion at the encoder [6], [7], [8], [9] can make use of all information available to the encoder, but must rely on estimates of $M_0$ and on assumptions regarding the decoder concealment strategy (which affects the magnitude of the initial error $e_{in}$). In this paper, we address the situation of using the bitstream to be received by the decoder to compute the MSE caused by lost packets. We cannot know the contents of the lost packets and thus cannot know $e_{in}$ accurately. However, we can extract information to determine $M_0$ and the prediction process for the actual losses in the received bitstream. An essential aspect of this work is to find effective estimates of $e_{in}$ from bitstream information.

In general, it is important to consider the interaction among two or more losses [10]. However, in our application of MPEG-2 video with Transport Packet losses, the assumption that multiple losses are uncorrelated appears to hold fairly well for loss rates in the range of interest. Hence we consider a simple additive model to account for multiple packet losses.

ICME 2003

IV. VIDEO QUALITY FROM BITSTREAMS

In [11], we presented three methods to estimate MSE from the video bitstream, which vary in their ability to predict sequence-specific quality and in their ability to scale to measuring multiple streams in a network environment. These three methods lie at natural boundaries for how deeply to parse the bitstream, since system constraints may dictate how much bitstream processing is possible. In the FullParse (FP) method, the only restriction is that a complete decoding is not possible; no IDCT or motion compensation of decoded pixels takes place. In the QuickParse (QP) method, we use only high-level parsing, searching for start codes and extracting the header immediately following. In the NoParse (NP) method, we do not examine the contents of the video bitstream, but instead rely only on the number of packet losses and the average bit-rate.
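The additive model for multiple packet losses described in Section III can be sketched in a few lines. This is an illustration under our own naming, not the paper's implementation; the per-loss contributions would come from the FullParse recursion described below.

```python
# Sketch of the additive model: with losses assumed uncorrelated,
# per-loss squared-error contributions simply sum over the sequence.

def sequence_mse(per_loss_contributions, num_pixels):
    """Average sequence MSE under the additive (uncorrelated-loss) model.

    per_loss_contributions: total squared error each packet loss adds
    over the sequence (illustrative values below, not measured data).
    """
    return sum(per_loss_contributions) / num_pixels

# Three hypothetical losses over a 720x480 frame area:
mse = sequence_mse([1200.0, 300.0, 450.0], num_pixels=720 * 480)
```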
The three methods differ in the accuracy with which they can estimate the three random variables $M_0$, $e_{in}$, and the prediction process. In this paper, we present a detailed description of the FullParse method. By parsing the bitstream using a variable-length decoder, we extract sufficient information to determine exactly the set of macroblocks initially affected by a loss, $M_0$, and the prediction process. Section IV-A describes how FullParse uses these to recursively track the error at pixel granularity ($\sigma^2_p$).

The initial error for each pixel is difficult to extract from the bitstream because its value depends heavily on the source content. The strategy used by the decoder for concealment also affects $e_{in}$. When the impact of losses is estimated at the encoder [6], [7], [8], [9], [10], the contribution of the source to this value can be readily incorporated. However, because we estimate $e_{in}$ using the received bitstream, an essential aspect of our work is to find an accurate estimate of this parameter. In Section IV-B, we present a statistical model that allows us to estimate the initial error caused by a loss with macroblock granularity ($\sigma^2_{mb}$). In Section IV-C, we describe how we estimate the necessary model parameters using information extracted from the video bitstream.
A. Recursive estimation of error due to lost packets

FullParse uses the information $M_0$ and the prediction process, which it extracts exactly from the bitstream, to keep track of $\sigma^2_p(n,i)$, the reconstruction error at pixel $i$ in frame $n$, where the subscript $p$ denotes pixel granularity. Its recursive framework naturally accounts for the impact of error propagation.

We consider first those macroblocks that are received, or equivalently, those macroblocks not in $M_0$. In this case, FullParse uses the extracted motion information to propagate any previous error, depending on frame type. Assuming there is one B-frame between every reference frame, and that pixel $j_t$ from reference frame $n-t$ is used to predict pixel $i$ in the current frame, then for macroblocks not in $M_0$,

$$\sigma^2_p(n,i) = \begin{cases} 0 & \text{if (i)} \\ \sigma^2_p(n-2,\,j_2) & \text{if (ii)} \\ \sigma^2_p(n-r,\,j_r)/4 & \text{if (iii)} \\ \left(\sigma^2_p(n-1,\,j_1) + \sigma^2_p(n+1,\,j_{-1})\right)/2 & \text{if (iv)} \end{cases} \quad (1)$$

Condition (i) is when an I-macroblock is received, which resets any prior errors. Condition (ii) is when a P-macroblock is received, where no new errors are introduced but past errors propagate. Condition (iii) is when a B-macroblock is received and only the pixel in reference frame $n-r$ has nonzero error, while condition (iv) is when a B-macroblock is received and the pixels in both reference frames have nonzero errors. For (iv), when forming the squared error, we assume the cross-term contributes the same impact as each of the squared terms. Any inaccuracy in this assumption will not propagate into other frames, because this is a B-block.

Let $\hat{f}^i_n$ be the pixel value at the encoder at location $i$ and time $n$, reconstructed after encoding, and $\tilde{f}^i_n$ the corresponding reconstructed pixel value at the decoder after possible losses. If there have been no losses, then $\hat{f}^i_n = \tilde{f}^i_n$. For a missing macroblock, namely one in $M_0$, we assume the decoder uses motion-compensated error concealment from the closest reference frame.
Then, the pixel from frame $n-t$ that is used to estimate the current pixel $i$ in frame $n$ is denoted $\tilde{f}^k_{n-t}$. If simple replacement from the previous frame is done, then $k = i$. The decoder error at pixel $i$ in frame $n$ can be written as

$$\hat{f}^i_n - \tilde{f}^k_{n-t} = (\hat{f}^i_n - \hat{f}^k_{n-t}) + (\hat{f}^k_{n-t} - \tilde{f}^k_{n-t}) \quad (2)$$
The first term in (2) is the new error introduced by the lost packet, and the second term is the propagation of errors from previous packet losses. In general, these two terms are correlated [10]. However, they are nearly uncorrelated unless earlier errors affected the same pixel in the reference frame. Since for our application this rarely happens, we assume they are uncorrelated. Therefore, the second moment is
$$\sigma^2_p(n,i) = E\!\left[(\hat{f}^i_n - \hat{f}^k_{n-t})^2\right] + \sigma^2_p(n-t,\,k) \quad (3)$$
The first term has contributions both from the missing difference information and from the decoder using an incorrect motion estimate ($k$ instead of $j$). This first term, which we denote $\sigma^2_{p0}$, can be estimated statistically from data, as described below. The second term is the effect of past error propagation.
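As a sketch, the recursion in (1) and (3) can be written as a small per-pixel update. The function names and data layout here are ours, not the paper's, and the initial-error model $\sigma^2_{p0}$ is supplied externally (it comes from the model of Section IV-B):

```python
def propagated_error(frame_type, ref_errors):
    """Propagated squared error for a received macroblock pixel, per (1).

    ref_errors maps reference-frame offsets (e.g. -2 for a P-block with
    one intervening B-frame) to sigma_p^2 at the motion-compensated
    source pixel in that reference frame.
    """
    if frame_type == "I":
        return 0.0                      # (i): intra refresh resets the error
    if frame_type == "P":
        return ref_errors[-2]           # (ii): past errors propagate
    past, future = ref_errors[-1], ref_errors[+1]
    if past == 0.0 or future == 0.0:
        return (past + future) / 4.0    # (iii): only one reference in error
    return (past + future) / 2.0        # (iv): cross-term ~ each squared term

def concealed_error(sigma_p0_sq, ref_error):
    """Squared error for a concealed (lost) pixel, per (3).

    The modeled initial error adds to the error already present at the
    concealment source pixel; the two terms are assumed uncorrelated.
    """
    return sigma_p0_sq + ref_error

# A lost pixel concealed from a reference pixel already carrying error
# 12.5, with modeled initial error 50.0:
assert concealed_error(50.0, 12.5) == 62.5
# A received B-block pixel whose past reference carries error 8.0:
assert propagated_error("B", {-1: 8.0, +1: 0.0}) == 2.0
```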
B. A model for the initial MSE of a missing macroblock

The first term in (3) is not directly related to either the intra-frame energy or the difference-frame energy. Therefore, we use a statistical model to characterize $\sigma^2_{p0}$ for missing macroblocks. We assume the decoder uses no motion-compensated concealment, or $k = i$; the more general case follows easily. We can write $\sigma^2_{p0}(n,i)$ due to this macroblock being lost as

$$\sigma^2_{p0}(n,i) = E\!\left[(\hat{f}^i_n - \mu^i_n + \mu^i_n - \hat{f}^i_{n-t} + \mu^i_{n-t} - \mu^i_{n-t})^2\right]$$
$$= E\!\left[(\hat{f}^i_n - \mu^i_n)^2\right] + E\!\left[(\hat{f}^i_{n-t} - \mu^i_{n-t})^2\right] - 2E\!\left[(\hat{f}^i_n - \mu^i_n)(\hat{f}^i_{n-t} - \mu^i_{n-t})\right] + (\mu^i_n - \mu^i_{n-t})^2 \quad (4)$$

where $\mu^i_n = E[\hat{f}^i_n]$. If we let $\sigma^2_n = E[(\hat{f}^i_n - \mu^i_n)^2]$, and assume that $\sigma^2_{n-t} \approx \sigma^2_n$, then we can write this as

$$\sigma^2_{p0}(n,i) = 2\sigma^2_n(1 - \rho_{n,n-t}) + (\mu^i_n - \mu^i_{n-t})^2. \quad (5)$$

Here, we have defined the correlation between pixel $i$ in the current frame and the same pixel in the reference frame to be

$$\rho_{n,n-t} = \frac{E\!\left[(\hat{f}^i_n - \mu^i_n)(\hat{f}^i_{n-t} - \mu^i_{n-t})\right]}{\sigma^2_n}. \quad (6)$$

To compute the value of $\rho_{n,n-t}$ from information gathered from the bitstream, we assume the video is governed by a Gauss-Markov model that is separable temporally, horizontally, and vertically. Let $\tilde{\rho}_{n,n-t}$ be the temporal correlation between frames $n$ and $n-t$ that remains after motion compensation. That is,

$$\tilde{\rho}_{n,n-t} = \frac{E\!\left[(\hat{f}^i_n - \mu^i_n)(\hat{f}^j_{n-t} - \mu^j_{n-t})\right]}{\sigma^2_n}. \quad (7)$$

Also, let the horizontal and vertical spatial correlations between adjacent pixels in the same frame be $\rho_h$ and $\rho_v$, respectively. If the motion vector is $j - i = u$ with components $(u_h, u_v)$, then we can write the correlation between pixels $j$ and $i$ in the same frame as $\rho_{i,j} = \rho_h^{|u_h|} \rho_v^{|u_v|}$. The correlation between pixels $f^i_n$ and $f^m_{n-t}$ will be greatest if $m = j$, and will decrease as $|j - m|$ increases. Then $\rho_{n,n-t} = \tilde{\rho}_{n,n-t}\, \rho_{i,j}$, and (5) becomes

$$\sigma^2_{p0}(n,i) = 2\sigma^2_n\!\left(1 - \tilde{\rho}_{n,n-t}\, \rho_h^{|u_h|} \rho_v^{|u_v|}\right) + (\mu^i_n - \mu^i_{n-t})^2. \quad (8)$$

Now let us examine $\tilde{\rho}_{n,n-t}$, the temporal correlation after motion compensation. We can express the energy of the residual that is sent to the decoder for a P-macroblock using the same process as (4):

$$\sigma^2_d = E\!\left[(\hat{d}^i_n)^2\right] = E\!\left[(\hat{f}^i_n - \hat{f}^j_{n-t})^2\right] = 2\sigma^2_n(1 - \tilde{\rho}_{n,n-t}) + (\mu^i_n - \mu^j_{n-t})^2. \quad (9)$$

Solving (9) for $\tilde{\rho}_{n,n-t}$ and substituting into (8), we have the initial expected squared error due to this macroblock being lost when the decoder uses zero-motion concealment:

$$\sigma^2_{p0}(n,i) = 2\sigma^2_n + (\mu^i_n - \mu^i_{n-t})^2 - \rho_h^{|u_h|} \rho_v^{|u_v|} \left[2\sigma^2_n + (\mu^i_n - \mu^j_{n-t})^2 - \sigma^2_d\right]. \quad (10)$$

All the parameters in (10) can be estimated from parameters extracted from the bitstream with variable-length decoding and inverse quantization, as explained next. Motion compensation and the Inverse Discrete Cosine Transform are not necessary.

C. Estimating spatio-temporal video parameters

We estimate the seven statistical parameters in (10) from the received bitstream on a macroblock basis: $\mu^i_n$, $\sigma^2_n$, $\sigma^2_d$, $\rho_v$, $\rho_h$, $u_v$, and $u_h$.

In the current implementation, we keep an estimate of each of the seven parameters for each macroblock, received or not. For missing macroblocks, these parameters are estimated from their neighbors. For received macroblocks in I-frames, $\mu^i_n$, $\sigma^2_n$, $\rho_v$, and $\rho_h$ can be estimated from bitstream information, while $\sigma^2_d$, $u_h$, and $u_v$ cannot. For received macroblocks in P- and B-frames, $\sigma^2_d$, $u_h$, and $u_v$ can be easily estimated from the bitstream information, while $\sigma^2_n$, $\rho_v$, $\rho_h$, and $\mu^i_n$ cannot. With the exception of $\mu^i_n$, whenever a parameter cannot be estimated from the bitstream, we use the most recent estimate from that spatial location in the neighboring frames.

For a received I-macroblock, the pixel mean $\mu^i_n$ can be recovered from the bitstream directly, using the DC component of the DCT. The spatial variance $\sigma^2_n$ can also be extracted directly from the DCT coefficients. For received P- and B-macroblocks, the only way to compute the macroblock mean exactly is with a complete decode. Therefore, we obtain an estimate of $\mu^i_n$ for these blocks by adding their DC component to a motion-compensated estimate of the mean of the appropriate reference frame.

For a received P- or B-macroblock, the parameter $\sigma^2_d$ (which is a measure of the temporal energy of the macroblock) can be extracted directly using the received DCT coefficients. The motion components $u_v$ and $u_h$ are extracted directly from the bitstream for P- and B-macroblocks.

The vertical and horizontal spatial correlations $\rho_v$ and $\rho_h$ are estimated from the received I-block DCT coefficients. The horizontal spatial correlation is obtained from the DCT coefficients on the same horizontal row as the DC coefficient (denoted $D_{0,k}$) using two steps. First, we compute the horizontal spatial correlation assuming the pixels have mean 128:

$$\rho_{h,128} = \left[\sum_{k=0}^{8} C_{0,k}\, C_{k,1/2}\, D^2_{0,k}\right] \Big/ \left[\sum_{k=0}^{8} C_{0,k}\, C_{k,0}\, D^2_{0,k}\right] \quad (11)$$

where $C_{0,k}$ is the $(0,k)$-th element of the matrix comprising the DCT transform. Then we adjust the correlation to account for the fact that the pixels actually have a mean of $\mu^i_n$. The vertical spatial correlation $\rho_v$ can be estimated similarly.
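As a sketch, the zero-motion concealment model (10) maps directly to a small function of the seven extracted parameters. The variable names are ours, not the paper's; `mu_ref_zero` and `mu_ref_mc` denote $\mu^i_{n-t}$ (the zero-motion reference pixel) and $\mu^j_{n-t}$ (the motion-compensated reference pixel), respectively.

```python
def initial_mse_model(sigma_sq, mu_n, mu_ref_zero, mu_ref_mc,
                      sigma_d_sq, rho_h, rho_v, u_h, u_v):
    """Initial squared error for a lost macroblock pixel, per (10),
    assuming the decoder uses zero-motion concealment."""
    # Spatial correlation over the motion-vector displacement.
    spatial = (rho_h ** abs(u_h)) * (rho_v ** abs(u_v))
    return (2.0 * sigma_sq + (mu_n - mu_ref_zero) ** 2
            - spatial * (2.0 * sigma_sq + (mu_n - mu_ref_mc) ** 2
                         - sigma_d_sq))

# With zero motion (u = 0), (10) collapses to sigma_d^2 plus the
# difference of the squared mean terms:
err = initial_mse_model(sigma_sq=100.0, mu_n=120.0, mu_ref_zero=118.0,
                        mu_ref_mc=118.0, sigma_d_sq=30.0,
                        rho_h=0.95, rho_v=0.95, u_h=0, u_v=0)  # -> 30.0
```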
V. PERFORMANCE

We use eight 10-second MPEG-2 video sequences, divided into two groups of four sequences each. All video has 720
TABLE I
SLOPES RELATING PLR TO MSE FOR EIGHT SEQUENCES.

Group |     1 |     2 |     3 |     4
F     |  8645 | 20185 | 20750 | 19147
G     |  6069 |  9540 |  2551 |  5209
[Fig. 2. Estimated $\hat{\sigma}^2_{mb}$ vs. actual $\sigma^2_{mb}$ for P-frames in sequence F4; both axes show error per macroblock (log10), from 6 to 13.]
pixels, 480 lines, and 60 fields per second. Each sequence contains 299 frames. To characterize the sensitivity to loss, Table I indicates for each sequence the slope relating PLR to MSE. There is a wide range of activity; sequence F3 has the largest sensitivity to loss, and sequence G3 has the least.

Side information, in the form of the average initial loss for each macroblock, can improve the performance of FullParse. When a packet loss is detected, FullParse can determine the initial spatial extent $M_0$, use the side information to initialize the magnitude of the error, and extract the prediction process to propagate the initial error. This is more accurate than using the Gauss-Markov model; hence it provides an upper bound.

Figure 2 shows the initial error due to a loss estimated using the method in Sections IV-B and IV-C, for each macroblock in the P-frames of sequence F4. Other sequences display similar behavior. In general, the method is reasonably accurate in predicting the MSE for a macroblock. However, the Gauss-Markov model typically overestimates small initial MSE and typically underestimates large errors. The underestimation of large initial MSE on a macroblock basis typically causes FullParse to underestimate the sequence MSE, $\hat{\sigma}^2_{seq}$, as well.

To illustrate the ability of FullParse to accurately predict MSE on a sequence level, we use a least-squares fit of the estimated MSE to the actual MSE, $\hat{\sigma}^2_{seq} = b\,\sigma^2_{seq}$. We consider 9 different average PLRs of MPEG-2 Transport Packets ranging from 0.00005 to 0.005 and inject 25 random loss patterns at each rate into each of the eight sequences, for a total of 1800 samples of $\sigma^2_{seq}$. Table II shows the slope $b$ and the average residual squared error $s^2 = \sum [\hat{\sigma}^2_{seq} - b\,\sigma^2_{seq}]^2 / N$ for each sequence, for both FullParse and its bound. The slope for the bound is always very close to one, and the average residual energy is very small.
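The per-sequence fit $\hat{\sigma}^2_{seq} = b\,\sigma^2_{seq}$ is an ordinary least-squares fit through the origin; a minimal sketch, with illustrative inputs rather than the paper's data:

```python
def fit_slope(estimated, actual):
    """Least-squares slope b for the no-intercept model est = b * act,
    plus the average residual squared error s^2, as reported in Table II."""
    num = sum(e * a for e, a in zip(estimated, actual))
    den = sum(a * a for a in actual)
    b = num / den
    s2 = sum((e - b * a) ** 2
             for e, a in zip(estimated, actual)) / len(actual)
    return b, s2

# A perfectly proportional estimator recovers its slope with zero residual:
b, s2 = fit_slope([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])  # b == 2.0, s2 == 0.0
```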
If correlation among losses [10] had a significant impact in the situations considered in this paper, a linear fit would not be accurate; however, this is not the case.
TABLE II
PERFORMANCE OF METHODS FOR EIGHT SEQUENCES.

         |    Bound     |      FP
Sequence |    b |  s^2  |    b |  s^2
F1       | 0.96 |  0.82 | 1.37 |  6.43
F2       | 0.97 |  6.01 | 0.75 | 43.66
F3       | 0.99 |  4.51 | 1.07 | 44.44
F4       | 0.99 |  3.38 | 0.71 | 63.31
G1       | 0.96 |  0.52 | 0.74 |  1.28
G2       | 0.99 |  2.68 | 1.02 |  9.25
G3       | 0.95 |  0.18 | 0.87 |  9.29
G4       | 0.99 |  0.81 | 0.69 |  2.15
The slope for FullParse is not always near one because, as noted, the current estimation method tends to underestimate large initial errors. However, FullParse has slopes ranging from 0.69 to 1.37 for this set of sequences, with generally small average residual energy as well.

We also evaluate the ability of FullParse and its bound to accurately produce an alarm when the MSE exceeds a threshold. For the sequences we considered, FullParse has a probability of error of about 0.04 for all thresholds on the maximum tolerable MSE, while the bound has a probability of error of about 0.02. We expect that further improvements in the modeling of macroblock-level MSE will reduce the probability of error for FullParse.

The use of side information can improve the performance of the quality monitor. However, its use would require additional reliable bandwidth, and prior to adopting such an approach, the syntax used to describe it would need to be standardized.

REFERENCES

[1] S. Hemami and M. Masry, "Perceived quality metrics for low bit rate compressed video", Int. Conf. on Image Proc. (ICIP), pp. 721-724, Sept. 2002.
[2] H. R. Wu, T. Chen, S. Winkler, and Z. Yu, "Vision-model-based impairment metric to evaluate blocking artifacts in digital video", Proceedings of the IEEE, vol. 90, no. 1, pp. 154-169, Jan. 2002.
[3] S. Wolf and M. Pinson, "In-service performance metrics for MPEG-2 video systems", IAB, Montreux, Switzerland, Nov. 12-13, 1998.
[4] M. Knee, "The picture appraisal rating (PAR): a single-ended picture quality measure for MPEG-2", Int. Broadcast Convention, 2000.
[5] D. Turaga et al., "No reference PSNR estimation for compressed pictures", Int. Conf. on Image Proc. (ICIP), pp. III.61-III.64, Sept. 2002.
[6] K. Stühlmuller et al., "Analysis of video transmission over lossy channels", IEEE J. on Selected Areas in Comm., vol. 18, no. 6, pp. 1012-1032, June 2000.
[7] Z. He, J. Cai, and C. W. Chen, "Analytic end-to-end rate distortion modeling and control for packet video over wireless network", International Workshop on Packet Video, Pittsburgh, PA, April 2002.
[8] R. Zhang, S. L. Regunathan, and K. Rose, "Video coding with optimal Inter/Intra-mode switching for packet loss resilience", IEEE J. on Selected Areas in Comm., vol. 18, no. 6, pp. 966-976, June 2000.
[9] D. Wu, Y. T. Hou, B. Li, W. Zhu, Y.-Q. Zhang, and H. J. Chao, "An end-to-end approach for optimal mode selection in Internet video communication: Theory and application", IEEE J. on Selected Areas in Comm., vol. 18, no. 6, pp. 977-995, June 2000.
[10] Y. J. Liang, J. G. Apostolopoulos, and B. Girod, "Analysis of packet loss for compressed video: does burst-length matter?", ICASSP 2003, Hong Kong, April 2003.
[11] A. R. Reibman, Y. Sermadevi, and V. Vaishampayan, "Quality monitoring of video over the Internet", Asilomar Conference on Signals, Systems, and Computers, Nov. 3-6, 2002.