AN ERROR CONCEALMENT ALGORITHM FOR STREAMING VIDEO

S. Belfiore, M. Grangetto, E. Magli, G. Olmo
CERCOM - Center for Multimedia Radio Communications
Dipartimento di Elettronica - Politecnico di Torino
Corso Duca degli Abruzzi 24 - 10129 Torino - Italy
Ph.: +39-011-5644195 - Fax: +39-011-5644099
{grangetto,magli,olmo}@polito.it
ABSTRACT

A known problem in video streaming is that the loss of a packet usually results in the loss of a whole video frame. In this paper we propose an error concealment algorithm specifically designed to handle this sort of loss. The technique exploits information from a few past frames (namely, the motion vectors) in order to estimate the forward motion vectors of the last received frame. This information is used to project the last frame onto an estimate of the missing frame. The algorithm has been tested on MPEG-2 video, providing very satisfactory results and outperforming by several dBs in PSNR the concealment technique based on repetition of the last received frame.

1. INTRODUCTION

Video communications over networks have recently received significant attention from the scientific and industrial communities. Two kinds of video applications are usually considered, namely conversational and streaming. In conversational applications, such as videotelephony or videoconferencing, a peer-to-peer communication is established, and the same operations, e.g. video encoding and decoding, are performed at both sides. In video streaming, a client-server architecture is employed instead. The client queries the server for a specific video stream; a certain amount of data is pre-rolled, and then video data are transmitted by the server and decoded and displayed by the client in real time. In case of packet losses, the initial delay due to the pre-roll operation usually allows the client to ask for a given number of retransmissions; however, in practice a residual packet loss rate subsists.

In order to combat the effect of losses, decoder-side error concealment is highly desirable. Algorithms based on temporal, spatial, and spatio-temporal interpolation have been proposed [1, 2, 3]. These algorithms assume that either a single macroblock (MB) or a slice consisting of several consecutive MBs is lost. The latter case is quite realistic, while the former requires that a MB interleaving scheme be employed. Information on the available neighboring MBs, and on the MBs in the adjacent frames, is used to estimate the missing information, namely the motion vectors (MVs) of the missing MB and its texture information (see [4] for a review).

Error concealment algorithms have been very successful in conversational applications. On the contrary, their use in video streaming applications has been quite limited so far. The reason lies in the fact that, at low bit-rates, one complete compressed video
frame is smaller than the Ethernet MTU (12000 bits). As a consequence, when a packet is lost, it is most likely that a whole video frame is lost [5]. In this case, the assumptions that underlie most error concealment algorithms are not verified. In particular, the following problems are encountered:

i) one cannot estimate the missing MVs from those of the neighboring blocks (if such MVs are available);
ii) if the neighboring MVs are not available, one cannot use the neighboring MBs to re-estimate the MVs by means of block matching with respect to a reference frame;
iii) one cannot use the neighboring MBs to test the boundary match of a candidate replacement MB;
iv) partial decoding of DCT coefficients is not available;
v) syntax-based repairs are not effective, due to the bursty nature of the errors;
vi) even spatial error concealment is not viable, since no neighboring MBs are available.

Consequently, for video streaming applications one should design an error concealment algorithm that is capable of handling the loss of complete frames. Other approaches have also been proposed, such as adaptive media playout [6] and interpolation of missing frames via separate encoding of even and odd frames [7]; however, they are outside the scope of this paper.

In this paper we propose an error concealment algorithm specifically designed for video streaming applications. A notable feature of this algorithm is its capability of estimating entire missing frames of a streaming video sequence; to the best of our knowledge, no error concealment algorithm in the literature has such an ability. Moreover, the algorithm is able to exploit the multiple reference frame buffer provided by the most recent video coding standards (e.g. H.264, Annex N of H.263+) to improve the error concealment results. In particular, the algorithm is based on the estimation of the motion vector field of the missing frame from the previously received frames, and on the projection of the previous frame onto the missing one based on this estimated information. The proposed technique is compared with error concealment by repetition of the last frame; results are provided for MPEG-2 video, showing that the proposed algorithm significantly outperforms repetition, with PSNR gains of several dBs, especially in high-motion scenes.

2. PROPOSED ALGORITHM

The operational flow of the proposed error concealment algorithm is sketched in Fig. 1. First, one has to select how many past frames $K$ are employed by the algorithm; if the current (missing) frame has index $n$, we employ frames $n-K$ to $n-1$. In general, $K$ should not be larger than 5, since this is the maximum buffer size of most encoders supporting multiple reference frames.
Fig. 1. Block scheme of the proposed algorithm: frame buffer, generation of MV history, motion vector projection, spatial MV regularization, projection onto the missing frame (half-pel resolution), interpolation of missing pels, filtering and downsampling.
As will be clear from the algorithm description, what is actually needed by the algorithm are the MVs of the past $K$ frames and the pixel values of frame $n-1$. The algorithm then performs the operations described below. We indicate with $f_n(i,j)$ the pixel at coordinates $(i,j)$ in frame $n$, with $V^x_n(i,j)$ and $V^y_n(i,j)$ the horizontal and vertical MV components of the MB containing that pixel, and with $c_n(i,j)$ the coding mode of that MB (for simplicity, we only distinguish between I and P coding modes, for which $c_n(i,j)$ takes on the values 0 and 1, respectively). MVs follow the convention that $f_m(i,j)$ is predicted from $f_{m-1}(i - V^x_m(i,j),\, j - V^y_m(i,j))$. Notice that we define MVs for each pixel instead of for each MB, since we plan to estimate MVs at the pixel level.

1. Generation of a MV history. The first step consists in analyzing the previous frames, from $n-K$ to $n-1$, in order to understand which past pixels were used to predict those of frame $n-1$. To this end, a MV history (MVH) is generated for each pixel $f_{n-1}(i,j)$, according to the following rules (an example is reported in Fig. 2).

Level 1. For $f_{n-1}(i,j)$, consider its MV components $V^x_{n-1}(i,j)$ and $V^y_{n-1}(i,j)$ and store them at level 1 of the history, i.e. $\mathrm{MVH}^x_1(i,j) = V^x_{n-1}(i,j)$ and $\mathrm{MVH}^y_1(i,j) = V^y_{n-1}(i,j)$. If $c_{n-1}(i,j) = 0$ (intra MB), then store a termination symbol and stop; if the MVs point at a pixel in the previous frame that is outside the frame boundary (this may occur with video coders that support unrestricted motion vector modes, such as H.263+ and H.264), then store a termination symbol and stop; otherwise proceed to level 2.
Level 2. Consider the pixel in frame $n-2$ pointed at by the MVs stored at the previous level, i.e. $f_{n-2}(i - \mathrm{MVH}^x_1(i,j),\, j - \mathrm{MVH}^y_1(i,j))$. Then consider the MV of this new pixel. If it points at a pixel in frame $n-3$ that is outside the frame boundary, then store a termination symbol and stop; if the coding mode of this new pixel is zero, then store a termination symbol and stop; otherwise set $\mathrm{MVH}^x_2(i,j) = V^x_{n-2}(i - \mathrm{MVH}^x_1(i,j),\, j - \mathrm{MVH}^y_1(i,j))$, and analogously for the vertical component, and proceed to the next level.

Levels 3 to $K$. The same operations as in level 2 are performed. Notice that, unless the history has been terminated, level $k$ of the history contains $\mathrm{MVH}^x_k(i,j) = V^x_{n-k}\big(i - \sum_{l=1}^{k-1}\mathrm{MVH}^x_l(i,j),\; j - \sum_{l=1}^{k-1}\mathrm{MVH}^y_l(i,j)\big)$, with an analogous expression for the vertical component.
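The history construction can be summarized by the following sketch (Python with numpy), which chains per-pixel MVs backward through the reference buffer. All names (build_mv_history, mvx, mvy, intra) are illustrative and not taken from the paper; MV maps are assumed to be already expanded from MB to pixel resolution, and traced positions are truncated to integer-pel accuracy for simplicity.

# A minimal sketch of step 1 (MV history generation), assuming the decoder
# exposes per-pixel MV maps mvx[k], mvy[k] and intra-mode masks intra[k]
# for the K frames preceding the missing one (k = 0 is frame n-1).
import numpy as np

def build_mv_history(mvx, mvy, intra, K):
    H, W = mvx[0].shape
    mvh = [[[] for _ in range(W)] for _ in range(H)]  # per-pixel MV lists
    for i in range(H):
        for j in range(W):
            y, x = float(i), float(j)
            for k in range(K):                 # level k+1 of the history
                if intra[k][int(y), int(x)]:   # intra MB: terminate
                    break
                vx = mvx[k][int(y), int(x)]
                vy = mvy[k][int(y), int(x)]
                mvh[i][j].append((vx, vy))     # store MVs at this level
                y, x = y - vy, x - vx          # pixel it was predicted from
                if not (0 <= y < H and 0 <= x < W):
                    break                      # left the frame: terminate
    return mvh  # termination is implicit: the list simply stops growing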
2. MV projection onto missing frame. In the second step, the MV history from frame $n-K$ to frame $n-1$ is used to estimate where each pixel of frame $n-1$ would move in frame $n$, according to a given motion model. In particular, for each pixel we consider its history from level 1 up to level $K$ (or up to the termination symbol, if it occurs at a level less than $K$). In the hypothesis that motion in the sequence merely consists of pixels moving from one place to another, the history tells us by how much a pixel has moved from one frame to the following one, from frame $n-K$ to frame $n-1$.
Fig. 2. Generation of the MV history and its termination: the history of a pixel chains the MVs of the MBs it traverses in frames $n-1$, $n-2$, $n-3$, and is terminated (TERM) when an intra MB is met.
In order to estimate by how much a pixel has moved from frame $n-1$ to frame $n$, we assume a prior model for the motion of each pixel. In particular, we assume motion with constant velocity from frame $n-K$ (or the deepest level before termination) to frame $n$. This amounts to estimating the forward motion vector of each pixel as the mean of all MVs in its history. Namely, we assume that $f_{n-1}(i,j)$ will move into $f_n(i + \hat V^x(i,j),\, j + \hat V^y(i,j))$, with

$$\hat V^x(i,j) = \frac{1}{T} \sum_{k=1}^{T} \mathrm{MVH}^x_k(i,j),$$

where $T$ is the maximum value of $k$, $1 \le k \le K$, for which $\mathrm{MVH}^x_k(i,j)$ is not a termination symbol. Analogously one computes $\hat V^y(i,j)$. At the end of this step, a forward MV is available for each pixel, with components $\hat V^x(i,j)$ and $\hat V^y(i,j)$. We have also tested other temporal interpolators for the MVH, including the median filter and weighted averages with weights linearly and exponentially decaying over time; we have found that the mean value consistently yields the best results, thus validating the linear velocity assumption.

3. Spatial regularization of the MV field. In this step the MV field of the missing frame is regularized in order to avoid discontinuities in the MVs of pixels that were neighbors in frame $n-1$. To this end, the matrices $\hat V^x$ and $\hat V^y$ are filtered with a two-dimensional median filter; a 12x12 kernel size has turned out to be suitable. As will also be done in the following, the median filter is employed since it is known [8] to approximate an edge-preserving regularization operator based on a Markov random field prior, thus being able to smooth the data while allowing for a controlled amount of edges. After spatially smoothing the MV field, the resulting values of $\hat V^x$ and $\hat V^y$ are rounded off to half-pel accuracy. Moreover, it is checked whether any of the estimated forward MVs point at destinations outside the frame boundary; any such MVs are discarded. Notice that, after this process, forward MV information has been estimated at the pixel level rather than at the MB level; this is partially due to the spatial MV smoothing, and more importantly to the MV history, which can relate pixels in the same MB in frame $n-1$ with pixels in different MBs in the previous frames.
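A minimal sketch of steps 2 and 3, under the same illustrative names as above: the forward MV of each pixel is the mean of its history (the constant-velocity assumption), then the two MV matrices are regularized with a 12x12 two-dimensional median filter and rounded to half-pel accuracy; scipy.ndimage.median_filter performs the spatial smoothing.

# A sketch of steps 2-3: per-pixel forward MVs as the mean of each history,
# then spatial regularization and rounding to half-pel accuracy.
import numpy as np
from scipy.ndimage import median_filter

def estimate_forward_mvs(mvh, H, W):
    fwd_x = np.zeros((H, W))
    fwd_y = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            if mvh[i][j]:                      # empty history -> zero MV
                v = np.asarray(mvh[i][j])
                fwd_x[i, j] = v[:, 0].mean()   # mean over levels 1..T
                fwd_y[i, j] = v[:, 1].mean()
    fwd_x = median_filter(fwd_x, size=12)      # 12x12 median regularization
    fwd_y = median_filter(fwd_y, size=12)
    return np.round(2 * fwd_x) / 2, np.round(2 * fwd_y) / 2  # half-pel

The discarding of MVs that point outside the frame boundary is deferred to the projection step below, where it appears as a guard on the destination coordinates.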
4. Reconstruction of frame $n$. The reconstruction of the missing frame is initially performed at half-pel resolution, as shown in Fig. 3. An empty matrix is created, twice as large as the missing frame in both the horizontal and vertical directions. All pixels of frame $n-1$ are scanned in raster order. For each pixel, a 2x2 mask is filled with the value of $f_{n-1}(i,j)$ at the position indicated by its forward MV. Since it is possible that two different pixels from frame $n-1$ move into the same pixel in frame $n$, one must keep track of which pixels of frame $n$ are being filled. In case of multiple contributions to the same bin, the average of all contributions is used as the final value.

Fig. 3. Reconstruction of frame $n$ at half-pel resolution: each pixel $(i,j)$ of frame $n-1$ is mapped by its forward motion vector onto the half-pel grid, and a map of filled pixels is maintained.
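Step 4 can be sketched as follows (illustrative names again): each pixel of frame $n-1$ writes its value through a 2x2 mask at the half-pel destination indicated by its forward MV, multiple contributions to the same bin are averaged, and a boolean map records which positions were filled.

# A sketch of step 4: project frame n-1 onto a half-pel grid for frame n,
# averaging colliding contributions; unfilled bins are handled in step 5.
import numpy as np

def project_frame(prev, fwd_x, fwd_y):
    H, W = prev.shape
    acc = np.zeros((2 * H, 2 * W))                  # half-pel accumulator
    cnt = np.zeros((2 * H, 2 * W), dtype=np.int32)  # contributions per bin
    for i in range(H):
        for j in range(W):
            di = int(round(2 * (i + fwd_y[i, j])))  # half-pel destination
            dj = int(round(2 * (j + fwd_x[i, j])))
            if 0 <= di < 2 * H - 1 and 0 <= dj < 2 * W - 1:
                acc[di:di + 2, dj:dj + 2] += prev[i, j]  # 2x2 mask
                cnt[di:di + 2, dj:dj + 2] += 1
    filled = cnt > 0
    acc[filled] /= cnt[filled]      # average multiple contributions
    return acc, filled              # map of filled half-pel positions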
5. Interpolation of missing pixels. Since adjacent pixels in frame $n-1$ can move by different amounts toward frame $n$, it is likely that some pixels of frame $n$ are left unfilled. A map of such pixels results from step 4, since one must keep track of filled pixels in order to average multiple contributions. The missing pixels are then scanned in raster order, and each is estimated as the median of the filled pixels in a neighborhood centered around the missing one. A 7x7 size has been found to provide good results.

6. Filtering and downsampling. As a final step, in order to obtain an estimated frame of the same size as all other frames, the half-pel estimate is filtered by a 2x2 averaging kernel and downsampled by a factor of two in both the horizontal and vertical directions.
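Finally, steps 5 and 6 in the same sketch style: holes in the half-pel buffer are filled with the median of the filled pixels in a 7x7 neighborhood, after which a 2x2 averaging kernel and factor-two downsampling restore the original frame size.

# A sketch of steps 5-6, assuming acc/filled from the projection sketch.
import numpy as np

def fill_and_downsample(acc, filled, radius=3):    # 2*radius+1 = 7
    H2, W2 = acc.shape
    out = acc.copy()
    for i, j in zip(*np.nonzero(~filled)):         # raster scan of holes
        i0, i1 = max(i - radius, 0), min(i + radius + 1, H2)
        j0, j1 = max(j - radius, 0), min(j + radius + 1, W2)
        nb = acc[i0:i1, j0:j1][filled[i0:i1, j0:j1]]
        if nb.size:
            out[i, j] = np.median(nb)              # median of filled pixels
    # 2x2 averaging kernel + factor-two downsampling in both directions
    return 0.25 * (out[0::2, 0::2] + out[1::2, 0::2]
                   + out[0::2, 1::2] + out[1::2, 1::2])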
3. EXPERIMENTAL RESULTS

The proposed error concealment algorithm has been tested using an MPEG-2 codec. In the following we report results for two parts of the foreman sequence. The first part ranges from frame 30 to 44 and contains relatively low motion, whereas the second part ranges from frame 180 to 194 and contains fast motion of both the camera and the speaker. The sequence has been coded at 300 kbit/s, with 30 frames/s and GOPs of 15 frames.

For both parts of the sequence, we have carried out the following experiment. The first 5 frames of each GOP have been used to fill the reference frame buffer; then we have successively simulated the loss of one frame (from frame 6 to 15). We have tried several values of $K$ from 1 to 5. We have found that higher values of $K$ yield better results in case of low or no motion, whereas lower values are more suitable in case of fast motion, especially for low temporal resolution video, in which two consecutive frames can exhibit significant differences. An intermediate value of $K$ is a good compromise, and has been used throughout our simulations. Notice that, since we only perform decoding up to the lost frame, and do not continue decoding after error concealment, this experiment does not measure error propagation in the frames following the lost one. Error propagation results have shown that a higher PSNR on a given concealed frame typically leads to the propagation of a smaller error in the following frames; however, these results are not reported here for brevity.

Tab. 1 reports the PSNR obtained by the proposed algorithm when each frame is lost. As a term of comparison, we have performed error concealment by repetition of the last frame, since no other existing algorithm is able to deal with the loss of a whole frame. As can be seen, in the high-motion GOP the repetition algorithm yields very poor PSNR, since the difference between two consecutive frames is very large, even at 30 fps. On the contrary, the proposed algorithm is able to estimate the scene motion and to partially compensate for it, thus yielding reasonably high PSNR and attaining a gain from 6 to 8 dB with respect to simple repetition. As for the low-motion GOP, significant gains are still achieved, up to 3 dB; these are lower than in the previous case because, in a low-motion scene, the previous frame is always fairly similar to the lost one.

In order to show the visual quality of the frames estimated by the proposed algorithm, in Fig. 4-a,b,c,d and 4-e,f,g,h we report the reconstructed frame for the low- and high-motion GOPs respectively, for frame 8 of the GOP in both cases. In particular, Fig. 4-a shows the missing frame, and Fig. 4-b the frame estimated by the proposed algorithm. As for visual quality, it can be seen that, besides providing a satisfactory PSNR, the algorithm yields vivid and crisp reconstructions, free of visible artifacts. In fact, the repeated use of the median filter as regularization operator achieves a good trade-off between edge preservation and smoothness in the flat image regions. In order to visually quantify the ability of the proposed algorithm to estimate the missing frame from the motion in the previous ones, in Fig. 4-c and 4-d we report the differences, with respect to the missing frame, of the previous frame and of our estimated frame, respectively. White and black regions represent areas where the motion between frames $n-1$ and $n$ has not been compensated for. As can be seen, the proposed algorithm is able to compensate for a significant part of the motion; this is especially visible in Fig. 4-g and 4-h, related to the high-motion GOP, where the error reduction results in a PSNR gain of 6 to 8 dB.

As for complexity, the following remarks can be made. Although the algorithm performs many operations, most of them merely amount to reorganizing existing data, e.g. the MVs; notice that low values of $K$ are employed, thus reducing the size of the data to be analyzed. Repeated use is made of a median filter (steps 3 and 5) and of one- and two-dimensional averaging filters (steps 2, 4, and 6); these filters can be efficiently implemented on most architectures. The complexity of the proposed algorithm is comparable with that of a typical temporal error concealment algorithm, and significantly lower than that of any spatial interpolator, thus allowing for real-time decoding with error concealment.
Table 1. PSNR (dB) of the proposed algorithm vs. replacement with the last frame, in a low-motion and a high-motion GOP.

GOP 1 (low motion)
Index of lost frame        6      7      8      9     10     11     12     13     14     15
Proposed algorithm     29.88  30.57  30.69  32.01  31.41  31.39  31.54  30.58  30.87  30.35
Last frame replacement 27.47  27.86  27.68  29.01  29.78  29.86  29.56  29.92  31.42  30.46
PSNR gain               2.41   2.71   3.01   3.00   1.63   1.53   1.98   0.66  -0.55  -0.11

GOP 2 (high motion)
Index of lost frame        6      7      8      9     10     11     12     13     14     15
Proposed algorithm     23.82  23.96  24.57  24.57  24.48  25.08  25.54  26.65  27.20  28.20
Last frame replacement 17.33  17.71  17.98  18.23  18.61  19.18  19.79  19.80  19.92  20.15
PSNR gain               6.49   6.25   6.59   6.34   5.87   5.90   5.75   6.85   7.28   8.05
Fig. 4. Low-motion GOP: (a) missing frame; (b) frame estimated by the proposed algorithm; (c) difference between the missing frame and frame $n-1$; (d) difference between the missing frame and the frame estimated by the proposed algorithm. High-motion GOP: (e) missing frame; (f) frame estimated by the proposed algorithm; (g) difference between the missing frame and frame $n-1$; (h) difference between the missing frame and the frame estimated by the proposed algorithm.
4. CONCLUSIONS

In this paper we have proposed a new error concealment algorithm for streaming video. Remarkably, the algorithm has turned out to be capable of satisfactorily handling the loss of whole frames, which is typical of the video streaming case. This is done by exploiting motion information in a few past frames to estimate the forward motion of the last received frame. The algorithm has been tested on MPEG-2 video and compared with repetition of the last frame. It has been shown that the proposed technique outperforms repetition by several dBs in PSNR, especially in high-motion scenes.

5. REFERENCES

[1] L. Atzori, F.G.B. De Natale, and C. Perra, "A spatio-temporal concealment technique using boundary matching algorithm and mesh-based warping (BMA-MBW)," IEEE Trans. Multimedia, vol. 3, no. 3, pp. 326-338, Sept. 2001.
[2] D.S. Turaga and T. Chen, "Model-based error concealment for wireless video," IEEE Trans. Circ. Sys. Video Tech., vol. 12, no. 6, pp. 483-495, June 2002.
[3] S. Belfiore, L. Crisà, M. Grangetto, E. Magli, and G. Olmo, "Robust and edge-preserving video error concealment by coarse-to-fine block replenishment," in Proc. of IEEE ICASSP, 2002.
[4] Y. Wang, S. Wenger, J. Wen, and A.K. Katsaggelos, "Error resilient video coding techniques," IEEE Sig. Proc. Mag., pp. 61-82, July 2000.
[5] J. Lu, "Signal processing for Internet video streaming: a review," in Proc. of SPIE Image and Video Comm. and Proc., 2000.
[6] M. Kalman, E. Steinbach, and B. Girod, "R-D optimized media streaming enhanced with adaptive media playout," in Proc. of IEEE ICME, 2002.
[7] J.G. Apostolopoulos, "Error-resilient video compression through the use of multiple states," in Proc. of IEEE ICIP, 2000.
[8] P. Salama, N.B. Shroff, and E.J. Delp, "Error concealment in MPEG video streams over ATM networks," IEEE J. Sel. Areas in Comm., vol. 18, no. 6, pp. 1129-1144, June 2000.