A COMPRESSED-DOMAIN VISUAL INFORMATION EMBEDDING ALGORITHM FOR MPEG2 HDTV STREAMS

Bin Yu, Klara Nahrstedt
Department of Computer Science, University of Illinois at Urbana-Champaign
DCL, 1304 W. Springfield, Urbana IL 61801
binyu, [email protected]

ABSTRACT

Many features of traditional TV service are becoming desirable for software multimedia applications, which can provide open solutions and improve flexibility. In this work, we study a new kind of service, which we call "visual information embedding", inspired by the Picture-In-Picture feature of TV. Building on previous work in MPEG compressed-domain algorithms, we propose a "backtracking" approach that reduces the decoding complexity by up to 90% and enables realtime processing of MPEG2 streams, at the price of a delay of one Group-Of-Pictures period. We have implemented a software realtime visual information embedding gateway for HDTV streams, and the experimental results show that our solution is practical and efficient.

1. INTRODUCTION

The MPEG-2 standard [1] has been widely accepted as the standard for digital video transmission and storage, and it will gain further popularity as HDTV [2] comes to people's homes. Beyond high quality video and audio, richer functionality and higher interactivity are becoming desirable features for the next generation of home TV service. Currently, this job is done by Set-Top Boxes in a centralized, closed fashion. All processing of the video streams, such as Picture-in-Picture, is done at a single point via pixel-domain overlaying after decompression. The proprietary nature of Set-Top Boxes means that to change TV broadcasters, a customer must purchase another specific set-top box. Also, little programmability is exposed to PC applications. Therefore, very limited user-driven customization of the video is possible, and the growing computing power of desktop PCs is wasted. In these circumstances, a PC-based gateway approach provides an alternative that is cheaper and more flexible. General-purpose PCs are already available to most users at home, so further enhancing them to process TV streams is a bonus.
The digital nature of TV streams makes PCs a natural choice for manipulating them to allow more customization and interactivity for TV viewers. Moreover, new digital video transmission standards such as IEEE 1394 [3] allow the MPEG2 stream to be sent directly to 1394-enabled TV sets, saving the cost of an extra decoding device. Meanwhile, broadband networking technologies such as cable modems and xDSL are bringing to our homes and office buildings Internet connections with 10Mbps or even higher bandwidth at acceptable and falling prices, enabling high quality Internet TV applications.

(This work was supported by the NASA grant under contract number NASA NAG 2-1205.)

In this paper, we focus on one special service: "visual information embedding", or "VIE" for short. The idea of VIE comes from the Picture-In-Picture feature of TV, but we extend it to mean overlaying any visual information onto the current video frame: a small-scale video window from another TV program, just as in Picture-In-Picture; a live stream from a digital camera; or sports scores, weather reports, news updates, stock information, or images downloaded from an Internet data engine. The benefits of such a software approach are manifold, including finer customization and higher interactivity owing to much greater control over the TV streams. Besides, the collective information from the Home Network can easily be processed by the PC and presented on the TV screen. However, because PCs cannot fully decode and then re-encode an MPEG2 stream in realtime, we need to work in the compressed domain, and the motion compensation mechanism heavily used in MPEG2 then leads to a "Wrong Reference" problem. To tackle this problem, we need to decode the affected macroblocks, change their prediction errors, and then re-encode them. Previous solutions proposed DCT-compressed-domain operations and optimized the re-encoding process, but they still suffer from the intensive decoding step. In fact, we have discovered that a significant fraction (up to 90%) of the macroblocks decoded are never used, and we propose a new "backtracking" algorithm to identify the minimum set of macroblocks that need to be decoded. At the cost of a delay of one Group-Of-Pictures (GOP for short) period, this enables on-line processing of even HDTV streams. The remainder of this paper is organized as follows. In Section 2, the problem is described in detail, and in Section 3 we give our core algorithm for the VIE process.
Experimental results are presented in Section 4, and the final section concludes.

2. MACROBLOCK REPLACEMENT AND THE WRONG REFERENCE PROBLEM

2.1. Domain Definition

To simplify our discussion, we divide the MPEG2 decoding process into two steps. The first step strips off the system layer header, the picture header, the slice header and the macroblock header (address increment, macroblock modes, motion vectors, coded block pattern, etc.), and then run-length decodes the data to obtain the DCT coefficients. This intermediate layer of macroblock headers and DCT coefficients is what motion compensation (MC) operates on, so we call it the "MC domain". The second step of decoding applies the Inverse DCT to the MC domain data and then performs motion compensation to reconstruct the frame data. Note that when the actual pixels of the frame are not needed, such as when the video will not be rendered, motion compensation can be done directly in the DCT domain [4], saving the cost of the IDCT and DCT operations. In either case, we call the resulting data after motion compensation the "Reconstructed Data" domain, or RD domain.

2.2. Macroblock Replacement
Figure 1: Goal for Information Embedding

Now we are ready to discuss the overlaying process, shown in Figure 1. The original frame and the resulting frame are both in standard MPEG2 format, and we assume that the overlaying step should be transparent to any video player, software or hardware. We also assume that the content to be embedded is already in a VIE-friendly format, that is, it has appropriate motion vectors and DCT coefficients, as an original MPEG clip does. The idea of overlaying is straightforward: we decode the stream only half way, to the MC domain, and replace those macroblocks of the original background frame that lie within the foreground window with macroblocks from the new content; then we pack the whole stream back into MPEG2 system layer format. This way, there can be any number of foreground windows, and each window can be of any size and positioned anywhere on the screen, so long as it is aligned with the original macroblock boundaries and the windows do not overlap each other.

2.3. Wrong Reference Problem

Figure 2: The Wrong Reference Problem

The dependence between MPEG frames leads to a serious problem when we replace part of one frame with new content. Consider the chain of I → P1 → P2 frames in Figure 2, where the foreground area is the smaller rectangle (FG), the background frame area is the larger rectangle (BG), and the small squares represent macroblocks (MBs). Assume MB2 in frame P2 uses the data of MB1 in frame P1 as a content predictor. Since the VIE process changed MB1's data, MB2 will not be decoded correctly if it still uses the original motion vector and prediction error. Worse, this wrongly decoded macroblock may itself be used as a reference by other macroblocks in later frames, causing the error to propagate all the way down the motion prediction link until the next I frame appears. To fix MB2's MC data, we need to know MB2's RD domain data, and therefore MB1's. Note also that MB1 itself may rely on the data of another macroblock (MB0 in Figure 2) as a prediction reference, so to know MB1's value, MB0 must also be decoded to the RD domain. In the following discussion, we will call macroblocks like MB0 and MB1 "d-MBs", since their Data needs to be decoded to the RD domain for future reference by other macroblocks. Similarly, we will call macroblocks like MB2 "c-MBs", since their reference blocks are wrong and their MC data must be Changed.

2.4. Related Work

Many MPEG compressed-domain algorithms have been developed, among which [4], [5] and [6] provide a good starting point for our work. In [4], Noguchi, Messerschmitt and Chang proposed to decode all reference frames to the RD domain in case future macroblocks need them for reference, re-calculate the prediction errors of c-MBs, and then re-encode them back to the MC domain for further encoding. They thoroughly defined how to do motion compensation and motion estimation directly in the DCT domain to save computation and prevent the losses associated with DCT/IDCT operations. For the new motion estimation, they proposed a heuristic called "Inference" that examines only the 2 most likely candidate reference macroblocks, located at the edge of the foreground window, such as MB3 and MB4 in Figure 2. However, decoding all reference frames to the RD domain is still a very costly step that prevents efficient on-line processing. Later, in [5] and [6], algorithms for visible watermarking and captioning were proposed based on the assumption that the embedded content is "added" to the background frame instead of "replacing" it. In such cases, the c-MBs' new prediction errors can be calculated without knowing the referenced d-MBs' RD domain data, so the intensive decoding is bypassed. However, these works assume that the foreground data's maximum luminance value does not differ much from its average value and can be adapted according to the background value. Though these assumptions generally hold for visible watermarks and captions, they do not apply in our case, where it is more important that the foreground content be shown clearly and stably.

3. SOLUTION
Our task is to further reduce the computation involved in the decoding process, and our solution is based on two key observations. First, most of the macroblocks (up to 90% according to our experimental results) decoded to the RD domain are not used at all if we use the "Inference" heuristic for the motion re-estimation of c-MBs. These macroblocks were decoded in previous work only because it is not known in advance exactly which subset of MBs will be needed in the future. Second, various delays already exist in today's distributed video applications, such as queuing delay in the network and buffering delay at receivers. Therefore, it may be desirable to reduce the computational complexity at the cost of a slightly longer delay.

3.1. The Backtracking Process

Assuming a longer delay is acceptable, we can simulate "predicting the future" by "buffering the past". That is, we decode each frame to the MC domain and buffer the motion vectors and quantized DCT coefficients. After we have gone through the whole GOP, all the c-MBs can be identified by testing whether a macroblock's motion vectors point somewhere inside the foreground area. Similarly, a macroblock is a d-MB if some future d-MB or c-MB uses it as a prediction reference. Therefore, by following the motion prediction links in the inverse direction, we can determine exactly which macroblocks are d-MBs and c-MBs. We call this a "backtracking" process, and a typical inverse motion prediction link discovered this way is shown in Figure 3.

Figure 3: A typical inverse motion prediction link discovered by backtracking

Note that at each step the number of macroblocks may double or quadruple, since prediction does not follow regular block boundaries. In the worst case, for a Group-Of-Pictures pattern of IBBPBBPBBPBBPBB, the data of one macroblock in the first I frame may be referenced 4 times to derive the content of macroblocks in all 4 following P frames, as well as many other macroblocks in B frames. For a maximum search distance of 16 macroblocks used by the encoder when searching for the optimal prediction block, this means a prediction link may stride 4x16=64 macroblocks across the frame. Once the c-MBs and d-MBs are marked, we resume the decoding and overlaying process from the I frame at the beginning of the GOP. Since by this time we have the MC domain data, we can continue the VIE processing as follows: only for c-MBs and d-MBs do we perform motion compensation to obtain their RD domain data, and only for c-MBs do we perform motion estimation to obtain their new motion modes and prediction errors. For the other macroblocks, we do nothing, and their MC domain data is used directly in the later re-encoding phase. We want to point out that this tracking is always needed within the motion compensation process of any solution, so it does not incur any extra cost. Rather, we are just batching up the tracking for a GOP of frames until the end of the GOP instead of doing it separately for each frame. At the price of delaying the stream for one GOP period, a great saving can be expected, since only c-MBs and d-MBs are decoded through motion compensation, and these are typically far fewer than the total number of macroblocks in the I and P frames. In addition, the idea of trading time for knowledge of the minimum set of macroblocks requiring motion compensation can also be applied to many other video manipulation operations such as video clipping and video juxtaposition.

3.2. Pseudo-code

In summary, we give the pseudo-code of our VIE algorithm. We assume only one foreground window and that each GOP contains 15 frames in the pattern IBBPBBPBBPBBPBB. Because the frames arrive in coding order, which differs from the display order, the last frame in a group is the second frame (a B frame) after the I frame of the next GOP, as shown in Figure 4. Also note that to work in a “delayed
Frame coming order:      I B B P B B P B B P B B P B B I B B
Frames tested for c-MBs: - - - P B B P B B P B B P B B - B B
Frames tested for d-MBs: I - - P - - P - - P - - P - - I - -
Figure 4: Frame pattern

on-line” fashion, we still decode one new frame and encode one changed frame in each iteration; the only difference is that the frame being encoded is one GOP period older than the frame being decoded.

/* starting from the first frame */
i = 0;
while (true) {
    /* decode one new frame to the MC domain */
    for each MB in the i-th frame {
        decode and store MB headers and DCT coefficients;
    }

    /* accumulate a whole GOP (15 frames plus 2 trailing B-frames)
       before backtracking and encoding can start */
    if (i < 17) { i++; continue; }

    /* run the backtracking algorithm once per GOP */
    if (i MOD 15 == 2) {
        for each inter-coded frame F in this GOP, backward {
            for each MB M in F {
                if (M refers to FG MBs in its reference frame(s)) {
                    mark M as a c-MB;
                    mark the referred MBs as d-MBs;
                }
                if (M is marked as a d-MB) {
                    mark the MBs that M refers to as d-MBs;
                }
            }
        }
    }

    /* re-encode from MC or RD domain data */
    for each MB M in the (i-14)th frame {
        if (M is in the foreground window) {
            insert a new MB from the new content;
            continue;
        }
        if (M is a c-MB or a d-MB) {
            decode and buffer M's RD domain data;
        }
        if (M is not a c-MB) {
            encode M using its stored MC domain data;
        } else {
            locate the new candidate reference MB Y on the FG window edge;
            calculate the new motion vector V from M to Y;
            calculate the prediction difference D between M and Y;
            encode M using V and D;
        }
    }
    i++;
}
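To make the backtracking step concrete, the following Python sketch marks c-MBs and d-MBs over one buffered GOP. The data layout is a hypothetical simplification (each frame is a dict mapping a macroblock index to the list of (frame, MB) pairs it predicts from, and the foreground window is a set of spatial MB indices), not the actual MC domain structures of our gateway.

```python
# Sketch of the backtracking pass over one buffered GOP:
#   c-MBs: macroblocks that predict from inside the foreground window,
#          so their prediction errors are now wrong.
#   d-MBs: macroblocks whose RD domain data is needed by a c-MB or
#          another d-MB further down the prediction chain.
def backtrack(gop, fg_mbs):
    """gop: list of frames in decoding order; gop[f][m] lists the
    (frame, mb) pairs that macroblock m of frame f predicts from.
    fg_mbs: set of spatial MB indices covered by the foreground window.
    Returns (c_mbs, d_mbs) as sets of (frame, mb) pairs."""
    c_mbs, d_mbs = set(), set()
    # Walk the GOP backward so d-MB marks propagate toward the I frame.
    for f in range(len(gop) - 1, -1, -1):
        for m, refs in gop[f].items():
            if any(rm in fg_mbs for (_, rm) in refs):
                # M predicts from the foreground area: wrong reference.
                c_mbs.add((f, m))
                d_mbs.update(refs)
            if (f, m) in d_mbs:
                # M's RD data is needed, so its own references are too.
                d_mbs.update(refs)
    return c_mbs, d_mbs
```

Running this on the MB0 → MB1 → MB2 chain of Figure 2 (MB1 inside the foreground window) marks MB2 as a c-MB and both MB1 and MB0 as d-MBs, while untouched macroblocks stay unmarked.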
4. IMPLEMENTATION AND EXPERIMENTAL RESULTS

We have implemented the first version of the realtime VIE gateway, with functions such as video Picture-in-Picture and image embedding. Since we have only changed the timing of the operations, the resulting visual quality is expected to be the same as that of previous solutions. We therefore analyze the performance of the proposed VIE approach in the following two aspects:
Stream              Football    Stars      Trees
Previous Approach   1632000     1632000    1632000
Our Approach         139455      55509      56613

Table 1: Comparison of the number of macroblocks that need motion compensation

• Realtime Processing: The VIE service gateway runs on a general-purpose PC with a single Pentium IV 1.4GHz processor and 512MB of memory. As expected, we can perform many VIE functions such as PiP in realtime, introducing an extra delay of at most 0.5 seconds. For this experiment, we chose as the background stream an HD quality (1920x1088) stream, "Trees1.mpg", and as the foreground stream a standard resolution (480x256) MPEG video, "football sd.mpg". The foreground window is located at the top-left corner of the background frame.

• Computational Complexity Reduction: As analyzed above, and as pointed out in [4], the most expensive step of the [4] approach is the decoding of reference macroblocks from the MC domain to the RD domain, and our approach attacks exactly this cost, based on the observation that normally a great portion of these macroblocks are never used. To demonstrate this, we compare the number of macroblocks decoded to the RD domain by the two approaches. In the [4] approach, all macroblocks in all reference frames must be counted, so for the "football in trees" example the total number of macroblocks decoded to the RD domain over 500 frames is (1920*1088)/(16*16) * (6/15) * 500 = 1,632,000. For our approach, on the other hand, the total count depends on the nature of the background stream, since we only work on affected macroblocks. For 3 different background streams, the total number of macroblocks reconstructed through motion compensation is given in Table 1. From the table we see that with our approach the most expensive operation, motion compensation, needs to be done for less than 10% of the macroblocks of the previous approach, and for background streams such as Stars and Trees this percentage is even smaller.
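The complexity comparison above reduces to a few lines of arithmetic; the following snippet simply restates the counts from Table 1 (it is not measurement code) to show how the 1,632,000 baseline and the sub-10% ratios are obtained.

```python
# Reproduce the macroblock arithmetic behind Table 1.
# In the [4] approach, every macroblock of the reference frames
# (6 of every 15 frames in this GOP structure) is decoded to RD domain.
mbs_per_frame = (1920 * 1088) // (16 * 16)   # 8160 MBs per HD frame
frames = 500
previous = mbs_per_frame * 6 * frames // 15  # macroblocks decoded by [4]

# Our approach only motion-compensates the measured c-MBs and d-MBs
# (counts taken from Table 1 for the three background streams).
ours = {"Football": 139455, "Stars": 55509, "Trees": 56613}
for name, count in ours.items():
    print(f"{name}: {100.0 * count / previous:.1f}% of previous approach")
```

This yields 8.5% for Football and roughly 3.5% for Stars and Trees, consistent with the "less than 10%" claim.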
To further explain the saving, we have plotted the distribution of c-MBs and d-MBs over the whole background area, as shown in Figures 5 and 6. As expected, c-MBs appear only around the foreground area, and the number of d-MBs also decreases dramatically with distance from the foreground area.

Figure 5: Distribution of c-MBs

Figure 6: Distribution of d-MBs

5. CONCLUSION

In this paper, we have presented a new cost-effective and flexibly controllable approach to VIE in the MPEG compressed domain. Compared with previous work, our approach greatly reduces computational complexity by eliminating unnecessary motion compensation operations, at the cost of a small amount of delay. Specifically, a backtracking operation determines the set of macroblocks whose image values are needed for future reference and the set whose prediction references are wrong. This way, only the minimum amount of work necessary for a correct embedding result is done, and the saving in the motion compensation phase can be over 90%. We have implemented a first version of the realtime VIE gateway, which can embed video and images in realtime even into high definition MPEG streams. In future work, we will further optimize the VIE process for realtime interactivity and explore how to apply it to advanced distributed TV applications with higher customization and interactivity.

6. REFERENCES

[1] ISO/IEC International Standard 13818, "Generic coding of moving pictures and associated audio information," 1994.

[2] C. Bezzan, "High definition TV: its history and perspective," SBT/IEEE International Telecommunications Symposium (ITS '90), Symposium Record, 1990.

[3] D. Thompson, "IEEE 1394: changing the way we do multimedia communications," IEEE Multimedia, Vol. 7, No. 2, April-June 2000, pp. 94-100.

[4] Y. Noguchi, D. G. Messerschmitt and S.-F. Chang, "MPEG video compositing in the compressed domain," 1996 IEEE International Symposium on Circuits and Systems (ISCAS '96), Vol. 2, 1996.

[5] J. Meng and S.-F. Chang, "Embedding visible watermark in compressed video stream," Proceedings, 1998 International Conference on Image Processing (ICIP '98), Chicago, Illinois, Oct. 1998.

[6] J. Nang, O. Kwon and S. Hong, "Caption processing for MPEG video in MC-DCT compressed domain," Proceedings of the 8th ACM International Conference on Multimedia, Oct. 2000.