
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 16, NO. 11, NOVEMBER 2006

MPEG-2 to WMV Transcoder With Adaptive Error Compensation and Dynamic Switches Guobin (Jacky) Shen, Yuwen He, Wanyong Cao, and Shipeng Li

Abstract—In this paper, we study the problem of video transcoding from the MPEG-2 format to the Windows Media Video (WMV) format, together with several desired functionalities such as bit-rate reduction and spatial resolution downscaling. Based on an in-depth analysis of the error propagation behavior, we propose two architectures (for different typical application scenarios) that are unique in their complexity scalability and adaptive drifting error control, which in turn provide a mechanism to achieve a desired tradeoff between complexity and quality. We perform extensive experiments for various design targets such as complexity, scalability, performance tradeoff, and drifting control effect. The proposed transcoding architectures can be straightforwardly applied to MPEG-2 to MPEG-4 transcoding due to the significant overlap between the MPEG-4 and WMV coding technologies.

Index Terms—Adaptive drifting error control, complexity scalability, error accumulation, error compensation, MPEG-2 to MPEG-4 transcoding, MPEG-2 to Windows Media Video (WMV) transcoding, transcoding.

I. INTRODUCTION

Transcoding refers to the general process of converting one compressed bit stream into another. A transcoding process is usually triggered by a need to change coding parameters such as the bit rate (including variable bit rate to constant bit rate), the frame rate, the spatial resolution, or a combination of these. It may also be required to convert a bit stream from one coding format to another, such as from MPEG-2 to MPEG-4 or H.264 [1], [2], or even to a scalable format [3]–[5]. Transcoding is also an effective way to achieve specific functionalities such as logo insertion, VCR-like functionality for streaming, or enhanced error resilience for transmission over wireless channels [6]–[8]. The requirement of universal access to multimedia content keeps growing along with increasingly diverse access methods and heterogeneous terminal devices. As a result, there is an urgent need for transcoders that convert video content from one format to another, together with parameter changes, to better serve different purposes. We are particularly interested

Manuscript received October 3, 2005; revised April 25, 2006. This paper was recommended by Associate Editor H. Sun. G. Shen, W. Cao, and S. Li are with Microsoft Research Asia, Beijing 100088, China (e-mail: [email protected]; [email protected]; [email protected]). Y. He is with the Panasonic Research Laboratory, 440001 Singapore (e-mail: [email protected]). Color versions of Figs. 1–8 and 17 are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSVT.2006.884008

in transcoding from the MPEG-2 format to the Windows Media Video (WMV) format [9], given the dominant position of MPEG-2 in the content space and that of WMV in the streaming world. Since WMV was a proprietary codec until it was recently submitted to SMPTE for standardization (i.e., it is accessible to SMPTE members but not to common users) and there is little literature about it, we first introduce it briefly and refer readers to [9] for details. We will use WMV and VC-1 interchangeably hereafter, since WMV is often referred to by its SMPTE codename VC-1. WMV, like its predecessors MPEG-1/2/4, uses a block-based predictive and transform coding framework. For instance, it exploits temporal redundancy via block-based translational motion-compensated prediction and spatial redundancy via a block-based spatial transform. However, WMV is superior to its predecessors in the following ways: 1) the syntax of WMV is much more flexible,1 which implies a significantly larger optimization space, and 2) many advanced coding techniques, such as adaptive block-size transforms, an integer DCT-akin transform, quarter-pel motion precision with bicubic interpolation filtering, advanced entropy coding, the overlapped transform, and fading compensation, are adopted by WMV. As a result, the coding efficiency (and also the visual quality) of WMV is significantly improved and is comparable to that of H.264, at a much lower complexity. Returning to transcoding, comprehensive overviews of transcoding technologies and their various applications are presented in [10] and [11]. In general, a transcoding algorithm needs to effectively utilize the side information carried in the input bit stream to scale back the computation or to improve the coding efficiency in the (partial) reencoding stage. The majority of transcoding research focuses on complexity reduction while trying to minimize the quality loss.
In this paper, we aim to design a transcoder that provides the means for an application to achieve desired tradeoffs between speed and quality, in addition to other typical transcoding requirements such as bit-rate reduction and/or spatial resolution reduction (we focus on the more typical 2:1 downscaling application). The main contribution of this paper is the proposal of several architectures that are unique in their complexity scalability and efficient control over the drifting error, which in turn provides a flexible mechanism to achieve desired tradeoffs between complexity and quality using dynamic switches. The minor contributions include: 1) an in-depth analysis of the error propagation behavior in transcoding, which forms the theoretical foundation of the proposed architectures, and 2) the merging of the standard DCT/IDCT and the integer DCT-akin transform in transcoding, together with a study of its performance, which is informative for transcoding to H.264, where only integer transforms are supported.

1The syntax of WMV is essentially a superset of that of MPEG-4; therefore, this work can also be applied to MPEG-2 to MPEG-4 transcoding applications. We will point out the differences in the paper.

Fig. 1. Simplified closed-loop transcoder.

We would point out that there are many ways to improve transcoding speed, from using a more powerful computer, to parallel and distributed computing, to using hardware accelerators. However, we focus purely on algorithmic simplification, which can be combined with other accelerating methods in a straightforward way. The remainder of this paper is organized as follows. In Section II, we give a short literature survey, with emphasis on works that are closely related to this one. We present the MPEG-2 to WMV transcoders with adaptive error compensation and dynamic switches for different application scenarios in Section III. Extensive experimental results are presented in Section IV. Finally, in Section V, conclusions are drawn and several findings are discussed.

II. RELATED WORK

A. Cascaded Transcoders and Their Simplifications

As mentioned in many other transcoding papers, the most straightforward transcoding technique, referred to as the cascaded pixel-domain transcoder (CPDT), is to cascade a front-end decoder that decodes the input bit stream with an encoder that generates a new bit stream with a different coding parameter set or in a new format. The pro of CPDT is its extreme flexibility: any desired functionality can be achieved. The con, however, is equally obvious: the complexity is often a prohibitive obstacle to practical deployment. As a result, this scheme is more often used as a performance benchmark for the improved schemes. An immediate improvement over CPDT is to exploit the motion information from the incoming bit stream in the encoder, typically through a motion vector (MV) filtering process, since motion estimation is generally known as the most computationally expensive part.
Unfortunately, even if the motion information is reused and only a motion vector refinement is performed, or the motion estimation process is left out completely, this scheme is still computationally expensive. Based on the assumption that the DCT and IDCT are linear operations, the CPDT scheme can be simplified to the so-called cascaded DCT-domain transcoder (CDDT), with the functionality set limited to spatial/temporal resolution downscaling and coding parameter changes [12]–[14]. Compared with CPDT, the DCT/IDCT processes are eliminated in CDDT. However, the CDDT scheme needs to perform MC in the DCT domain, which is usually cumbersome because the DCT blocks are generally not aligned with the MC blocks. As a result, complex floating-point matrix operations have to be applied, even though fast algorithms have been shown to exist. It is also generally infeasible to perform MV refinement in this scheme. By compromising the motion compensation accuracy, i.e., by assuming that the motion compensation process is also a linear operation, the CDDT scheme can be further simplified by merging the DCT-domain motion compensation modules. As a result, the DCT-domain MC modules and frame buffers are reduced by half. However, the requirement of performing MC in the DCT domain still remains. To work around it, one can break the DCT-domain MC down into an IDCT-MC-DCT process, as shown in the shaded block in Fig. 1. In the figure, a feedback loop exists that compensates for both the requantization error and the motion error. This architecture is termed the simplified closed-loop transcoder in this paper. In some situations in which extremely fast transcoding is desired, the simplified closed-loop transcoder can be further reduced to an open-loop architecture by removing the whole feedback loop. As expected, the high speed is achieved at a significant penalty in quality.

B. Complexity Scalable Transcoder (CST)

Comparing the closed-loop and open-loop transcoders, it is both possible and desirable to provide a mechanism by which a user-controllable tradeoff between complexity and coding efficiency can be achieved. In [15], the author performed an interesting study by designing a complexity scalable MPEG-2 transcoder (referred to as CST hereafter) for bit-rate reduction with graceful quality degradation. The resulting transcoding architecture is depicted in Fig. 2.
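The contrast between the open-loop and closed-loop behaviors discussed above can be illustrated with a toy one-dimensional transrating model. The scalar quantizer steps, the identity motion model, and all numbers below are illustrative assumptions for this sketch, not values from the paper.

```python
# Toy 1-D model: requantize a sequence of predictively coded residues
# with and without the feedback loop of the simplified closed-loop
# transcoder. Motion compensation is taken as the identity.

def quant(x, step):
    return round(x / step)

def dequant(q, step):
    return q * step

def transrate(residues, step_in, step_out, closed_loop):
    """With closed_loop=True, the requantization error of each frame
    is fed back and added to the next residue before requantization;
    with False we get the open-loop architecture."""
    acc = 0.0                                      # accumulated error
    out = []
    for r in residues:
        x = dequant(quant(r, step_in), step_in)    # decoded residue
        t = x + (acc if closed_loop else 0.0)      # compensate error
        q = quant(t, step_out)
        out.append(q)
        if closed_loop:
            acc = t - dequant(q, step_out)
    return out

def decoder_drift(residues, coded, step_in, step_out):
    """Mismatch between the ideal decoded signal and what a decoder
    reconstructs from the transrated stream; per-frame errors add up
    because each frame is predicted from the previous one."""
    ideal = sum(dequant(quant(r, step_in), step_in) for r in residues)
    recon = sum(dequant(q, step_out) for q in coded)
    return ideal - recon
```

In the open-loop case the per-frame requantization error is a biased random walk that grows with the GOP length, whereas the feedback loop keeps the cumulative mismatch bounded by half the output quantizer step.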
Obviously, the complexity scalability is provided by the two switches, namely the motion compensation selector and the reconstruction selector. The author listed all four possible combinations of the two selectors, which lead to four different schemes with different complexity offerings. The two selectors are applied at the macroblock (MB) level and are therefore able to provide fine-grained complexity control. The author also suggests controlling the complexity of the scalable transcoding scheme by a single parameter, namely the number of reconstructed MBs. Due to the predictive nature of the coding algorithm, this parameter should decrease across a group of pictures (GOP), where the rate of decrease has to be estimated empirically. Under the constraint on the target number of reconstructed MBs, strategies for the motion compensation selector and the reconstruction selector were presented. More specifically, the ON/OFF status of the reconstruction selector is determined according to the total energy of the requantization error, and the decision of the motion compensation selector is made according to the temporal prediction of the requantization errors from the reconstruction of the previous picture. In [16], we further improved the scheme by adopting a fast requantization process based on a lookup-table (LUT) technique and by providing a finer drifting control mechanism via a triple-threshold algorithm.

Fig. 2. CST (transrater).

C. Spatial Resolution Downscaling Transcoder

The main purpose of a transcoder is to provide bridges for universal media access. Because of the diversity of network access methods and of the capabilities of end devices, research on transcoders with spatial resolution reduction and/or temporal reduction has been quite active. In this paper, we focus on 2:1 downscaling, i.e., downscaling the original frame by a factor of two in both the horizontal and the vertical directions, since it is the most typical application scenario in which spatial resolution downscaling is desired. There is also considerable work on arbitrary-ratio downscaling transcoding [17]–[20]. Work in this direction generally covers three aspects, namely, DCT-domain downscaling, mode and motion vector composition, and drifting error control.

1) DCT-Domain Downscaling: In the CPDT architecture, the downscaling process is performed in the spatial domain via a low-pass filtering and decimation process.
However, it was shown in [21] that the low-pass filtering and decimation can be combined and performed directly in the DCT domain, which leads to large computation savings, because the spatially downscaled signal need not go through the same DCT process again, as it would have to in the CPDT scheme. Specifically, one can make use of the fact that directly performing a 4×4 inverse DCT on the 4×4 low-frequency coefficients of an 8×8 block gives a low-pass filtered and half-decimated version of the original spatial 8×8 block. It was also advocated that, using DCT-domain downscaling, both the PSNR and the visual quality are better, besides the complexity reduction. In [22], the authors showed that DCT-domain 2:1 downscaling is better than that obtained via spatial-domain bilinear filtering and even via seven-tap filtering. DCT-domain downscaling is widely used in applications with spatial resolution reduction, especially in 2:1 downscaling cases [14], where the standard DCT/IDCT is used. We also make use of this technique in this work, and we further extend the conclusion that the technique is applicable to other DCT-like transforms, such as the integer transforms in WMV and H.264.

2) Motion Vector Composition: Many works have been devoted to the efficient and effective reuse of motion information under changes of coding parameters such as spatial resolution and/or frame rate [17]–[19], [23]–[28]. The common idea is to derive the new candidate MV from the four MVs associated with the four corresponding MBs at the original resolution. Different weighting methods have been proposed, such as overlapping-area weighting [17], [26], align-to-average weighting [28], align-to-best weighting [27], and align-to-worst weighting [28]. Note that MV filtering is not always necessary if the target format supports finer MV coding modes. For example, when transcoding from MPEG-2 to MPEG-4, one may turn on the four-MV mode in MPEG-4 and simply scale the MVs from the MPEG-2 bit stream down by half, associating the resulting MVs with the four 8×8 blocks of the MB at the reduced resolution. Of course, one may still perform MV filtering so as to code the MB in the one-MV mode instead of the four-MV mode, for coding-efficiency optimization. It is also possible to perform a refining MV search using the composed MV as a seed to further improve the coding efficiency. In several references it was advocated that a search range as large as three half-pixel units is good enough for MV refinement [27], [29], [30].
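The two MV-reuse options above can be sketched as follows. The align-to-average weighting and the half-pel rounding convention are illustrative choices for this sketch; the function names are assumptions, not from the paper.

```python
# MV reuse for 2:1 downscaling. MVs are (dx, dy) in half-pel units.

def compose_mv(mvs):
    """Align-to-average composition: average the four MVs of the
    co-located full-resolution MBs and halve for the 2:1 downscale,
    i.e., divide the sums by 4 * 2 = 8, rounding to half-pel units."""
    assert len(mvs) == 4
    sx = sum(dx for dx, dy in mvs)
    sy = sum(dy for dx, dy in mvs)
    return (int(round(sx / 8.0)), int(round(sy / 8.0)))

def scale_mv_four_mv_mode(mvs):
    """When the target format supports a four-MV mode, each original
    MV is simply halved and kept for its corresponding 8x8 block."""
    return [(int(round(dx / 2.0)), int(round(dy / 2.0)))
            for dx, dy in mvs]
```

The composed MV can then serve as the seed of a small refinement search (e.g., a few half-pel positions), as advocated in [27], [29], [30].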
A predictive motion estimation (PME) method was proposed in [31], which evaluates the predicted MV and the four original MVs and selects the one leading to the minimal distortion.

3) Mode Composition: Because of syntax constraints, we may have to compose a new coding mode for the MB at the reduced resolution from the corresponding coding modes of the four original MBs. If the original four MBs have the same coding mode, the downscaled MB can simply take that coding mode. Unfortunately, as is often the case, the four original MBs have different coding modes, and it is challenging to decide on a good coding mode. This process is called mixed-block processing. The typical mode-decision strategy is majority-based decision; however, various other weighted decision logics are possible. Basically, there are two possibilities for mixed-block processing: one is to modify an Intra mode to an Inter mode (called Intra-to-Inter) and the other is to modify an Inter mode to an Intra mode (called Inter-to-Intra) [32]. It should be noted that, to implement these mode modification mechanisms (except zero-out, the simplest case of Intra-to-Inter modification, which simply forces the Intra block to be skipped), a decoding loop that reconstructs the full-resolution picture is required. The reconstructed data are used as a reference to convert the DCT coefficients from Intra to Inter or from Inter to Intra. Attention should be paid to the fact that a mode change may affect the MV prediction of neighboring MBs. The performance of Inter-to-Intra is a little better than that of Intra-to-Inter, because converting Inter blocks to Intra blocks effectively stops the drifting error.

4) Drifting Error Control: In [32], a detailed error analysis was conducted using the CPDT scheme as a reference. It was concluded that the drifting error consists of two parts: one part is introduced by the requantization process, and the other part is due to the noncommutativity of the motion compensation and downscaling processes, which is unique to resolution-reduction transcoding. After a thorough analysis, two different drifting error control mechanisms were proposed, namely, drifting compensation and drifting avoidance. Drifting compensation is generally achieved via a motion compensation process on the reduced-resolution pictures, or alternatively at the original resolution, while drifting avoidance is achieved using an Intra-refresh technique.

III. PROPOSED MPEG-2 TO WMV TRANSCODER: AEC-DST

Fig. 3. Block diagram of the cascaded pixel-domain transcoder for MPEG-2 to WMV transcoding.

A.
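The majority-based mode decision for mixed blocks described above can be sketched as follows. The Intra-preferring tie-break is an illustrative choice (motivated by Inter-to-Intra stopping drift propagation), not a rule from the paper.

```python
# Majority-vote composition of the downscaled MB coding mode from
# the four co-located full-resolution MB modes.
from collections import Counter

def compose_mode(modes):
    """modes: coding modes of the four original MBs, e.g.
    ['INTER', 'INTRA', 'INTER', 'SKIP']. Returns the mode chosen
    for the single MB at the reduced resolution."""
    assert len(modes) == 4
    counts = Counter(modes)
    top, n = counts.most_common(1)[0]
    # Unanimous case: simply inherit the common mode.
    if n == 4:
        return top
    # Mixed block: majority vote; on a tie prefer INTRA, since
    # Inter-to-Intra conversion stops drifting error propagation.
    tied = [m for m, c in counts.items() if c == n]
    return 'INTRA' if 'INTRA' in tied else top
```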
Reference Cascaded Pixel-Domain Transcoder

We first present in Fig. 3 the cascaded pixel-domain MPEG-2 to WMV transcoder (M2W-CPDT), which will serve as the reference for the derivation of the proposed transcoders. For ease of exposition, we embed some symbols in the figure; their meanings are listed in Table I and indicated below.

TABLE I
DEFINITION OF SYMBOLS IN FIG. 3 AND ALL THE DERIVATIONS. AS A GENERAL RULE, ^ MEANS A SIGNAL IN THE DCT DOMAIN AND ~ MEANS A SIGNAL IN THE VC1-T DOMAIN. SUBSCRIPTS i AND i+1 INDICATE THE iTH AND (i+1)TH FRAMES, RESPECTIVELY

Throughout the paper, the subscripts 1 and 2 indicate operations or parameters in the MPEG-2 decoding stage and the WMV encoding stage, respectively. We use $v$ and $v'$ to represent motion vectors at the resolution of the incoming video and of the outgoing video, respectively. For example, $v$ denotes a motion vector at the original resolution in the MPEG-2 decoder.

B. Bit-Rate Reduction Without Resolution Change

1) Merge the Motion Compensation: Since there is no resolution change, the filtering process bridging the MPEG-2 decoder and the VC-1 encoder is not in effect, i.e., it is an all-pass filter. Therefore, the input to the VC-1 encoder for frame $i+1$ can be expressed as

$$e_{i+1} = r_{i+1} + \mathrm{MC}_1(x_i, v) - \mathrm{MC}_2(y_i, v) \qquad (1)$$

where $r_{i+1}$ denotes the decoded MPEG-2 residue, $x_i$ the reference frame reconstructed by the MPEG-2 decoder, $y_i$ the reference frame reconstructed in the VC-1 encoding loop, $v$ the motion vector, and $\mathrm{MC}_1(\cdot)$ and $\mathrm{MC}_2(\cdot)$ motion compensation in the MPEG-2 decoder and the VC-1 encoder, respectively.

Fig. 4. Simplified closed-loop cascaded pixel-domain transcoder for MPEG-2 to WMV transcoding, with only motion compensation modules being merged.

WMV provides higher coding efficiency than MPEG-2, partly due to its support of finer motion precision. Besides bilinear interpolation, WMV also supports a better but more complex bicubic interpolation for MC filtering; its bilinear interpolation is slightly different from the one used in MPEG-2, the difference being that WMV adopts a rounding control mechanism while MPEG-2
does not. To achieve high speed, we stick to half-pixel motion accuracy in the VC-1 encoder, for two reasons. On the one hand, because the absolute original frame is unavailable (i.e., the input is an already distorted signal), it is difficult to obtain a more accurate yet meaningful motion vector;2 on the other hand, half-pixel accuracy makes it possible to merge the motion compensation process in the MPEG-2 decoder with the one in the VC-1 encoder. In other words, the VC-1 encoder can directly reuse the motion information obtained from the MPEG-2 decoder. If we further restrict the VC-1 encoder to use bilinear interpolation and force the rounding control parameter in the VC-1 encoder to be always off, then under the reasonable assumption that motion compensation is a linear operation, and ignoring the rounding error (i.e., assuming $\mathrm{MC}_1 = \mathrm{MC}_2 = \mathrm{MC}$), (1) can be simplified to

$$e_{i+1} = r_{i+1} + \mathrm{MC}(x_i - y_i, v) \qquad (2)$$

where $x_i - y_i$ is the accumulated requantization error. According to (2), the reference CPDT scheme in Fig. 3 can be simplified to the architecture shown in Fig. 4. Clearly, the new architecture leads to a significant complexity reduction. However, it is still somewhat more complex than the one shown in Fig. 1, because the IDCT in the MPEG-2 decoder and the transform in the WMV encoder cannot cancel each other out.

2) Merge the Transform: In MPEG-2, the standard floating-point DCT/IDCT is used, whereas WMV adopts an integer transform3 (referred to as VC1-T hereafter) whose energy-packing property is akin to that of the DCT. The integer transform in WMV is carefully designed so that all of the transform coefficients are small integers; it can be computed with 16-bit accuracy and is thus very friendly to MMX implementation. The transform matrices for the 8×8 and 4×4 transforms are provided in [9]. Below, we show that the architecture in Fig. 4 can be simplified still further. Let $D$ be the standard 8×8 DCT transform matrix; $W$, the 8×8 VC1-T transform matrix; $\hat{B}$, the inverse-quantized MPEG-2 DCT block; $b$, the IDCT of $\hat{B}$; and $\tilde{B}$, the VC1-T of $b$. Then $\tilde{B}$ can be directly computed from $\hat{B}$ as

$$\tilde{B} = (W D^T \hat{B} D W^T) \circ N \qquad (3)$$

where $\circ$ denotes element-wise multiplication of two matrices, and $N$ is the normalization matrix for the VC1-T transform, with $N_{kl} = 1/(\|w_k\| \, \|w_l\|)$, $w_k$ being the $k$th row of $W$. It is easy to verify that $W D^T$ and $D W^T$ are very close to diagonal matrices. If we apply this approximation, then (3) turns out to be an element-wise scaling of the matrix $\hat{B}$, that is,

$$\tilde{B} \approx \hat{B} \circ S \qquad (4)$$

where $S$ can be precomputed as $S_{kl} = (W D^T)_{kk} \, (D W^T)_{ll} \, N_{kl}$.

2Generally speaking, motion refinement is meaningful only when the input is of very high quality.
3Note that the integer transform is different from an integer implementation of the standard DCT/IDCT.
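The near-diagonality of the product of the VC1-T and inverse-DCT matrices can be checked numerically. The sketch below assumes the commonly published VC-1 8×8 transform matrix (see [9] for the normative definition) and the orthonormal DCT-II basis.

```python
# Check that Wn * D^T is close to the identity, where D is the
# orthonormal 8x8 DCT matrix and Wn is the row-normalized 8x8
# VC1-T matrix. Entries of W follow the commonly published VC-1
# transform; consult [9] for the normative values.
import math

W = [
    [12,  12,  12,  12,  12,  12,  12,  12],
    [16,  15,   9,   4,  -4,  -9, -15, -16],
    [16,   6,  -6, -16, -16,  -6,   6,  16],
    [15,  -4, -16,  -9,   9,  16,   4, -15],
    [12, -12, -12,  12,  12, -12, -12,  12],
    [ 9, -16,   4,  15, -15,  -4,  16,  -9],
    [ 6, -16,  16,  -6,  -6,  16, -16,   6],
    [ 4,  -9,  15, -16,  16, -15,   9,  -4],
]

# Orthonormal DCT-II basis, row k: c_k * cos((2n+1) k pi / 16).
D = [[(math.sqrt(1 / 8) if k == 0 else math.sqrt(2 / 8))
      * math.cos((2 * n + 1) * k * math.pi / 16)
      for n in range(8)] for k in range(8)]

# Normalize each VC1-T row to unit energy, then form P = Wn * D^T.
Wn = [[v / math.sqrt(sum(x * x for x in row)) for v in row] for row in W]
P = [[sum(Wn[i][n] * D[j][n] for n in range(8)) for j in range(8)]
     for i in range(8)]

diag_min = min(P[i][i] for i in range(8))
off_max = max(abs(P[i][j]) for i in range(8) for j in range(8) if i != j)
print(diag_min, off_max)  # diagonal entries near 1, off-diagonal small
```

With these matrices the diagonal entries stay above 0.99 while the off-diagonal magnitudes remain a few percent, which is what justifies collapsing (3) into the element-wise scaling of (4).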
Equation (4) implies that the VC1-T in the WMV encoder and the IDCT in the MPEG-2 decoder can indeed be merged. Consequently, the architecture in Fig. 4 can be further simplified to the one shown in Fig. 5. A detailed comparison reveals that the IDCT modules are replaced by a simple scaling module and that the clipping process is saved as well. Since the 16-bit arithmetic property of VC1-T is friendly to parallel processing, the computation is significantly reduced. Moreover, since all of the elements of the scaling matrix $S$ are very close to each other, $S$ can be replaced by a scalar multiplication in practice. We will study the impact of this approximation on the transcoding quality in our experiments.

Fig. 5. Simplified closed-loop cascaded pixel-domain transcoder for MPEG-2 to WMV transcoding, with both motion compensation and transform modules merged.

Fig. 6. Proposed AEC-DST for MPEG-2 to WMV transcoding.

3) Complexity Scalability With Dynamic Switches: Note that the architecture shown in Fig. 5 is a closed-loop transcoder because a feedback loop is involved. The whole purpose of the feedback loop (including inverse VC-1 quantization, inverse VC1-T, residue error accumulation, and motion compensation on the accumulated error) is to compensate for the error caused by the VC-1 requantization process. The requantization error is the main cause of the drifting error in a bit-rate-reduction transcoder. Still, due to
the assumptions leading to the simplification, the architecture is not completely drift-free even with residue error compensation. However, the drifting error is very small, since the only remaining cause is the rounding error during MC filtering. One merit of residue error compensation is that we can dynamically turn the compensation process on or off, as will be elaborated. In order to further accelerate the transcoding speed after this significant near-drift-free simplification, some tradeoff between complexity and quality has to be made. In other words, we may allow some drifting error in pursuit of even faster transcoding speed. With this strategy, the most important requirement becomes that the drifting error introduced by the faster method be fully controllable, so that it can be limited to a certain level. Based on this consideration, three switches are added to the architecture shown in Fig. 5. This leads to the proposed adaptive error compensation and dynamic switch transcoder (AEC-DST), which is shown in Fig. 6. We want to emphasize that such switches can be added only to architectures with residue-error compensation. It is clear from the positions of the switches that their main purpose is to selectively skip some time-consuming operations such that the complexity is significantly reduced while only a little error is introduced. Another key point is that the decisions for these switches can be made in a computationally efficient way. The meanings of the various switches are summarized in Table II; more specific descriptions are given here.

TABLE II
FUNCTIONALITIES OF VARIOUS SWITCHES IN AEC-DST. REFER TO FIGS. 6–8 FOR THEIR POSITIONS

1) The first switch controls whether or not to accumulate the requantization error into the residue-error buffer. Its role is the same as that of the reconstruction selector in CST; as a result, all of the observations made there can be taken into consideration. For example, we can use the DCT-domain energy difference as the indicator.
2) The second switch controls whether to perform error accumulation or error update. We create a binary activity mask for the reference frame, which can be obtained extremely easily during the error accumulation process. Each element of the activity mask corresponds to the activeness of an 8×8 block of pixels and is determined by the energy of that block in the accumulated residue-error buffer. We perform residue error accumulation if the MV points to a low-activity area, or residue error update, by adding the accumulated residue error to the incoming residue signal, if the MV points to a high-activity area.
3) The third switch controls whether or not to perform motion compensation on the accumulated residue error, which is the most time-consuming module. This switch is similar to the motion compensation selector in CST. If the MV points to a low-activity area, then MC on the accumulated residue error for that specific block can be skipped. Note that this switch is optional and is typically turned on, unless we are willing to trade off more quality for speed, and that when it is in effect it nullifies the second switch.

Before proceeding further, we want to point out that, despite the similarity between the proposed AEC-DST scheme and the CST scheme in [15], the two schemes differ substantially in that AEC-DST offers an error propagation control mechanism while CST does not. If we always set the second switch to the update position, the two schemes are equivalent except that different transforms are handled. However, when the second switch is set to the accumulation position, we can effectively accumulate the errors that are not large enough to trigger the update process. Moreover, in CST, once an error is introduced by turning off any switch, it will propagate until stopped by the next Intra frame, while in AEC-DST the error propagation is automatically stopped once it exceeds a certain threshold. We provide a detailed analysis of this in the Appendix. In practice, we found that such error accumulation contributes critically to the overall visual quality.
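The per-block switch decisions described above can be sketched as follows. The energy thresholds T_ACC and T_ACT, the function names, and the buffer layout are hypothetical tuning choices for this sketch, not values from the paper.

```python
# Illustrative per-block decision logic for the three AEC-DST
# switches, driven by block energies in the accumulated
# residue-error buffer.

T_ACC = 4.0    # minimum requant-error energy worth accumulating
T_ACT = 64.0   # accumulated energy that marks a block as "active"

def block_energy(block):
    return sum(v * v for row in block for v in row)

def s1_accumulate(requant_error_block):
    """First switch: skip bookkeeping when the requantization error
    of a block is negligible."""
    return block_energy(requant_error_block) > T_ACC

def activity_mask(error_buffer):
    """One activity bit per 8x8 block of the accumulated-error
    buffer, obtained as a by-product of error accumulation."""
    return [block_energy(b) > T_ACT for b in error_buffer]

def s2_update(mask, ref_block_idx):
    """Second switch: if the MV points into a high-activity area,
    fold the accumulated error into the incoming residue (update);
    otherwise keep accumulating."""
    return mask[ref_block_idx]

def s3_compensate(mask, ref_block_idx):
    """Third switch: motion compensation on the accumulated error
    can be skipped when the MV points into a low-activity area."""
    return mask[ref_block_idx]
```

Because every decision reduces to one energy comparison per block, the switch logic itself adds almost no complexity to the transcoding loop.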
Note that the purpose of the switches is to achieve a better tradeoff between quality and speed. Due to the predictive nature of video coding, it is generally beneficial to adjust the thresholds for these switches such that earlier reference frames are processed with higher quality at a relatively slower speed, as is also advocated in [15].

C. Bit-Rate Reduction With 2:1 Resolution Change

Generally, there are three sources of error in transcoding with spatial resolution downscaling.
1) Downscaling error: Since we intend to obtain a downscaled video, this kind of error is inevitable. The downscaling filter is a designer's choice, in which a tradeoff between visual quality and complexity can be made, especially when downscaling in the spatial domain.
2) Requantization error: As in the pure bit-rate-reduction transcoding process, this is the error due to requantization. However, the requantization error is considered the least noticeable of the three sources of error in spatial resolution downscaling transcoding. With proper residue error compensation and a higher bit rate, this error can be eliminated.
3) Motion error: An incorrect motion vector will lead to a wrong motion-compensated prediction. Worse, this error can only be compensated by redoing the motion compensation based on the new MVs and modes. Fortunately, this is not a problem when transcoding P-frames from MPEG-2 to WMV, since WMV supports the four-MV coding mode for P-frames. However, it is a problem when transcoding B-frames, for which WMV supports only the one-MV mode.4

We will address the last two sources of error one by one below and propose corresponding architectures to cope with them.
1) Requantization Error Compensation: From Fig. 3, we can derive the input to the VC-1 encoder for frame $i+1$ as follows:

$$e'_{i+1} = \mathrm{DS}\bigl(r_{i+1} + \mathrm{MC}(x_i, v)\bigr) - \mathrm{MC}'(y'_i, v') \qquad (5)$$

where $\mathrm{DS}(\cdot)$ denotes the 2:1 downscaling process, $r_{i+1}$ and $x_i$ are the decoded MPEG-2 residue and reference frame, $y'_i$ is the reference frame reconstructed in the VC-1 encoding loop at the reduced resolution, and $\mathrm{MC}(\cdot, v)$ and $\mathrm{MC}'(\cdot, v')$ denote motion compensation at the original and the reduced resolution, respectively. Assuming that the downscaling and motion compensation processes commute, we obtain

$$e'_{i+1} = \mathrm{DS}(r_{i+1}) + \mathrm{MC}'\bigl(\mathrm{DS}(x_i), v'\bigr) - \mathrm{MC}'(y'_i, v') \qquad (6)$$

If we further assume that motion compensation is a linear operation and reuse the motion vectors (i.e., $v' = v/2$, with proper rounding to half-pel precision), (5) can finally be simplified to

$$e'_{i+1} = \mathrm{DS}(r_{i+1}) + \mathrm{MC}'\bigl(\mathrm{DS}(x_i) - y'_i, v'\bigr) \qquad (7)$$

The first term in (7), $\mathrm{DS}(r_{i+1})$, refers to the downscaling process applied to the decoded MPEG-2 residue signal. We adopt DCT-domain downscaling. WMV allows a 4×4-sized transform for P-frames and B-frames; therefore, we can directly map an input MB at the original resolution to an 8×8 block of the new output MB at the reduced resolution, entirely in the DCT domain. In other words, the DCT-domain downscaling can be achieved extremely simply by retaining only the top-left 4×4 low-frequency DCT coefficients of an incoming block. As previously discussed for the merging of the DCT and VC1-T, we also need to scale each retained 4×4 subblock with a matrix $S_4$, which is given as

$$(S_4)_{kl} = (W_4 D_4^T)_{kk} \, (D_4 W_4^T)_{ll} \, (N_4)_{kl}$$

where $D_4$ and $W_4$ are the 4×4 transform matrices for the standard DCT and VC1-T, respectively, and $N_4$ is the normalization matrix of VC1-T, with $(N_4)_{kl} = 1/(\|w_k\| \, \|w_l\|)$ for the rows $w_k$ of $W_4$. However, WMV allows only the 8×8-sized transform for I-frames. Consequently, we need to convert the four 4×4 low-frequency DCT subblocks into one 8×8 VC1-T block. This is a well-studied topic [14], [22]; the only differences here are the replacement of the standard DCT with the VC1-T and the normalization of the final results with $N$.

4The B-frame syntax is a big difference between WMV and MPEG-4. MPEG-4 supports the four-MV mode, so this is not a problem when transcoding B-frames to MPEG-4.
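The core DCT-domain 2:1 downscaling step above, retaining the top-left 4×4 low-frequency coefficients and applying a 4×4 inverse transform, can be sketched as follows. The sketch uses orthonormal DCT matrices throughout (the extra element-wise scaling into the VC1-T domain is omitted); with that convention the 1-D amplitude factor is √2, hence the division by 2 in 2-D.

```python
# DCT-domain 2:1 downscaling: 8x8 pixel block -> 8x8 DCT -> keep the
# top-left 4x4 low-frequency coefficients -> 4x4 inverse DCT.
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    return [[(math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n))
             * math.cos((2 * i + 1) * k * math.pi / (2 * n))
             for i in range(n)] for k in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(r) for r in zip(*a)]

def downscale_8x8(block):
    """8x8 pixel block -> 4x4 low-pass, half-decimated block."""
    d8, d4 = dct_matrix(8), dct_matrix(4)
    coeffs = matmul(matmul(d8, block), transpose(d8))  # forward DCT
    low = [row[:4] for row in coeffs[:4]]              # keep 4x4 LF
    pix = matmul(matmul(transpose(d4), low), d4)       # 4x4 IDCT
    return [[v / 2 for v in row] for row in pix]       # renormalize
```

In the transcoder itself the forward 8×8 DCT is of course not recomputed; the 4×4 coefficients are taken directly from the dequantized MPEG-2 block, which is where the computation saving comes from.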


Fig. 7. Simplified DCT-domain 2:1 resolution downscaling transcoder.

The second term in (7) represents requantization error compensation at the downscaled resolution. Clearly, the MC in the MPEG-2 decoder and that in the WMV encoder are merged into a single MC process that operates on accumulated requantization errors at the reduced resolution. Comparing (7) with (2), we find immediately that they are almost the same, except that one operates at the reduced resolution while the other operates at the original resolution. Therefore, the proposed AEC-DST scheme can be applied here as well. We will refer to both schemes as AEC-DST. The simplified DCT-domain 2:1 resolution downscaling transcoder is shown in Fig. 7. The only difference between this transcoder and the one shown in Fig. 6 is the scaling module (which is denoted differently on purpose). The switches in these two figures have the same function. Note that, in Fig. 7, the first two modules (MPEG-2 VLD and inverse quantization) can be implemented more efficiently, since only the top-left 4×4 portion of each 8×8 block needs to be processed. An interesting observation is that the mixed-block processing module is avoided when transcoding MPEG-2 to WMV with 2:1 resolution downscaling, because WMV supports a mixed mode by allowing up to three of the constituent 8×8 blocks of an Inter-coded MB to be coded in Intra mode. In other words, we allow an Intra MB at the original resolution to be mapped into an Intra 8×8 block of an Inter MB at the reduced resolution. Recall that mixed-block processing requires a decoding loop to reconstruct the full-resolution picture. Therefore, the removal of the mixed-block processing module implies significant computation savings. The final mode mapping rule is

$$\text{Mode} = \begin{cases} \text{Intra}, & \text{if all four input MBs are Intra} \\ \text{Inter}, & \text{if all four input MBs are Inter} \\ \text{Mixed}, & \text{otherwise.} \end{cases}$$

Furthermore, the simplified DCT-domain 2:1 resolution downscaling transcoder is almost drift-free for P-frames, thanks to the four-MV mode support of WMV.
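As a concrete illustration, the mode mapping rule above can be written as a small Python helper. This is a sketch; the mode labels are illustrative and are not WMV syntax elements:

```python
def map_downscaled_mb_mode(input_modes):
    """Map the coding modes of the four co-located MBs at the
    original resolution to the mode of the single output MB at
    the reduced (2:1) resolution."""
    assert len(input_modes) == 4
    if all(m == "Intra" for m in input_modes):
        return "Intra"
    if all(m == "Inter" for m in input_modes):
        return "Inter"
    # WMV allows up to three of the constituent 8x8 blocks of an
    # Inter MB to be Intra-coded, so mixed input modes map to a
    # mixed output mode without a full-resolution decoding loop.
    return "Mixed"
```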
The only causes of the drifting error, as compared with the CPDT architecture with the same downscaling filtering, are the rounding of MVs from quarter-pel resolution to half-pel resolution and

the noncommutative property of MC and downscaling. Such remaining errors are negligible due to the low-pass downscaling filtering, be it achieved in the DCT domain or in the pixel domain. 2) Motion Error Compensation: Although WMV supports the four-MV coding mode, it is intended for P-frames only. As a result, the architecture shown in Fig. 7 is recommended only when there are no B-frames in the input MPEG-2 stream or when the B-frames are to be discarded during transcoding towards a lower temporal resolution. Due to the constraint that only one MV is allowed for B-frame MBs in WMV, we have to compose a new motion vector from the four MVs associated with the MBs at the original resolution. All of the previously mentioned MV composition methods can be applied here. In our implementation, we have chosen to use median filtering. As mentioned earlier, an incorrect MV will lead to wrong motion-compensated prediction. Even worse, no matter how the requantization error is compensated, and no matter how high the bit rate goes, one can never get perfect results without redoing the motion compensation based on the new MVs. Therefore, we have to come up with an architecture that allows such motion errors to be compensated. Under this circumstance, the earlier assumption no longer holds. However, we can manipulate (6) to obtain

(8) Clearly, the last two terms in the square brackets in (8) represent the compensation of the motion errors caused by inconsistent MVs or by different MC filtering methods between MPEG-2 and WMV. The corresponding modules for this purpose are highlighted and grouped into a shaded block in Fig. 8. Note that, in (8), the motion compensation is performed for all the 8×8 blocks that correspond to original Inter MBs, with quarter-pel precision. The MV used in the VC-1 encoder is a single MV, which is the median of the MVs of the four corresponding MBs at the original resolution, and its accuracy can go to the quarter-pel level as well.
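The median-based MV composition used for B-frames can be sketched as follows. This is illustrative; MVs are assumed to be (dx, dy) pairs already scaled to the reduced resolution:

```python
import numpy as np

def compose_b_frame_mv(mvs):
    """Compose the single MV required by WMV's one-MV B-frame
    mode from the four MVs of the co-located MBs at the original
    resolution, using a component-wise median."""
    mvs = np.asarray(mvs, dtype=float)
    assert mvs.shape == (4, 2)
    # median of four values: average of the two middle values
    return tuple(np.median(mvs, axis=0))
```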


Fig. 8. Simplified 2:1 downscaling transcoder with full drifting error compensation.

Furthermore, the MC for motion error compensation purposes operates on reconstructed pixel buffers, while the MC for requantization error compensation purposes operates on the accumulated residue error buffer. The second term in (8) compensates the requantization error of the reference frames. Since B-frames are not used as references for other frames, they are more error-tolerant. As a result, the error compensation can be safely turned off in most cases to achieve higher speed, but such approximation is intended for B-frames only. As to the mode composition, we can easily apply either Intra-to-Inter or Inter-to-Intra conversion, since we have reconstructed the B-frame and the reference frames at the MPEG-2 decoder part, both at the already reduced resolution. This conversion is done in the mixed-block processing module in Fig. 8. Two mode composition methods are possible: one is to select the dominant mode as the new mode; the other is to select the mode that will lead to the largest error. The latter leads to better quality than selecting the one with the smallest error, because it provides an opportunity to compensate for the large error. The idea is similar to the align-to-worst MV composition strategy in [27]. 3) Complexity Scalability With Dynamic Switches: The final architecture according to (8) is shown in Fig. 8. The resulting architecture is seemingly just as complex as the reference cascaded pixel-domain transcoder. Actually, it is not. The explicit pixel-domain downscaling process is avoided; instead, it is implicitly achieved in the DCT domain by simply discarding the high-frequency DCT coefficients. More importantly, the resulting architecture has excellent complexity scalability, which is achieved via various switches. We add four more frame-level switches specifically for the architecture shown in Fig. 8. The functionalities of these switches are also listed in Table II.
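The per-frame switch settings can be summarized with a small helper. This is our sketch of one plausible configuration; the boolean names are illustrative and are not the switch labels used in Table II:

```python
def frame_level_switches(frame_type, b_frames_in_output):
    """Frame-level switch settings for the simplified 2:1
    downscaling transcoder of Fig. 8 (illustrative sketch)."""
    assert frame_type in ("I", "P", "B")
    return {
        # error compensation is performed only for reference (I/P) frames
        "compensate_errors": frame_type in ("I", "P"),
        # no residue/motion error accumulation for B-frames
        "accumulate_errors": frame_type != "B",
        # reference frames need pixel-domain reconstruction only
        # when B-frames will actually be generated in the output
        "reconstruct_reference": b_frames_in_output and frame_type in ("I", "P"),
    }
```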
The four switches ensure different coding paths for different frame types. Specifically, no residue or motion error accumulation is performed for B-frames, error compensation is performed for I- and P-frames, and no reconstruction of reference frames is performed if there are no B-frames to be generated. We want to point out here that the frame-level switch can be applied at the block level as well, since the motion error needs to be compensated only when the corresponding four original MVs are significantly inconsistent. Finally, for applications that demand ultrafast transcoding speed, we can turn the architecture into an open-loop one by turning off all the switches. The open-loop architecture can be further optimized by merging the inverse quantization process of MPEG-2 and the requantization process of WMV. The inverse zigzag scan module (inside the VLD) of MPEG-2 can also be combined with the zigzag scan module in the WMV encoder with certain programming tricks. In short, thanks to the support of the four-MV and mixed coding modes for P-frames in WMV, both the requantization error and motion error compensation can be efficiently achieved and controlled by various switches towards complexity scalability. However, B-frames are constrained to the one-MV coding mode. As a result, motion error compensation has to be performed by full reconstruction of the input signal, but at a reduced resolution, i.e., through partial MPEG-2 decoding. Various frame-level switches are introduced for complexity reduction.

IV. PERFORMANCE ANALYSIS

We have performed extensive experiments to verify the effectiveness of the proposed transcoding architectures. We will report in this section the speedups and the corresponding quality losses for different scenarios, such as bit-rate reduction with or without spatial resolution downscaling, while our emphasis is on the complexity scalability and the adaptive error-accumulation-based drifting control mechanism.

A. Experimental Setup

The experimental platform is a Windows XP PC with a Pentium-IV 3-GHz CPU and 512-MB memory. Two test sequences were used. One is BestCap, whose resolution is 640×480. The bit rate for this experimental MPEG-2 input bit stream is 5.7 Mb/s and the PSNR is 44.42 dB. The other test sequence is SmallTrap, which is of standard definition


Fig. 9. Coding efficiency penalty due to the mergence of IDCT in the MPEG-2 decoder and the VC1-T in the WMV encoder.

(720×480). The bit rate for this MPEG-2 input bit stream is 5.2 Mb/s and the PSNR is 43.22 dB. Both sequences contain quite a few typical video scenes, such as slow motion, high motion, fading, low texture, and high texture.⁵ Both sequences are encoded with a GOP length of 15 and a fixed GOP pattern. Note that the PSNR for the transcoding schemes is calculated using the decoded sequence from the resulting transcoded WMV bit stream against the original (i.e., input to the MPEG-2 encoder) video sequence.

B. Effect of Transform Mergence

In the first set of experiments, we evaluate the impact of the mergence of the IDCT of the MPEG-2 decoder with the VC1-T of the WMV encoder on the quality degradation. As stated before, the merge of the transforms is an important step in obtaining the

⁵To respect the copyright, we created downscaled (2:1) and low-quality encoded versions (250 kb/s) of the sequences. They can be accessed at http://research.microsoft.com/~jackysh/private_sharing.htm


Fig. 10. Coding efficiency comparison of different transcoding schemes without spatial resolution reduction.

final significantly simplified architectures in both Figs. 6 and 7. We use the architectures in Figs. 4 and 5 for evaluation, since the only difference between them is the merge of the transforms. We know that, after the mergence of the transforms, we need a scaling module that performs a per-element scaling. We refer to this scaling as matrix scaling. As argued before, since the elements of the scaling matrix are very close to each other, the scaling can be further simplified to a scaling by a scalar. We refer to this as scalar scaling. Fig. 9 shows the coding efficiency of transcoding MPEG-2 bit streams to WMV ones with different target bit-rate settings for the SmallTrap and BestCap sequences. As can be seen from the figures, the performance loss due to the merge of transforms is very small and is negligible for matrix scaling. However, the scalar scaling scheme introduces a larger loss. Considering that scalar scaling provides only a slight speed improvement over matrix scaling, we recommend using the matrix scaling scheme, as we did in all of the other experiments.
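The difference between matrix and scalar scaling can be seen numerically. The sketch below is our illustration, not the paper's exact normalization: it assumes the commonly cited integer VC-1 4-point basis and compares the per-element scaling implied by the row gains with its scalar approximation.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix (rows are basis vectors)."""
    k = np.arange(n)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

# integer 4-point basis commonly cited for the VC1-T (assumption)
t4 = np.array([[17, 17, 17, 17],
               [22, 10, -10, -22],
               [17, -17, -17, 17],
               [10, -22, 22, -10]], dtype=float)

# If each VC1-T row is (approximately) a scaled DCT row, the two
# transforms merge via one per-element scaling S ("matrix scaling");
# replacing S by its mean gives the cruder "scalar scaling".
row_gain = np.linalg.norm(t4, axis=1) / np.linalg.norm(dct_matrix(4), axis=1)
S = np.outer(row_gain, row_gain)   # per-element matrix scaling
s = S.mean()                       # scalar approximation
```

With this basis the entries of S differ from their mean by well under 2%, which is why scalar scaling is a plausible simplification, and why its residual mismatch shows up as the small extra loss reported in Fig. 9.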


Fig. 11. Speed comparison of different transcoding schemes without spatial resolution reduction.

C. Complexity Scalability Versus Graceful Quality Degradation Here, we provide experimental results for the AEC-DST schemes for bit-rate reduction and 2:1 downscaling transcoding applications. We compare the coding efficiency and speed of different transcoding schemes, such as CPDT, CPDT with motion vector reuse, and the proposed AEC-DST transcoders. For reference purposes, we also give the results of the open-loop and closed-loop schemes. 1) Pure Bit-Rate-Reduction Transcoding: The coding efficiency comparisons for transcoding without spatial resolution change are shown in Fig. 10 for the two test sequences. Since WMV has many advanced coding features that are only applicable to the CPDT scheme, we study the performance for both cases where those advanced features are turned ON and OFF. From the figures, we can observe the following. 1) The performance of the AEC-DST scheme is always bounded by those of the closed-loop and open-loop schemes. This is indeed as expected. In fact, the performance of the AEC-DST

Fig. 12. Coding efficiency comparison of different transcoding schemes with 2:1 spatial resolution reduction.

scheme can freely move between the two bounds, depending on the settings of the transcoding parameters. 2) The performance of the closed-loop transcoder is very close to that of the CPDT with motion vector reuse scheme. This is reasonable, since both schemes reuse the motion information. What makes the difference is the rounding error. 3) The performance of the CPDT with motion vector reuse scheme can exceed that of the CPDT with motion reestimation when the target bit rate is relatively high. This is because, in the motion reestimation case, the encoder tries to find the best motion with respect to the already distorted input, which does not necessarily lead to the best motion with respect to the very original video against which we are calculating PSNR. On the contrary, the MV reuse scheme sticks to a more precise input MV. 4) The CPDT with motion reestimation yields the best performance in the low-bit-rate range. This is in sharp contrast to the bullet above. The reason is that,


Fig. 13. Speed comparison of different transcoding schemes with 2:1 spatial resolution reduction.

in the low-bit-rate case, the requantization error dominates the performance. The original motion vector is no longer suitable, even though it is more accurate, because the large requantization error has led to a significantly distorted reference. The speed comparisons for the SmallTrap and BestCap sequences are shown in Fig. 11. We exclude the numbers of the two CPDT with motion reestimation schemes, because their speed is extremely slow and heavily depends on the motion estimation method adopted. The figures clearly show the improvement of the simplified transcoder architectures. The speed of the closed-loop scheme almost doubles that of the CPDT with MV reuse and can achieve roughly twice the real-time (i.e., 30 fps) requirement. The speed of the open-loop scheme is extremely fast: it can achieve five to seven times real-time speed. Again, as expected, the speed of the AEC-DST scheme lies in between those of the closed-loop and open-loop schemes. Considering the coding efficiency performance of AEC-DST, we can conclude that AEC-DST indeed achieves a desired tradeoff between coding efficiency and speed.


Fig. 14. Complexity scalability of AEC-DST scheme. The left figure is for pure bit-rate-reduction transcoding and the right figure is for bit-rate reduction with 2:1 spatial resolution downscaling.

2) Bit-Rate Reduction and 2:1 Downscaling: Fig. 12 depicts the coding efficiency performance of the open-loop, closed-loop, and AEC-DST transcoding schemes with 2:1 spatial resolution reduction for the SmallTrap and BestCap sequences. The ground truth for the PSNR calculation is obtained by applying the same DCT-domain down-sampler (as the one used in the MPEG-2 decoding part of the transcoder) to the original video. Done this way, we minimize the impact of different downscaling filters and focus completely on the transcoding efficiency. Fig. 13 shows the speed comparison. Note that, in our experiment, the CPDT speed is 28–30 and 30–32 fps across the bit-rate ranges shown in the figures for SmallTrap and BestCap, respectively. Using the DCT-domain down-sampler as the standalone intermediate filtering module in CPDT (i.e., the "Filtering for New Property" module in Fig. 3) would require performing the DCT, retaining partial DCT coefficients (mainly memory shuffling), and performing the IDCT again. Its complexity makes it an


Fig. 15. Drifting control comparison with CST and proposed AEC-DST transcoding schemes for the SmallTrap sequence.

impractical solution even if CPDT is to be used. Therefore, we do not include the CPDT speed in Fig. 13 to avoid being misleading. From these figures, we can clearly see that AEC-DST provides a better tradeoff between coding efficiency and speed. Another observation, which echoes the observation in [18] and can be seen from Fig. 11 as well, is that the transcoding complexity is closely related to the output bit rate. Note that we did not compare the performance against the CPDT cases for spatial resolution reduction. Even though we perform the same DCT-domain downscaling, due to the recursive nature of the motion-compensated prediction involved in both the MPEG-2 decoding part and the WMV encoding part, the coding performance is significantly biased towards CPDT, even when compared with the closed-loop transcoding scheme. Nevertheless, the subjective quality of the closed-loop transcoding scheme is quite comparable to that of the CPDT, with only minor visible artifacts in the B-frames. This is, as previously mentioned, due to the constraint of the one-MV coding mode. 3) Complexity Scalability: As stated before, with the AEC-DST schemes, the application can find a desired tradeoff between quality and speed. The tradeoff is controlled by the switches, which are application-controllable using thresholds. In our implementation, we provide ten intermediate thresholding levels (i.e., Threshold Levels 1 to 10) between the closed-loop transcoder (represented by Threshold Level 1) and the open-loop transcoder (represented by Threshold Level 11). To better illustrate the tradeoff, we also label each point with the bit-rate change and the corresponding PSNR change, all with reference to the closed-loop performance. Due to space constraints, we only report the results for the SmallTrap sequence, showing the speed changes against the thresholding levels, as shown

in Fig. 14, for both pure bit-rate reduction and bit-rate reduction with 2:1 spatial resolution downscaling. In the figure, the bit-rate change and the PSNR change are depicted by the dashed–dotted and dashed short lines on each anchor point. From the figure, we can see that a faster speed usually comes at a larger PSNR penalty. However, the loss may not be as significant as the numbers indicate, since the corresponding rate is slightly reduced as well. In other words, we need to consider the changes in both PSNR and bit rate when interpreting the performance loss. On the other hand, the general trend of the tradeoff between quality and speed always holds.

D. Drifting Error Control

Here, we show the excellent drifting control capability of the proposed AEC-DST scheme. As a benchmark, we compare our scheme against the CST scheme [15]. The experiments are carried out as follows. We force the error update switch off (i.e., the MC selector is OFF in the CST scheme, and in our AEC-DST the switch is connected to the accumulate position instead of the update position) for the first P-frame in every GOP. The switch is then turned on for the remaining B- and P-frames in the GOP. That is, all the frames except the first P-frame in every GOP are transcoded with the closed-loop transcoding architecture. The closed-loop transcoder (for all the frames) is used as the reference, since it provides the best coding efficiency. The comparisons are shown in Figs. 15 and 16 for the SmallTrap and BestCap sequences, respectively. In each figure, we also show the differences between the complexity-scalable schemes and the closed-loop one, so as to have a closer view of the drifting errors. As can be seen from all of the figures, both the AEC-DST and CST schemes have the largest quality loss at the first P-frame (the fourth frame) in every GOP, because of not performing error


Fig. 16. Drifting control comparison with CST and proposed AEC-DST transcoding schemes for the BestCap sequence.

updating. The big quality loss of the first P-frame also propagates to the two preceding B-frames. However, for our AEC-DST scheme, the quality quickly catches up with that of the closed-loop transcoder for the remaining frames in the GOP, while the CST scheme suffers from severe drifting. It is evident that the error accumulation indeed helps to control the drifting. All of the observations confirm the analysis in the Appendix.

V. CONCLUSION AND REMARKS

In this paper, we studied the problem of efficient transcoding from MPEG-2 to the WMV format. Based on an in-depth analysis of the error propagation behavior, we proposed two architectures with adaptive error compensation and dynamic switches for application scenarios that need bit-rate reduction with or without spatial resolution reduction. Both architectures feature excellent complexity scalability and adaptive drifting error control. In the derivation of these architectures, we also showed that the standard IDCT (as in all the MPEG series standards) can be merged with other DCT-like transforms (e.g., the integer transform in WMV) with a proper one-time per-element scaling. We performed extensive experiments to verify various design targets such as the drifting error control, the complexity scalability, and the performance tradeoffs between speed and quality.

Remark 1: For transcoding with arbitrary resolution change, we can adopt the cascaded pixel-domain transcoding architecture but with a two-stage downscaling strategy. That is, we decompose the overall downscaling ratio into a product of two proper downscaling ratios, exert DCT-domain downscaling for the first stage (i.e., fully embed the first-stage downscaling into the decoding loop, as we did for the 2:1 downscaling case), and perform the second-stage downscaling

in the pixel domain at the already reduced intermediate resolution. For example, for HD (1280×720p) to SD (720×480p) transcoding, we need a horizontal downscaling ratio of 16:9. This can be achieved using two 4:3 downscaling stages: the first 4:3 downscaling in the partial MPEG-2 decoder, and spatial filtering for the second 4:3 downscaling before re-encoding.

Remark 2: The MPEG-4 syntax has significant overlap with that of WMV. Although WMV has many advanced coding technologies, they are not used in the AEC-DST scheme. As a result, the techniques developed in this work can be readily applied to MPEG-2 to MPEG-4 transcoding applications, with possible simplifications such as no need for motion error compensation for B-frames, since MPEG-4 supports the four-MV syntax for B-frames. That is, the architectures shown in Figs. 6 and 7 are sufficient for MPEG-2 to MPEG-4 transcoding, without ever needing the one in Fig. 8.

Remark 3: Due to the space limit of the paper, we did not touch on the transcoding of interlaced content. However, we have implemented a transcoder that can efficiently transcode interlaced content using ideas similar to those proposed in this paper.

APPENDIX
ANALYSIS ON ERROR PROPAGATION AND CONTROL

We now provide a thorough theoretical analysis of the error propagation and reveal why the error accumulation is important. Throughout the analysis, one superscript stands for the ground truth, which is obtained using a closed-loop transcoder; two other superscripts stand for the CST scheme and our AEC-DST scheme, respectively.
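Before the frame-by-frame derivation, the qualitative effect can be illustrated with a grossly simplified 1-D simulation. This is our toy model, not the paper's formulation: zero-motion prediction, one scalar residue per frame, and a uniform requantizer. It shows why feeding the requantization error back keeps the decoder drift bounded, while an uncompensated loop accumulates it:

```python
import numpy as np

def quantize(x, step):
    """Uniform mid-tread quantizer."""
    return step * np.round(x / step)

step = 8.0
residues = [3.0] * 10          # toy per-frame residue values

# Error-compensating loop: the requantization error is fed back
# and added to the next frame's residue before requantization,
# so the drift can never exceed half a quantization step.
drift_comp, err = [], 0.0
for r in residues:
    q = quantize(r + err, step)
    err = (r + err) - q
    drift_comp.append(abs(err))

# Uncompensated loop: with zero-motion prediction, each frame's
# requantization error simply adds to the accumulated drift.
drift_open, total = [], 0.0
for r in residues:
    total += r - quantize(r, step)
    drift_open.append(abs(total))
```

With these numbers, the compensated drift never exceeds step/2 = 4, while the uncompensated drift grows linearly to 30 after ten frames.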


Fig. 17. Block diagram of a standard decoder.

Since the transforms are linear and properly taken care of, we omit the transform symbols in the derivation below. We also omit the motion vector in the MC operation, because both the MC operations and the MVs are the same. Let the following quantities be defined (please refer to Figs. 2 and 6 for their positions): the input to the VC-1 requantizer for the proposed AEC-DST scheme, which is also the input to the MPEG-2 requantizer in CST; the output of the inverse MPEG-2 quantizer; and the residue error buffer. To simplify the expressions, we use a single operator to represent the operation of inverse quantization over a quantized input. Assume the same quantization parameters are used in the two transcoders. In order to study the impacts of the different schemes on the decoder, we assume a standard decoder, as shown in Fig. 17, is cascaded to the CST and AEC-DST transcoders. We use a closed-loop transcoder as our ground truth, because it compensates for the requantization errors all of the time.

Now suppose that, for the CST scheme, the MC selector is turned off for the $n$th frame for the first time and the switch is turned on again from the $(n+1)$th frame on; equivalently, for our AEC-DST, the switch is always connected to the update position until the $n$th frame, for which the switch is connected to the accumulate position, and is reconnected to the update position again from the $(n+1)$th frame on. Let the pixel buffer at the decoder (i.e., the decoder output) also be tracked. Obviously, the buffer contents are the same up to the $(n-1)$th frame for all the transcoders (including the ground-truth closed-loop transcoder) and the corresponding decoders. Note that the residue error buffer and the frame buffer are distinct and, in general, differ among the transcoders.

A. Ground Truth (Closed-Loop) Transcoder Case

For the $n$th frame, we have (9)–(11). For the $(n+1)$th frame, we have (12)–(14).

B. CST Scheme Case

For the $n$th frame, we have (15)–(17). For the $(n+1)$th frame, we have (18)–(20).

C. AEC-DST Scheme Case

For the $n$th frame, we have (21)–(23). For the $(n+1)$th frame, we have (24)–(26).

Comparing (15) and (21) with (9), we find that the error buffer content is not updated to the input signal for the $n$th frame and, from (22), that our AEC-DST scheme indeed performs error accumulation even if we do not perform the error update for the $n$th frame. This is in sharp contrast to the CST scheme [see (16)], where the error buffer is always refreshed by the new requantization error. However, further comparing (17) and (23) with (11), we see that the decoder output for the $n$th frame is the same for both the CST and AEC-DST schemes and is the same as that of the ground truth. This is indeed as expected, since any operation in the feedback loop experiences a one-frame delay before taking effect. Now, let us see the impact of turning off the switch on the $(n+1)$th frame. The effect is conveyed through the error signal, since the decoder buffer contents for the two transcoders are the same up to the $n$th frame.

D. Comparisons of CST and AEC-DST Against Closed-Loop Transcoder

To compare the CST and the AEC-DST schemes against the ground truth, we calculate their differences for the $(n+1)$th-frame buffers at the decoder by subtracting (20) and (26) from (14), respectively, as follows:

(27)


and

(28) To simplify, let us assume zero motion for the $(n+1)$th frame; then (27) and (28) will, respectively, be simplified to

(29) and

(30) Paying attention to the underlined terms in (29) and (30), it becomes clear that, when the accumulated error is sufficiently large (which may be due to accumulation) to trigger a quantization-level change, it will be canceled out in our AEC-DST scheme, while it remains in the CST scheme. Assuming the error is bounded by half the step size of the quantizer, it is easy to prove that the residual difference in AEC-DST is smaller than that in CST. Note further that this term is the main cause of the drifting error in the CST scheme once the MC selector is switched off, and it will keep on propagating to subsequent frames (until the next Intra frame) due to the predictive nature of MPEG-2 and WMV coding. In AEC-DST, such drifting error is well controlled, i.e., as soon as it grows to a certain degree, the main part of the error signal will be compensated.

REFERENCES

[1] Y. Su, J. Xin, A. Vetro, and H. Sun, "Efficient MPEG-2 to H.264/AVC intra transcoding in transform-domain," in Proc. IEEE Int. Symp. Circuits Syst., Kobe, Japan, May 2005, pp. 1234–1237. [2] Z. Zhou, S. Sun, S. Lei, and M.-T. Sun, "Motion information and coding mode reuse for MPEG-2 to H.264 transcoding," in Proc. IEEE Int. Symp. Circuits Syst., Kobe, Japan, May 2005, pp. 1230–1233. [3] Y.-P. Tan and Y.-Q. Liang, "Methods and need for transcoding MPEG-4 fine granularity scalability video," in Proc. IEEE Int. Symp. Circuits Syst., 2002, vol. 4, pp. 719–722. [4] J. Xu, F. Wu, and S. Li, "Transcoding for progressive fine granularity scalable video coding," in Proc. IEEE Int. Symp. Circuits Syst., 2004, pp. 765–768. [5] E. Barrau, "MPEG video transcoding to a fine-granular scalable format," in Proc. IEEE Int. Conf. Image Process., 2002, vol. 1, pp. 717–720. [6] J. Youn, J. Xin, and M.-T. Sun, "Fast video transcoding architectures for networked multimedia applications," in Proc. Int. Symp. Circuits Syst., 2000, vol. 4, pp. 25–28. [7] G. de los Reyes, A. R. Reibman, S.-F. Chang, and J.-I. Chuang, "Error-resilient transcoding for video over wireless channels," IEEE J. Sel. Areas Commun., vol. 18, no. 6, pp. 1063–1074, Jun. 2000.


[8] S. Dogan, A. Cellatoglu, M. Uyguroglu, A. Sadka, and A. Kondoz, "Error-resilience video transcoding for robust inter-network communications using GPRS," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 6, pp. 453–464, Jun. 2002. [9] S. Srinivasan, P. Hsu, T. Holcomb, K. Mukerjee, S. L. Regunathan, B. Lin, J. Liang, M.-C. Lee, and J. Ribas-Corbera, "Windows Media Video 9: Overview and applications," Signal Process.: Image Commun., vol. 19, no. 9, pp. 851–875, Oct. 2004. [10] J. Xin, C.-W. Lin, and M.-T. Sun, "Digital video transcoding," Proc. IEEE, vol. 93, no. 1, pp. 84–97, Jan. 2005. [11] A. Vetro, C. Christopoulos, and H. Sun, "Video transcoding architectures and techniques: An overview," IEEE Signal Process. Mag., vol. 20, no. 2, pp. 18–29, Mar. 2003. [12] W. Zhu, K. Yang, and M. Beacken, "CIF-to-QCIF video bitstream down-conversion in the DCT domain," Bell Labs Tech. J., vol. 3, no. 3, Jul.–Sep. 1998. [13] C.-W. Lin and Y.-R. Lee, "Fast algorithms for DCT-domain transcoding," in Proc. IEEE Int. Conf. Image Process., Thessaloniki, Greece, Oct. 2001, pp. 421–424. [14] Y.-R. Lee, C.-W. Lin, and C.-C. Kao, "A DCT-domain video transcoder for spatial resolution downconversion," in Proc. 5th Int. Conf. Recent Advances in Visual Information Systems (VISUAL'02), 2002, pp. 207–218. [15] E. Barrau, "A scalable MPEG-2 bit-rate transcoder with graceful degradation," IEEE Trans. Consumer Electron., vol. 47, no. 3, pp. 378–384, Aug. 2001. [16] L. Yuan, F. Wu, Q. Chen, S. Li, and W. Gao, "The fast close-loop video transcoder with limited drifting error," in Proc. IEEE Int. Symp. Circuits Syst., 2004, pp. 769–772. [17] G. Shen, B. Zeng, Y.-Q. Zhang, and M. L. Liou, "Transcoder with arbitrarily resizing capability," in Proc. IEEE Int. Symp. Circuits Syst., 2001, vol. 5, pp. 25–28. [18] J. Xin, M.-T. Sun, B.-S. Choi, and K.-W. Chun, "An HDTV-to-SDTV spatial transcoder," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 11, pp. 998–1008, Nov. 2002. [19] Y.-P. Tan and H. Sun, "Fast motion re-estimation for arbitrary downsizing video transcoding using H.264/AVC standard," IEEE Trans. Consumer Electron., vol. 50, no. 3, pp. 887–904, Aug. 2004. [20] C. Wang, H.-B. Yu, and M. Zheng, "A fast scheme for arbitrarily resizing of digital image in the compressed domain," IEEE Trans. Consumer Electron., vol. 49, no. 2, pp. 466–471, May 2003. [21] K. N. Ngan, "Experiments on two-dimensional decimation in time and orthogonal transform domains," Signal Process., vol. 11, pp. 249–263, 1986. [22] R. Dugad and N. Ahuja, "A fast scheme for image size change in the compressed domain," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 4, pp. 461–474, Apr. 2001. [23] J. Youn, M.-T. Sun, and C.-W. Lin, "Motion vector refinement for high-performance transcoding," IEEE Trans. Multimedia, vol. 1, no. 1, pp. 30–40, Mar. 1999. [24] M.-J. Chen, M.-C. Chu, and C.-W. Pan, "Efficient motion estimation algorithm for reduced frame-rate video transcoder," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 4, pp. 269–275, Apr. 2002. [25] S. J. Wee, J. G. Apostolopoulos, and N. Feamster, "Field-to-frame transcoding with spatial and temporal downsampling," in Proc. IEEE Int. Conf. Image Process., 1999, vol. 4, pp. 271–275. [26] Y. Liang, L.-P. Chau, and Y.-P. Tan, "Arbitrary downsizing video transcoding using fast motion vector reestimation," IEEE Signal Process. Lett., vol. 9, no. 11, pp. 352–355, Nov. 2002. [27] B. Shen, I. K. Sethi, and B. Vasudev, "Adaptive motion-vector resampling for compressed video downscaling," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 6, pp. 929–936, Sep. 1999. [28] M.-J. Chen, M.-C. Chu, and S.-Y. Lo, "Motion vector composition algorithm for spatial scalability in compressed video," IEEE Trans. Consumer Electron., vol. 47, no. 3, pp. 319–325, Aug. 2001. [29] J. Yeh and G. Cheung, "Complexity scalable mode-based H.263 video transcoding," in Proc. IEEE Int. Conf. Image Process., Sep. 2003, pp. 169–172. [30] N. Bjork and C. Christopoulos, "Transcoder architectures for video coding," IEEE Trans. Consumer Electron., vol. 44, no. 1, pp. 88–98, Feb. 1998. [31] J. W. C. Wong and O. C. Au, "Modified predictive motion estimation for reduced-resolution video from high-resolution compressed video," in Proc. IEEE Int. Symp. Circuits Syst., 1999, vol. 4, pp. 524–527. [32] P. Yin, A. Vetro, B. Liu, and H. Sun, "Drifting compensation for reduced spatial resolution transcoding," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 11, pp. 1009–1020, Nov. 2002.


Guobin Shen (S’99–M’02–SM’06) received the B.S. degree from Harbin University of Engineering, Harbin, China, in 1994, the M.S. degree from Southeast University, Nanjing, China, in 1997, and the Ph.D. degree from the Hong Kong University of Science and Technology (HKUST) in 2001, all in electrical and electronic engineering. He was a Research Assistant with HKUST from 1997 to 2001. Since then, he has been with Microsoft Research Asia, Beijing, China, where he is now a Researcher and Project Lead with the Wireless and Networking Group. His research interests include digital image and video signal processing, video coding and streaming, distributed/parallel computing and peer-to-peer networking, general computing on GPU, wireless networking and mobile computing, and media management. He has published approximately 12 journal papers and more than 30 conference papers. He has been granted two U.S. patents and has filed more than a dozen patent applications. He is an Associate Editor for the Journal of Advances in Multimedia, a member of the Multimedia Systems and Applications Technical Committee (MSATC) of the IEEE Circuits and Systems Society, a TPC member for several international conferences, and a reviewer for several journals and many conferences. Dr. Shen is a member of the Association for Computing Machinery.

Yuwen He received the B.S. degree in applied physics and the Ph.D. degree in computer applications from Tsinghua University, Beijing, China, in 1998 and 2002, respectively. From 2002 to 2003, he was a Lecturer with the Computer Science and Technology Department, Tsinghua University. From January to April 2003, he was a Visiting Researcher working on H.264 codec SoC design with an ARM core in the Electronic Engineering Department, National Chiao Tung University. In 2004, he joined the Internet Media Group, Microsoft Research Asia, Beijing, China, as a Postdoctoral Researcher. In 2005, he joined Panasonic Singapore Laboratories, Singapore, as a Senior Engineer. His current research interests include video coding, transcoding, transmission, video processing in embedded systems, and software tamper resistance.

Wanyong Cao received the M.S. degree in computer science from Peking University, Beijing, China, in 2002. He is currently a Research Software Engineer with Microsoft Research Asia, Beijing, China. His current interests include software engineering, multimedia technology, and vertical search technology.

Shipeng Li received the B.S. and M.S. degrees from the University of Science and Technology of China (USTC), Hefei, China, in 1988 and 1991, respectively, and the Ph.D. degree from Lehigh University, Bethlehem, PA, in 1996, all in electrical engineering. He was with the Electrical Engineering Department, USTC, during 1991–1993 and was a Member of Technical Staff with Sarnoff Corporation, Princeton, NJ, during 1996–1999. He has been a Researcher with Microsoft Research Asia, Beijing, China, since May 1999, and has contributed technologies to MPEG-4 and H.264. His research interests include image/video compression and communications, digital television, multimedia, and wireless communication.
