IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 15, NO. 1, JANUARY 2005
Direct Mode Coding for Bipredictive Slices in the H.264 Standard Alexis Michael Tourapis, Member, IEEE, Feng Wu, Member, IEEE, and Shipeng Li, Member, IEEE
Abstract—The new H.264 (MPEG-4 AVC) video coding standard can achieve considerably higher coding efficiency compared to previous standards. This is accomplished mainly through variable block sizes for motion compensation, multiple reference frames, and intra prediction, but also through better exploitation of the spatiotemporal correlation that may exist between adjacent macroblocks, with the SKIP mode in predictive (P) slices and the two DIRECT modes in bipredictive (B) slices. These modes, when signaled, can in effect represent the motion of a macroblock (MB) or block without having to transmit any of the additional motion information required by other inter-MB types. This property also makes these modes highly compressible, especially when run length coding strategies are used. Although spatial correlation of motion vectors from adjacent MBs is used for SKIP mode to predict its motion parameters, until recently DIRECT mode considered only the temporal correlation of adjacent pictures. In this letter, we introduce alternative methods for generating the motion information for the DIRECT mode using spatial or combined spatiotemporal correlation. Considering that temporal correlation requires that the motion and timestamp information from previous pictures be available in both the encoder and decoder, it is shown that our spatial-only method can reduce or eliminate such requirements while, at the same time, achieving similar performance. The combined methods, on the other hand, by jointly exploiting spatial and temporal correlation either at the MB or slice/picture level, can achieve even higher coding efficiency. Finally, improvements on the existing Rate Distortion Optimization related to B slices within the H.264 codec are also presented, which can lead to improvements of up to 16% in bit rate reduction or, equivalently, more than 0.7 dB in PSNR.
Index Terms—Biprediction, DIRECT mode, H.264, motion compensation, MPEG-4 AVC, spatial correlation, temporal correlation, video coding.
I. INTRODUCTION
THE new H.264 (or JVT, H.26L, MPEG-4 AVC) [1] video coding standard has gained more and more attention recently, mainly due to its higher coding efficiency versus previous standards. This new standard essentially relies on several new features [2] such as variable block sizes, ranging from 16×16 down to 4×4, with a quadtree structure for motion compensation, multiple reference frames, intra prediction, context adaptive entropy coding, and in-loop deblocking filtering, but also on generalized bipredictive (B) slices [3]. Unlike older standards such as MPEG-2 [4] and MPEG-4 part 2 [5], B slices can use multiple predictions from pictures coming from the same direction (forward or backward), while these could also be used as references for other slices. Unfortunately, the above also implied that for this standard a considerably higher percentage of bits is needed for encoding motion information, whether for signaling prediction modes, motion vectors, and/or references. To alleviate this problem the SKIP [2] and DIRECT modes were introduced within P and B slices, respectively, according to which motion is derived directly from previously encoded information, thus not having to encode any additional motion data for a macroblock (MB) or block. Motion for these modes was obtained by exploiting either spatial (SKIP) or temporal (DIRECT) correlation of the motion of adjacent MBs or pictures. Furthermore, considering the bipredictive nature of B slices, DIRECT mode could derive two such motion vectors pointing to different references, namely the List 0 and List 1 references, thus leading to even further performance benefit. Additionally, since these modes can be signaled without having to transmit any residual data and due to their very high occurrence within a video stream, run length coding (RLC) strategies, which depend on the entropy coding scheme used, are employed within H.264 for coding mode information. This can lead to a further increase in coding efficiency. In particular, if several (i.e., RunN) adjacent MBs according to the scanning order are to be coded using SKIP mode or DIRECT mode without coefficients, then only the actual number of such MBs needs to be coded instead of coding each one individually. On the other hand, a drawback of this method is that if no such MB exists, it is also necessary to signal a zero run length code, which could introduce additional overhead within the bitstream. Nevertheless, due to the high occurrence of SKIP and DIRECT modes, this negative impact is more than compensated in most cases.
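The run length idea above can be illustrated with a minimal Python sketch. It is only loosely modeled on how a run of skipped MBs might be signaled; the function name `encode_skip_runs` and the `("skip_run", n)` / `("coded_mb", i)` symbols are hypothetical and do not reproduce the actual H.264 syntax.

```python
from typing import List, Tuple


def encode_skip_runs(mb_is_skipped: List[bool]) -> List[Tuple[str, int]]:
    """Convert per-MB skip flags (in scan order) into run length symbols.

    A run length is emitted before every coded (non-skipped) MB, even when
    the run is zero, which is the overhead mentioned in the text.
    """
    symbols: List[Tuple[str, int]] = []
    run = 0
    for index, skipped in enumerate(mb_is_skipped):
        if skipped:
            run += 1  # extend the current run of SKIP / residual-free DIRECT MBs
        else:
            symbols.append(("skip_run", run))    # may be a zero run
            symbols.append(("coded_mb", index))  # placeholder for the MB payload
            run = 0
    if run > 0:
        symbols.append(("skip_run", run))        # trailing run at the end of the slice
    return symbols


if __name__ == "__main__":
    # Two skipped MBs, two coded MBs back to back, then one skipped MB.
    print(encode_skip_runs([True, True, False, False, True]))
```

In the usage example the second coded MB is preceded by a zero run, which shows where the extra signaling cost comes from when skipped MBs are rare.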
The DIRECT mode motion parameters using temporal correlation were essentially derived for the current MB/block by considering the motion information within a colocated position in a usually subsequent reference picture or, more precisely, the first List 1 reference. Following the assumption that an object is moving with constant speed, these parameters are scaled according to the temporal distances (Fig. 1) of the reference pictures involved. The motion vectors $MV_{L0}$ and $MV_{L1}$ for a DIRECT coded block versus the motion vector $MV_C$ of its colocated position in the first List 1 reference were originally calculated as
Manuscript received March 28, 2003; revised January 16, 2004. This paper was recommended by H. Sun. A. M. Tourapis was with Microsoft Research Asia, Beijing 100080, China. He is now with DoCoMo Communications Laboratories USA, Inc., San Jose, CA 95110 USA. F. Wu and S. Li are with Microsoft Research Asia, Beijing 100080, China. Digital Object Identifier 10.1109/TCSVT.2004.837021
1051-8215/$20.00 © 2005 IEEE
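$$MV_{L0} = \frac{TR_B}{TR_D} \times MV_C \tag{1}$$
$$MV_{L1} = \frac{TR_B - TR_D}{TR_D} \times MV_C \tag{2}$$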
Fig. 1. DIRECT prediction in B slice coding.
but were later replaced with equations
$$Z = \frac{TR_B \times 256}{TR_D} \tag{3}$$
$$MV_{L0} = (Z \times MV_C + 128) \gg 8 \tag{4}$$
$$W = Z - 256 \tag{5}$$
$$MV_{L1} = (W \times MV_C + 128) \gg 8 \tag{6}$$
which can reduce the number of divisions required since the variables $Z$ and $W$ can be precomputed at the slice/picture level. In the above, $TR_B$ and $TR_D$ are the temporal distances, or more precisely picture order count (POC) distances [1], of the reference picture used by the List 0 motion vector of the colocated block in the List 1 picture compared to the current and the List 1 picture, respectively. The List 1 reference picture and the reference in List 0 referred to by the motion vectors of the colocated block in List 1 are used as the two references of DIRECT mode. Due to the temporal nature of the derivation of the motion parameters we will name this DIRECT mode Temporal DIRECT.
Temporal correlation unfortunately tends to decrease considerably when the temporal distance between the pictures involved increases, or when motion is irregular rather than continuous. Furthermore, the temporal design does not handle efficiently the case of the colocated position being coded in intra mode, meaning that there is no motion information associated with such a partition. For this case, the current design makes the assumption that the motion information of the colocated block can be considered as zero, while using the first forward and backward pictures as references. However, it is very likely that in this case the two corresponding positions in these references have little if any relationship, and using their average for prediction might not be appropriate or effective. As an example, we can consider the case where the entire first List 1 picture is coded as intra due to a scene change (Fig. 2). This means that both positions used for prediction, if temporal DIRECT is to be used, are likely to be uncorrelated, and the resulting prediction from this method for both B pictures/slices may be meaningless since we could be averaging values from two uncorrelated sources. Obviously smaller regions could also have similar issues, e.g., due to object occlusion, or motion itself might not satisfy the continuous motion assumption (objects might instead be accelerating, or the motion estimated is not the true motion of the sequence). In fact, it is possible that either the motion predicted or the references themselves are not good enough for prediction. Correlation also tends to decrease considerably when further considering the multireference nature of the H.264 codec. As can be seen from Fig. 3, it is very likely that the correlation of the B pictures/slices with the List 0 reference picture used by the colocated block in the first List 1 picture is relatively low, which again implies a potential reduction in coding efficiency.
In addition to the above issues, temporal DIRECT requires that both the encoder and decoder have a priori knowledge of the timestamp information for each slice, which might not always be desirable. This problem has been partially solved by using the POC distances instead of actual timing information, which are essentially embedded within the bitstream and are identical in both encoder and decoder. We refer the reader to [1] for further information on this process. On the other hand, motion information of the List 1 references of a B slice also needs to be available, which implies a considerable increase in the memory requirements of a system, especially when considering that H.264 allows up to 16 motion vectors per MB. Furthermore, the division operation, although it was later replaced with the shift operations shown in (3)–(6), can negatively affect the implementation and performance cost of a system.
In this letter, we introduce alternative methods for the derivation of the DIRECT mode parameters. Our methods rely on using spatial correlation or a combination of spatial and temporal correlation instead of only temporal correlation, and can alleviate the previously discussed issues while retaining or even improving coding performance.
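As an illustration of the trade-off between the division-based and shift-based temporal scaling discussed above, the following Python sketch compares the two. It assumes the usual linear scaling of the colocated vector by the ratio of temporal distances and a fixed-point factor with 8 fractional bits; the function names and the names `tr_b`, `tr_d`, `z`, and `w` are illustrative choices, and positive temporal distances are assumed.

```python
from typing import Tuple

MV = Tuple[int, int]


def scale_with_division(mv_c: MV, tr_b: int, tr_d: int):
    """Scale the colocated vector with one division per component (floating point)."""
    mv_l0 = (tr_b * mv_c[0] / tr_d, tr_b * mv_c[1] / tr_d)
    mv_l1 = ((tr_b - tr_d) * mv_c[0] / tr_d, (tr_b - tr_d) * mv_c[1] / tr_d)
    return mv_l0, mv_l1


def precompute_scale(tr_b: int, tr_d: int) -> Tuple[int, int]:
    """Fixed-point factors; the only division, done once per slice/picture."""
    z = (tr_b * 256) // tr_d
    w = z - 256
    return z, w


def scale_with_shift(mv_c: MV, z: int, w: int) -> Tuple[MV, MV]:
    """Per-block scaling using multiply, add, and arithmetic shift only."""
    mv_l0 = ((z * mv_c[0] + 128) >> 8, (z * mv_c[1] + 128) >> 8)
    mv_l1 = ((w * mv_c[0] + 128) >> 8, (w * mv_c[1] + 128) >> 8)
    return mv_l0, mv_l1


if __name__ == "__main__":
    z, w = precompute_scale(tr_b=1, tr_d=2)   # current picture halfway between references
    print(scale_with_shift((8, -4), z, w))    # ((4, -2), (-4, 2))
    print(scale_with_division((8, -4), 1, 2))
```

When the temporal distances divide evenly, as in the usage example, both variants agree; otherwise the shift-based form rounds to integer vector units, which is the price paid for removing the per-block division.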
Spatiotemporal combination can be done at either the MB or picture/slice level. Considering also the impact of rate distortion optimization (RDO) and mode decision on the performance of B slices, and of DIRECT mode in particular, we present some additional modifications in the mode decision process that can considerably improve the coding efficiency of the H.264 standard. The spatial correlation-based DIRECT mode is first discussed in Section II. The MB-based spatiotemporal DIRECT mode is then presented in Section III, while the picture/slice level combination is presented in Section IV. In Section V, our improvements on the RDO related to B slices are discussed, followed by simulation results and the conclusion. It should be noted that due to their properties and efficiency some of the presented methods have already been proposed to and adopted by the H.264 standard.
II. SPATIAL DIRECT MODE
As we have previously discussed, the DIRECT mode design within H.264 was originally based on the assumption that high temporal correlation exists within a video sequence. In reality, however, apart from temporal correlation, high correlation also exists in the spatial domain. Motion vectors of spatially adjacent blocks can be highly correlated, especially when these belong to the same object. This concept is already exploited in several fast
Fig. 2. Low correlation between two reference pictures exists when a scene change occurs.
Fig. 3. Correlation might be low if the reference of the colocated block is not the most recent previous picture.
Fig. 4. Motion Vector Prediction for E is the median value of A, B, and C or D depending on position.
motion estimation architectures [6], [7], but most importantly during the coding process of the motion vectors. This is roughly done through the estimation of a motion vector predictor (MVP) using the median of the motion vectors of the adjacent blocks (Fig. 4) on the left (A), top (B), and top-right (C). The location on the top-right is replaced with the top-left location (D) if it is not available. Since H.264 can also consider multiple references, if only one of these predictors uses the same reference as E, then only that predictor is considered instead of the median value. The MVP is used as a predictor and only the difference compared to the current motion vector needs to be coded. Considering that in many cases this MVP is very close to the true motion of the MB, H.264 also introduced the concept of the SKIP MB mode within P slices. If an MB is coded as a SKIP MB then all its motion information is considered to be equal to the MVP of the 16×16 Inter MB type, unless either the top or left predictor had a zero motion vector and zero reference, or the skipped MB was on the first row or column. In these cases motion for a SKIP MB was again set to zero. On the other hand, regardless of the derivation method of its motion parameters, the first List 0 reference picture was always used as the reference for a SKIP MB. Obviously, a similar concept, using spatial correlation, could also be employed for DIRECT mode. The DIRECT mode using such correlation will be called the Spatial DIRECT mode.
The derivation process of the motion parameters for the spatial DIRECT mode can be divided into two phases: an initial process for the derivation of the reference pictures used for the prediction, which we will call the reference picture selection (RPS) process, and the motion vector selection process. Unlike SKIP, which always uses the same reference, partly due to the reasons discussed previously, we have found that predicting the references used can also impact performance. In our case, instead of always using zero, the reference picture index for each direction is selected as the minimum nonnegative (nonintra) reference of the spatially adjacent blocks used in the MVP process of a 16×16 MB. If there is no available prediction for one direction, then the DIRECT mode uses single prediction from the single available predicted reference of RPS. If neither is available (i.e., all adjacent MBs are intra, or outside the slice/picture), then we select the closest reference pictures for both directions, as was done for the intra case for temporal DIRECT. In this way, we can effectively exploit the spatial correlation that blocks may have in terms of their references, thus possibly solving some of the problems discussed previously (Fig. 5). For example, in the case of a scene change or object occlusion, using this method we would expect that for DIRECT mode the RPS would select only a single reference, thus avoiding averaging uncorrelated information. As a simplification, and in order to decouple this mode from the consideration of timing information, instead of considering the closest reference in terms of time we may also select the reference with the smallest reference index, with little if any impact on performance. After RPS, the motion vectors for each predicted reference are calculated in a relatively similar manner to the method employed by SKIP mode. However, we have found that performance can be considerably improved through partial consideration of temporal information.
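The reference picture selection and the median prediction described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each direction is given as exactly three neighbors (A, B, and C or D), a negative reference index marks an intra, unavailable, or unused neighbor, unavailable neighbors contribute a zero vector to the median, and the smallest-index simplification is used for the fallback. All function and type names are hypothetical.

```python
from typing import List, Tuple

MV = Tuple[int, int]
Neighbor = Tuple[int, MV]  # (reference index, motion vector); index < 0 means unavailable


def select_reference(neighbors: List[Neighbor]) -> int:
    """RPS for one direction: minimum nonnegative reference index, or -1 if none."""
    valid = [ref for ref, _ in neighbors if ref >= 0]
    return min(valid) if valid else -1


def select_direct_references(list0: List[Neighbor], list1: List[Neighbor]) -> Tuple[int, int]:
    """RPS for both directions; -1 in one direction means single prediction."""
    ref0, ref1 = select_reference(list0), select_reference(list1)
    if ref0 < 0 and ref1 < 0:
        return 0, 0  # nothing available: fall back to the first reference of each list
    return ref0, ref1


def median_mvp(neighbors: List[Neighbor], ref_idx: int) -> MV:
    """Median predictor over three neighbors, with the single-match shortcut."""
    matching = [mv for ref, mv in neighbors if ref == ref_idx]
    if len(matching) == 1:
        return matching[0]  # only one neighbor uses this reference
    mvs = [mv if ref >= 0 else (0, 0) for ref, mv in neighbors]
    xs = sorted(mv[0] for mv in mvs)
    ys = sorted(mv[1] for mv in mvs)
    return xs[1], ys[1]     # component-wise median of three values


if __name__ == "__main__":
    list0 = [(0, (4, 0)), (1, (8, 2)), (0, (6, -2))]
    list1 = [(-1, (0, 0)), (-1, (0, 0)), (-1, (0, 0))]  # e.g., after a scene change
    print(select_direct_references(list0, list1))        # (0, -1): List 0 only
    print(median_mvp(list0, 0))                          # (6, 0)
```

In the usage example no neighbor predicts from List 1, so the block would be predicted from List 0 alone rather than averaging two unrelated pictures.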
More precisely, we observe that if a region is stationary, that is, having zero or very close to zero motion, it is also very likely that its colocated regions in adjacent pictures are also stationary. This property can be rather useful, especially when considering that at object/background boundaries motion from a moving object may leak to its surroundings when using the MVP method. By considering this property we could alleviate this problem, and essentially improve coding efficiency. In our case, this process is performed by initially examining whether the colocated block in the first List 1 reference is stationary, that is, if its motion vectors are within the range $[-1, 1]$ in subpixel units inclusive and their associated reference index is equal to zero. If this condition is satisfied and if the reference for a direction through RPS is also zero, then for this direction the motion parameters are set to zero. Otherwise, if the reference for a direction is not zero or the colocated block is not considered stationary, the MVP process for the 16×16 inter MB pointing to the reference generated through the RPS is used. Clearly, using this method we can exploit spatial and temporal information seamlessly, while we can observe that only a stationarity condition (i.e., one bit per block) needs to be stored for each block. This is considerably lower than the storage required by the temporal method, where both motion vectors and reference picture indices need to be stored. The above process also does not require any division operation, making it much simpler to implement than the temporal method, while also not requiring the knowledge of any timestamp information since reference pictures and motion vectors are selected using a minimum index and median operation, respectively.
Fig. 5. RPS can improve coding efficiency during scene changes.
III. MB-BASED SPATIOTEMPORAL DIRECT MODE COMBINING
The above spatial method essentially solves all previously discussed problems, while having a relatively good performance compared to the existing temporal DIRECT mode. Nevertheless there might be cases where the temporal DIRECT mode may lead to higher coding efficiency compared to the spatial DIRECT mode. For example, the continuous motion assumption could hold true, while spatial prediction might not be representative of the motion within the sequence (e.g., at object boundaries).
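For reference, the motion vector selection step of the spatial method from Section II, including its stationarity shortcut, can be sketched in Python as below. The stationarity threshold of one subpixel unit is an assumption of this sketch, and the function names are hypothetical; the point is that the decoder only needs one stored bit per colocated block.

```python
from typing import Tuple

MV = Tuple[int, int]


def is_stationary(col_mv: MV, col_ref: int, threshold: int = 1) -> bool:
    """One-bit flag stored per block of the first List 1 reference.

    `threshold` (in subpixel units) is an assumed value for this sketch.
    """
    return col_ref == 0 and abs(col_mv[0]) <= threshold and abs(col_mv[1]) <= threshold


def spatial_direct_mv(ref_idx: int, mvp: MV, colocated_stationary: bool) -> MV:
    """Motion vector for one direction of a spatial DIRECT block.

    `ref_idx` is the reference chosen by RPS and `mvp` the 16x16 median
    predictor pointing to that reference.
    """
    if ref_idx == 0 and colocated_stationary:
        return (0, 0)  # stationary region: do not let neighboring motion leak in
    return mvp


if __name__ == "__main__":
    flag = is_stationary(col_mv=(1, 0), col_ref=0)
    print(spatial_direct_mv(ref_idx=0, mvp=(6, 0), colocated_stationary=flag))  # (0, 0)
    print(spatial_direct_mv(ref_idx=1, mvp=(6, 0), colocated_stationary=flag))  # (6, 0)
```

Only the boolean flag has to be kept for the colocated picture, which is where the memory saving over the temporal method comes from.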
If we ignore the implementation and memory issues that we have already discussed, it would nevertheless be desirable to exploit temporal and spatial correlation together more effectively, thus leading to higher coding efficiency. In our case, we perform a combination of the temporal and spatial methods by adaptively selecting each method depending on the characteristics of the colocated region. According to this scheme, if the colocated position in the first List 1 reference is using the first List 0 reference picture and this reference is available, then the temporal method is used. Otherwise, if the colocated block is intra or uses a nonzero reference, or if the reference is not available, the spatial method using the RPS and median prediction is used to predict the reference pictures and motion vectors for the DIRECT coded block. In this case, the stationarity condition is obviously unnecessary since a stationary colocated block would have already satisfied the requirements for using the temporal method. It is rather obvious that this method retains the properties of the spatial method (better performance at scene changes or object occlusion), while also potentially keeping some of the performance of the temporal mode.
IV. SLICE-BASED SPATIOTEMPORAL DIRECT MODE COMBINING
An alternative method for combining the temporal and spatial DIRECT methods can be performed at the picture/slice level. In particular, we may introduce a flag at every slice or picture that will specify whether the temporal or spatial DIRECT mode will be used. If only these two modes are used, the flag can be of only 1 bit, thus having a rather insignificant impact on coding efficiency. This slice-based approach for signaling the DIRECT mode used was also the one adopted by H.264. Obviously we may also consider the MB-based spatiotemporal method, either for replacing one of these modes (e.g., the temporal one) or as a third alternative. The latter of course would require 2 bits per picture/slice for this flag instead of one, which again is insignificant. In terms of the decision process, this could either be explicitly done by the user (Fig. 6(b)), or more intelligent methods could be used for optimally selecting the best DIRECT method for each slice or picture. In our case we assume that each picture consists of a single slice. The decision is performed by coding a slice using all three possible DIRECT modes, and by using a slice level RDO decision to select the one that is deemed best. More specifically, after coding the slice using a specific DIRECT method, the number of bits ($R$) and the sum of squared error (SSE) distortion are calculated for the entire slice.
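The slice level decision can be sketched in Python as follows, assuming a Lagrangian cost of the form SSE plus a multiplier times the bits, which is the usual shape of such a criterion. The mode names, the function name `choose_direct_mode`, and the numbers in the usage example are illustrative only and are not measurements.

```python
from typing import Dict, Tuple


def choose_direct_mode(results: Dict[str, Tuple[float, int]], lam: float) -> str:
    """Pick the DIRECT method with the lowest slice-level Lagrangian cost.

    `results` maps a method name to (SSE, bits) measured after coding the
    whole slice with that method; `lam` is the Lagrangian multiplier.
    """
    costs = {name: sse + lam * bits for name, (sse, bits) in results.items()}
    return min(costs, key=costs.get)


if __name__ == "__main__":
    # Made-up numbers purely to exercise the function.
    trial = {
        "temporal": (1000.0, 500),
        "spatial": (1100.0, 450),
        "mb_spatiotemporal": (1050.0, 480),
    }
    print(choose_direct_mode(trial, lam=1.0))  # temporal
    print(choose_direct_mode(trial, lam=4.0))  # spatial
```

The usage example shows that the winner can change with the multiplier: a larger value weighs bits more heavily, so the method that spends fewer bits wins even at a higher distortion. The price of this exhaustive decision is that the slice is encoded once per candidate method.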
We consider the DIRECT method that minimizes $J$, specified as
$$J = SSE + \lambda \times R \tag{7}$$
where $R$ is the number of bits spent on the slice and $\lambda$ is a Lagrangian multiplier related to the current QP value, as the best one, and this is selected for coding the current slice. This process can be seen in Fig. 6(a).
V. RDO FOR B-SLICES IN H.264
RDO with the use of Lagrangian multipliers has been used extensively within video compression technologies [9] since it can considerably increase coding efficiency.
Fig. 6. DIRECT mode decision can be performed using (a) an adaptive slice level RDO decision or through (b) user selection.
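This method is based on the principle of jointly minimizing both distortion $D$ and rate $R$ using an equation of the form $J = D + \lambda \times R$. H.264 itself has also adopted RDO as an encoding tool within its reference software [10] for both motion estimation and mode decision. In particular, for motion estimation, the best motion vector was found by minimizing:
$$J(\mathbf{m}, \lambda_{MOTION}) = SAD(s, c(\mathbf{m})) + \lambda_{MOTION} \times R(\mathbf{m} - \mathbf{p}) \tag{8}$$
with $\mathbf{m} = (m_x, m_y)^T$ being the motion vector, $\mathbf{p} = (p_x, p_y)^T$ being the prediction for the motion vector, and $\lambda_{MOTION}$ being the Lagrange multiplier. The rate term $R(\mathbf{m} - \mathbf{p})$ only represents motion information and could be easily computed through a table lookup. The $SAD$ represents the sum of absolute difference distortion measure and can be computed as
$$SAD(s, c(\mathbf{m})) = \sum_{x=1}^{B_2} \sum_{y=1}^{B_1} \left| s[x, y] - c[x - m_x, y - m_y] \right| \tag{9}$$
with $s$ being the original video signal and $c$ being the coded video signal. $B_1$ and $B_2$ are the vertical and horizontal dimensions of a block and can be equal to 16, 8, or 4. The Hadamard transform could also be used during subpixel refinement to further enhance performance. Finally, the determination of the reference picture index (REF) and the associated motion vectors for each of the inter modes in P and B slices is done after motion estimation by minimizing
$$J(REF \mid \lambda_{MOTION}) = SAD(s, c(REF, \mathbf{m}(REF))) + \lambda_{MOTION} \times \left( R(\mathbf{m}(REF) - \mathbf{p}(REF)) + R(REF) \right) \tag{10}$$
The rate term $R(REF)$ represents the number of bits associated with choosing REF. The reference picture and block sizes for the bipredictive mode are chosen as a combination of the best forward and backward modes. After motion estimation for all block types and the decision of references, the MB mode decision is performed by minimizing a different Lagrangian functional, which is equal to
$$J(s, c, MODE \mid QP, \lambda_{MODE}) = SSD(s, c, MODE \mid QP) + \lambda_{MODE} \times R(s, c, MODE \mid QP) \tag{11}$$
where $QP$ is the MB quantizer, $\lambda_{MODE}$ is the Lagrange multiplier for mode decision, and $MODE$ indicates a mode chosen from a set of potential predictions [2] for each different slice type. More specifically, $MODE$ is drawn from the intra modes for I slices; from the intra modes, SKIP, and the inter block sizes from 16×16 down to 4×4 for P slices; and from the intra modes, DIRECT, and the same inter block sizes for B slices.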
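The motion estimation cost described in this section, a SAD term plus a Lagrangian-weighted rate for the motion vector difference, can be sketched in Python as follows. The rate model is a stand-in assumption: the length of a signed Exp-Golomb codeword is used in place of the table lookup mentioned in the text. The function names and the frame layout (lists of rows, indexed `[y][x]`) are choices made for this sketch.

```python
from typing import List, Tuple

Frame = List[List[int]]
MV = Tuple[int, int]


def sad(orig: Frame, ref: Frame, bx: int, by: int, mv: MV, size: int) -> int:
    """Sum of absolute differences for a size x size block at (bx, by).

    The reference sample is taken at (x - mv_x, y - mv_y); no bounds
    checking is done, so the displaced block must lie inside the frame.
    """
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(orig[by + y][bx + x] - ref[by + y - mv[1]][bx + x - mv[0]])
    return total


def signed_exp_golomb_bits(value: int) -> int:
    """Codeword length of a signed Exp-Golomb code (assumed rate model)."""
    code_num = 2 * abs(value) - (1 if value > 0 else 0)
    return 2 * (code_num + 1).bit_length() - 1


def motion_cost(orig: Frame, ref: Frame, bx: int, by: int,
                mv: MV, pred: MV, lam: float, size: int) -> float:
    """Distortion plus lam times the bits of the motion vector difference."""
    rate = signed_exp_golomb_bits(mv[0] - pred[0]) + signed_exp_golomb_bits(mv[1] - pred[1])
    return sad(orig, ref, bx, by, mv, size) + lam * rate


if __name__ == "__main__":
    orig = [[0, 0, 0, 0], [0, 5, 6, 0], [0, 7, 8, 0], [0, 0, 0, 0]]
    ref = [[0, 0, 0, 0], [5, 6, 0, 0], [7, 8, 0, 0], [0, 0, 0, 0]]
    for mv in [(0, 0), (1, 0)]:
        print(mv, motion_cost(orig, ref, 1, 1, mv, (0, 0), lam=2.0, size=2))
```

In the usage example the vector (1, 0) matches the block exactly, so its cost consists only of the weighted rate, while the zero vector is cheaper to code but pays a much larger distortion. A larger multiplier shifts this balance toward vectors close to the predictor, which is why the choice of multiplier affects the accuracy of the motion field.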
TABLE I PERFORMANCE COMPARISON OF PROPOSED DIRECT MODE METHODS WITHIN THE JVT CODEC
The $SSD$ is the sum of the squared differences between the original block and its reconstruction and is calculated as
$$SSD(s, c, MODE \mid QP) = \sum_{x=1,y=1}^{16,16} \left( s_Y[x, y] - c_Y[x, y, MODE \mid QP] \right)^2 + \sum_{x=1,y=1}^{8,8} \left( s_U[x, y] - c_U[x, y, MODE \mid QP] \right)^2 + \sum_{x=1,y=1}^{8,8} \left( s_V[x, y] - c_V[x, y, MODE \mid QP] \right)^2 \tag{12}$$
and $R(s, c, MODE \mid QP)$ is the actual number of bits associated with choosing $MODE$ and $QP$, including the bits for the MB header, the motion, and all DCT blocks. $c_Y$ and $s_Y$ represent the reconstructed and original luminance values, while $c_U$, $c_V$ and $s_U$, $s_V$ represent the corresponding chrominance values. The Lagrangian multiplier $\lambda_{MODE}$ is given by $\lambda_{MODE} = \lambda_{I,P} = 0.85 \times 2^{(QP-12)/3}$ for I and P slices while for B slices, $\lambda_{MODE} = \lambda_B = 4 \times \lambda_{I,P}$, where $QP$ is the MB quantization parameter. $\lambda_{MOTION}$ is calculated as $\lambda_{MOTION} = \sqrt{\lambda_{MODE}}$.
In the entire RDO process the selection of $\lambda$ probably plays the most significant role. Although the above selection is partly based on theory [9], in reality this was mainly based on experiments performed using P slices only, while I and B slices were not carefully considered. More specifically, we have observed that due to the fairly high value of $\lambda_B$, there is relatively small variation in the performance, modes, and bits assigned for coding a B slice compared to the use of a slightly different $\lambda_B$. Nevertheless, if spatial DIRECT is used, reducing $\lambda_B$ would lead to a more accurate motion field, thus better exploiting spatial correlation. Furthermore, using a smaller $\lambda_{I,P}$ would lead to a quality improvement within I and P slices, which indirectly would also lead to considerably better quality for B slices, since the references and motion vectors tend to be more accurate due to the smaller $\lambda$, while the B slice bit rate remains relatively the same. Considering that B slices may occupy a significant percentage of the entire sequence’s bit rate, especially if the same
quantizer is used for all slice types, using values of $\lambda_{I,P}$ and $\lambda_B$, when all slice types exist within a bitstream, that differ from the ones originally recommended by H.264 can lead to improved coding efficiency. In particular, we have empirically found that using $\lambda_{I,P} = 0.68 \times 2^{(QP-12)/3}$ and $\lambda_B = \max(2, \min(4, (QP-12)/6)) \times \lambda_{I,P}$, when B slices are used, can considerably improve coding efficiency within an H.264 encoder.
Considering that DIRECT mode is the most efficient mode within B slices, the H.264 standard introduced two alternative methods to signal a DIRECT coded MB. If an MB is DIRECT and has no residual information, then this property can be further exploited by using RLC methods as we have previously described. This is essentially very similar to SKIP mode and can be referred to as the B_Skip mode. Nevertheless, if residual needs to be encoded, the DIRECT mode is explicitly signaled with all its associated residual information. This DIRECT mode is called the B_Direct mode. If we carefully examine the definition for $MODE$ during the RDO process, we observe that even though for P slices SKIP mode is considered within the mode decision, only one instance of DIRECT mode is examined. DIRECT essentially considers both B_Skip and B_Direct depending on the availability of a residual signal. However, considering that B_Skip can be considerably more efficient than B_Direct, we can modify $MODE$ for B slices to consider both cases regardless of whether there is any residual information associated with DIRECT mode. Obviously this consideration does not follow the proper quantization process, which is also the case for SKIP mode, and can be seen as a coefficient thresholding process. We have found though that it is best to consider B_Skip as an additional mode in the mode decision only if at least one of the 8×8 blocks within an MB has no residue. This comes from the fact that if one 8×8 block has no residue then it is also likely that its adjacent ones have little or insignificant information that could be further discarded without having much impact on quality.
VI. SIMULATION RESULTS
We have used the H.264 reference software version 4.2 in all of our simulations [10]. This software already supports the spatial DIRECT method and the proposed RDO method for
Fig. 7. Rate Distortion comparison for sequences (a) Foreman CIF and (b) Mobile CIF at 30 f/s.
B slices, since this has already been proposed to and adopted by the H.264 standard [8]. The MB-based spatiotemporal combination was also implemented, and we have also evaluated the picture/slice combination by introducing the necessary bits as described in Section IV. For our simulations we have selected several high and low motion sequences, including QCIF resolution sequences Foreman, Container, News, Silence, Stefan and CIF sequences Paris, Mobile, Tempete, Foreman, Stefan, and Coastguard. QCIF sequences Foreman, Container, and News were all coded at 10 f/s, while Silence, QCIF Stefan, and Paris were coded at 15 f/s. For all other sequences we have used 30 f/s, while an IBBPBBP… structure and a single slice per picture were used for all cases. We have also used a search area of 32, 5 reference pictures, and the CAVLC entropy method. Additional experiments and a more detailed analysis can be seen in [8]. For each sequence quantizer values of 28, 32, 36, and 40 were used. To simplify the performance evaluation we have used the integral rate-distortion method described in [11]. We have also implemented the original RDO method (old RDO in Table I), which uses the old $\lambda$ values and does not explicitly consider B_Skip, within the same software. Our detailed results can be seen in Table I. In this table, we compare the performance of the proposed RDO versus the original scheme, and the temporal versus the spatial, the MB-based spatiotemporal DIRECT, and the combined methods. The combined methods we have examined are the picture/slice combination of spatial and temporal, the combination of spatial and MB-based spatiotemporal, and the combination of all three possible DIRECT types. From this table, we immediately observe that the modifications in the RDO process have introduced a substantial gain in PSNR (between 0.15 and 0.7 dB) or equivalently a significant bit rate reduction versus the original RDO scheme (between 2.7% and 15%).
This immediately justifies our approach of introducing the B_Skip mode and the smaller $\lambda$ within the software. If we compare only the temporal with the spatial method, we observe that for sequences with relatively smaller motion, such as Container, News, and Tempete, the temporal method gives better performance, while for the others, the spatial method appears to be better. Apparently Container and News are sequences with very small or local motion in which the spatial method would have little benefit. Tempete on the other hand is a zooming sequence that contains many irregular smaller motion types that would not satisfy the spatial correlation conditions. The other sequences tend to be more spatially correlated, while we observe (Fig. 7) that performance seems to also be related to the quantization values. This comes from the observation that for smaller quantizers, $\lambda$ is also smaller and can lead to a more accurate motion field within a B slice, while for larger quantizers $\lambda$ is considerably larger and motion tends to be less refined. Temporal, on the other hand, relies more on the motion of P slices, which use a smaller $\lambda$, and thus can be more accurate at low bit rates. The performance of the spatial technique is nevertheless justified by the additional benefits (memory, no division, etc.) that it provides. On the other hand, the MB-based spatiotemporal method seems to achieve a compromise between both methods. Even though performance for sequences that considerably benefited from the use of spatial DIRECT has slightly decreased (i.e., Foreman, Stefan, and Coastguard), performance has instead improved for all other sequences, either due to improvements at low bit rates (Mobile), or due to better decisions based on image context (Paris, News, Tempete). The only sequence where the temporal method is still better is Container, but the difference is insignificant. As we expected, the picture/slice level combining of these methods can lead to even better performance. For example, the combination of the spatial and temporal methods could lead to an additional 2% bit rate reduction for Mobile and Foreman CIF, mainly because each method tends to be better for different quantizers.
Replacing the temporal method with the MB-based combination leads to slightly higher gain, while the combination of all three methods together seems to always lead to the best performance. It can be observed that if we jointly consider the RDO process and the introduction of the new DIRECT modes, we can achieve an average performance improvement of 0.49 dB in PSNR or 10.8% in terms of bit rate reduction.
VII. CONCLUSION
In this letter we introduced alternative methods that consider spatial correlation for the computation of the motion parameters of DIRECT modes, which have been proposed to and adopted by the H.264 standard. In particular, we introduced a spatial DIRECT mode scheme, which can achieve similar and in some cases better performance compared to the temporal DIRECT mode, but has lower memory requirements, does not require division, and is timestamp independent. Further methods of jointly considering spatial and temporal correlation, either at the MB or picture/slice level, and modifications to the Rate Distortion Optimization with regard to B slices, which can lead to further improvements in coding efficiency, were also presented.
REFERENCES
[1] Advanced Video Coding for Generic Audiovisual Services. [Online]. Available: http://www.itu.int/rec/recommendation.asp?type=folders&lang=e&parent=T-REC-H.264
[2] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.
[3] M. Flierl and B. Girod, “Generalized B pictures and the draft H.264/AVC video-compression standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 587–597, Jul. 2003.
[4] Information Technology—Generic Coding of Moving Pictures and Associated Audio Information: Video, ISO/IEC Standard 13 818–2:2000.
[5] Information Technology—Coding of Audio-Visual Objects—Part 2: Visual, ISO/IEC Standard 14 496–2:2001.
[6] F. Kossentini, Y.-W. Lee, M. J. T. Smith, and R. K. Ward, “Predictive RD optimized motion estimation for very low bit-rate video coding,” IEEE J. Sel. Areas Commun., vol. 15, no. 12, pp. 1752–1763, Dec. 1997.
[7] A. M. Tourapis, “Enhanced predictive zonal search for single and multiple frame motion estimation,” in Proc. Visual Communications and Image Processing, San Jose, CA, Jan. 2002, pp. 1069–1079.
[8] A. M. Tourapis, F. Wu, S. Li, and G. Sullivan, “Timestamp Independent Motion Vector Prediction for P and B Frames With Division Elimination,” Document JVT-D040, 2002.
[9] G. Sullivan and T. Wiegand, “Rate distortion optimization for video compression,” IEEE Signal Process. Mag., pp. 74–90, Nov. 1998.
[10] Joint Video Team Reference Software, Version 4.2 (JM4.2). [Online]. Available: http://bs.hhi.de/~suehring/tml/download/
[11] G. Bjontegaard, “Calculation of average PSNR differences between RD-curves,” Document VCEG-M33, 2001.