IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 12, NO. 3, MARCH 2003
Refreshment Need Metrics for Improved Shape and Texture Object-Based Resilient Video Coding

Luis Ducla Soares, Student Member, IEEE, and Fernando Pereira, Senior Member, IEEE
Abstract—Video encoders may use several techniques to improve error resilience. In particular, for video encoders that rely on predictive (inter) coding to remove temporal redundancy, intra coding refreshment is especially useful to stop error propagation when errors occur in the transmission or storage of the coded streams, which can cause the decoded quality to decay very rapidly. In the context of object-based video coding, the video encoder can apply intra coding refreshment to both the shape and the texture data. In this paper, shape refreshment need and texture refreshment need metrics are proposed which can be used by object-based video encoders, notably MPEG-4 video encoders, to determine when the shape and the texture of the various video objects in the scene should be refreshed in order to improve the decoded video quality, e.g., for a given bitrate.

Index Terms—Concealment difficulty, error resilience, intra refreshment, object-based coding.
I. CONTEXT AND OBJECTIVES
With the availability of the MPEG-4 object-based audiovisual coding standard [1], new multimedia services and devices are making their way into the market. In particular, there is an increasing interest in mobile multimedia services, such as mobile videotelephony and mobile video streaming. However, mobile networks are highly error-prone environments, which makes it difficult, if not impossible, to transmit highly compressed video with an acceptable quality without appropriate error resilience techniques. Error resilience techniques, and especially error concealment, are usually seen as playing a role at the decoder side of the communication chain. However, this is only partly true since the encoder and the bitstream syntax itself play an important role in what can be achieved at the decoder side in terms of error concealment. For instance, it is largely recognized that intra coding refreshment can play a major role in improving error resilience in video coding systems that rely on predictive (inter) coding to remove temporal redundancy because, in these conditions, the decoded quality can decay very rapidly due to error propagation if errors occur in the transmission or storage of the coded streams. Therefore, in order to avoid error propagation for too long a time, the encoder can use an intra coding refreshment scheme to refresh the decoding process and stop (spatial and
Manuscript received March 5, 2002; revised November 22, 2002. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Tamas Sziranyi. The authors are with Instituto Superior Técnico, Instituto de Telecomunicações, 1049-001 Lisboa, Portugal (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TIP.2003.809018
temporal) error propagation. The refreshment process will certainly decrease the coding efficiency, but it will significantly improve error resilience at the decoder side, increasing the overall subjective impact [2]. In order to design an efficient intra coding refreshment scheme for an object-based coding scheme, it would be helpful for the encoder to have a method to determine which components of the video data (shape and texture) of which objects should be refreshed and when, especially if based on an appropriate refreshment need measure. The refreshment need for a given visual data entity, such as a shape or a texture macroblock for macroblock-based coding schemes, is a metric that measures the necessity of refreshing this visual data entity according to some error resilience criteria and can be used by the encoder to decide if a given visual entity should be refreshed or not at a certain point in time. In the following, shape refreshment need (SRN) and texture refreshment need (TRN) metrics are proposed; these metrics are to be integrated in a complete intra coding refreshment scheme, considering both shape and texture. Since these metrics characterize the video information in terms of refreshment need by analyzing the available data, this is a case of video processing to obtain metadata, which is then used for resilient coding (metadata for coding). In the past, some metrics have been defined with similar objectives for frame-based video, which means that shape information was never considered, only the (rectangular) texture. In [3], it is suggested that intra coding refreshment could be applied only to the parts of the image that would be difficult to conceal (with temporal concealment techniques) in the case that errors occurred. This means that only the blocks that are very different from their co-located ones in the previous frame would be refreshed; the mean squared difference of blocks is used as a simple measure of this difficulty. 
This idea would later lead to the Adaptive Intra Refreshment (AIR) technique suggested in the MPEG-4 standard [1] for frame-based video coding. However, in the AIR technique, the Sum of Absolute Differences (SAD) is used to determine if a block should be refreshed or not. If the SAD value is greater than a certain threshold, the block is considered to be an important area that should be refreshed. These important blocks will then be cyclically refreshed, instead of being refreshed all at once as in [3]. In both of these cases, the metrics used by the encoder to decide if a given block should be refreshed only consider the characteristics of the blocks and do not take into account the bitstream related aspects, such as the number of bits that were spent for a block or the probability of those bits being in error.
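For illustration, the SAD-based refresh decision described above can be sketched as follows; the flat-list block representation, the function names, and the threshold value are illustrative assumptions, not part of the MPEG-4 specification (which leaves the threshold and the cyclic refresh bookkeeping encoder-dependent).

```python
def sad(curr_block, prev_block):
    """Sum of Absolute Differences between co-located blocks,
    each given as a flat list of pixel values."""
    return sum(abs(a - b) for a, b in zip(curr_block, prev_block))

def mark_important_blocks(curr_blocks, prev_blocks, threshold=512):
    """Flag blocks whose SAD against the co-located block in the
    previous frame exceeds a threshold; in AIR, the flagged blocks
    are then refreshed cyclically, a few per frame, rather than
    all at once as in [3]."""
    return [sad(c, p) > threshold
            for c, p in zip(curr_blocks, prev_blocks)]
```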
1057-7149/03$17.00 © 2003 IEEE
In [4] and [5], another scheme is proposed where the intra refreshment rate for different parts of the image is adaptively adjusted to the channel conditions and video characteristics. This scheme is based on an error sensitivity metric, which is accumulated at the encoder and represents the vulnerability of each block to channel errors. This metric considers the characteristics of the block (assuming that temporal concealment is used) as well as some bitstream aspects, such as the probability of the texture data having errors. The intra refreshment decisions are made according to the values of this accumulated error sensitivity metric for each block.
II. DEFINING SHAPE REFRESHMENT NEED

In MPEG-4 shape coding [1], the only way to completely eliminate the shape dependency on the past is by refreshing the entire shape of a video object plane (VOP). There are two reasons for this. The first one is that, unless the entire shape of the VOP is intra coded, the shape coding mode of a given block will always be inter coded, even if the shape data itself is intra coded, which means that it does not stand on its own and depends on previous coding modes. The second one is that the prediction error for an inter coded shape block cannot be decoded without errors if the shape block used as prediction was not reconstructed without errors. Therefore, although shape coding is block-based, the SRN metric has to be VOP-based since VOP intra refreshment is the only way to guarantee full shape refreshment. This way, by considering the most important factors impacting the SRN, the SRN_n parameter, measuring the SRN for VOP n in a video object sequence, can be defined as

SRN_n = SEV_n × SCD_n(SSCD'_n, TSCD'_n)   (1)

where SEV_n is the Shape Error Vulnerability (SEV), SCD_n is the Shape Concealment Difficulty (SCD), SSCD'_n is the scaled Spatial Shape Concealment Difficulty (SSCD), and TSCD'_n is the scaled Temporal Shape Concealment Difficulty (TSCD), all for VOP n. Additionally, SCD_n is a function that combines the values of SSCD'_n and TSCD'_n (both with values between 0 and 1) into a single SCD value. By using versions of SSCD and TSCD scaled to the same range ([0, 1]), it becomes easier to combine them into a final SCD metric, possibly as a weighted sum. If SSCD and TSCD had different ranges, extra constants would have to be introduced in the expression of SCD to compensate for the different ranges of the two concealment difficulty metrics. To obtain the scaled version M'_n (with values between 0 and 1) of a given metric M_n, the following expression is used:

M'_n = min(M_n / M_max, 1)   (2)

where M_max is the practical limit of the metric.

SRN_n is defined as a product where the first factor (i.e., the SEV) measures the statistical exposure of the shape data to errors, already considering the used bitstream structure, and the second factor (i.e., the SCD) expresses how hard the shape is to recover using concealment techniques. A product is used because the dependency of the SRN on the SEV and the SCD should be such that, when either one of them approaches zero, so should the SRN. For instance, if the considered shape is extremely easy to conceal (SCD approaching zero), the SRN should also approach zero independently of the SEV. On the other hand, if the shape is not vulnerable to errors (because almost no bits were spent or the amount of channel errors is very small), the SRN should also approach zero independently of the SCD.

A. Defining Shape Error Vulnerability

It is proposed here that the shape error vulnerability, SEV_n, be measured by the fraction of shape bits that will have to be discarded due to errors in VOP n. This depends on the type and amount of errors, the number of bits used and how they are arranged in the bitstream, and the decoder's error detection capabilities. Moreover, coded VOPs can be divided into several independently decodable video packets (VPs) to improve resynchronization and thus error resilience. In particular, for the MPEG-4 standard, the video information in a VP can be arranged according to two possible strategies, which are further detailed in [1]:

• Combined mode—All the information (shape, motion, and texture) is multiplexed at the macroblock level, justifying the name of "combined mode".

• Data partitioning—The VP is divided into two separate partitions, where the first one holds the shape and motion data multiplexed at the macroblock level and the second one the texture data (also multiplexed at the macroblock level). This strategy is the preferred one when using error-prone channels since more sophisticated error concealment solutions may be used and, therefore, it will be the only one considered in the remainder of this paper.

Here, it will be conservatively assumed that, if a given VP has at least one error in the first partition (including the shape and motion data), the whole shape data in the VP will be discarded; otherwise, no shape data bits will be discarded. This corresponds to the simplest possible recovery strategy at the decoder and it is assumed that all decoders can do it. Under this assumption, the expression for the SEV_n parameter becomes

SEV_n = Σ_{v=1}^{V_n} P_{n,v} · S_{n,v}   (3)

where v is the VP number within VOP n, V_n is the number of VPs in VOP n, P_{n,v} is the probability of the {shape of VP v in VOP n is discarded} event, and S_{n,v} is the percentage of shape bits (for VOP n) that are carried in VP v.

In order to determine the probability of the shape data in one VP being discarded, the possible error patterns have to be considered. Here, as it happened for the relevant MPEG experiments [6], three types of errors are considered.

1) Uniform Errors: In this case, the shape data in a given VP is discarded if at least one bit error exists in the first VP partition. Let p be the probability of a given bit being in error (i.e., bit
error rate or BER); then, the probability of the shape data in the VP being discarded is given by

P(shape of the VP is discarded) = 1 − (1 − p)^{N_1}   (4)

where N_1 is the number of bits in the first partition, including the resync marker which indicates the beginning of the VP, the VP header, the shape data, the motion data, and the marker which indicates the end of the first partition.

2) Burst Errors: In this case, the shape data of the VP is discarded if at least one error burst starts within the first partition. Other, less likely, situations may also lead to the discarding of the shape data of the VP, but these cases will not be considered here. Let p_t be the probability of a burst starting at a given bit; then, the probability of the shape data in the VP being discarded is given by (4) by replacing p with p_t. When a two-state Markov chain is used to model the burst errors (i.e., the Gilbert model [7]), the probability p_t of a burst starting at a given bit is simply given by the transition probability (hence the subscript t in p_t) from the state outside the burst to the state inside the burst [8], which depends on the average BER, the BER within the burst (i.e., burst BER), and the burst length in bits:

p_t = BER_average / [burst_length × (BER_burst − BER_average)]   (5)

3) Packet Losses: Similarly to what was done for burst errors, the shape data of the VP is discarded if at least one multiplex packet starting inside the first partition is lost. By assuming that the multiplex packets that can be lost have a fixed payload length of S bits (this restriction will be relaxed later) and that the start of the first multiplex packet (starting inside the first partition) coincides with the first bit in the first partition of the VP, the probability of at least one lost multiplex packet starting inside the first partition of the VP is simply given by (4) by replacing p with the packet loss rate, PLR, and N_1 with N_0, the number of multiplex packets starting within the first partition of the VP assuming that the start of the first multiplex packet coincides with the first bit (bit 0) in the first partition. However, the start of the first multiplex packet (starting inside the first partition) does not necessarily coincide with the first bit in the first partition of the considered VP; it can coincide with equal probability with any of the first S bits, depending on the variable number of bits that were spent in the previous VPs. This way, the probability of at least one lost packet starting inside the first partition of the VP is given by

P(shape of the VP is discarded) = (1/S) Σ_{i=0}^{S−1} [1 − (1 − PLR)^{N_i}]   (6)

where N_i is the number of multiplex packets starting within the first partition of the VP assuming that the start of the first multiplex packet (starting inside the partition) coincides with bit i in the first partition. If the payload size of the used packets is not fixed, then, instead of S, the average packet payload size should be used.
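For illustration, the three discard probabilities of (4)–(6) can be sketched as follows; the function and parameter names are illustrative assumptions, and the packet-loss sketch expects the per-alignment packet counts N_i to be supplied by the caller.

```python
def p_discard_uniform(ber, n1_bits):
    """(4): probability of at least one bit error among the n1_bits
    of the first VP partition, for a uniform bit error rate."""
    return 1.0 - (1.0 - ber) ** n1_bits

def burst_start_probability(avg_ber, burst_ber, burst_len_bits):
    """(5): Gilbert-model transition probability into the burst state,
    derived from the average BER, the BER inside the burst, and the
    average burst length in bits."""
    return avg_ber / (burst_len_bits * (burst_ber - avg_ber))

def p_discard_burst(avg_ber, burst_ber, burst_len_bits, n1_bits):
    """Burst errors: reuse (4) with the burst-start probability."""
    return p_discard_uniform(
        burst_start_probability(avg_ber, burst_ber, burst_len_bits),
        n1_bits)

def p_discard_packet_loss(plr, packets_starting, payload_bits):
    """(6): average over the payload_bits equally likely alignments of
    the first multiplex packet; packets_starting[i] is the number of
    packets starting inside the first partition for alignment i."""
    s = payload_bits
    return sum(1.0 - (1.0 - plr) ** packets_starting[i]
               for i in range(s)) / s
```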
B. Defining Shape Concealment Difficulty

As for the second factor in (1), it measures the shape concealment difficulty, SCD, which depends on the concealment techniques used at the receiver. Since these techniques can be spatial and/or temporal, the SCD in (1) is expressed as a function of SSCD'_n and TSCD'_n, which has to be defined according to the used error concealment solutions. Therefore, in order to have a precise definition of SCD_n, the encoder needs to know the type of concealment that is being used at the decoder. If this information is not available, the encoder will simply have to assume that the concealment is the simplest possible, in order to ensure proper working conditions for all kinds of decoders in terms of error concealment capabilities. Two relevant concealment solutions are as follows.

• Static shape concealment—In this case, the shape concealment scheme used at the decoder is always the same and known beforehand, notably by the encoder; therefore, the function SCD_n that expresses the shape concealment difficulty is fixed and defined in advance. As an example, in a simple case where only temporal concealment is used at the decoder, the function SCD_n should not depend on SSCD'_n and should be simply equal to the TSCD'_n parameter.

• Dynamic shape concealment—In this case, the shape concealment scheme used at the decoder is adaptively chosen in order to get the best concealment results. Therefore, it is impossible for the encoder to know exactly what type of concealment scheme (spatial, temporal, or a combination of both) is being used at a given time instant (since the decoder selection will also depend on the channel errors). A possible solution is to assume that the decoder is "intelligent" and always decides to use the concealment scheme that will yield the best results, which appears to be a sensible assumption.
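Assuming this "intelligent decoder" case, the SRN computation of (1) with the scaling of (2) can be sketched as follows; the min-based combination for SCD_n and all names are illustrative assumptions, and the practical limits (14 for SSCD and 2 for TSCD) are the values derived later in this paper.

```python
def scale(metric, practical_limit):
    """Scale a metric to [0, 1] as in (2), clipping at its practical limit."""
    return min(metric / practical_limit, 1.0)

def shape_refreshment_need(sev, sscd, tscd,
                           sscd_max=14.0, tscd_max=2.0, combine=min):
    """SRN for one VOP as in (1): SEV times the combined concealment
    difficulty. The combination function is left open by the paper;
    min() is used here to model an 'intelligent' decoder that always
    picks the easiest concealment strategy."""
    scd = combine(scale(sscd, sscd_max), scale(tscd, tscd_max))
    return sev * scd
```

Note that the product form behaves as required: if either the vulnerability or the concealment difficulty approaches zero, so does the SRN.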
This way, the encoder can obtain a good estimate of the shape concealment difficulty by considering that the function SCD_n corresponds to the minimum of the spatial, temporal, and spatio-temporal shape concealment difficulties. While the first two concealment difficulties are given by the SSCD'_n and TSCD'_n parameters, respectively, the spatio-temporal concealment difficulty, which is the concealment difficulty when a combination of both spatial and temporal concealment techniques is used, will have to be estimated. This can be done by considering the values of the SSCD'_n and TSCD'_n parameters and how much improvement can be obtained by combining both types of techniques. Another solution is to assume that the decoder only implements a rather simple error concealment technique (e.g., only temporal concealment without motion compensation). Whenever nothing is known about the decoder concealment capabilities, this simple solution represents a conservative approach since it provides an intra refreshment solution without risks, in the sense that sufficient refreshment will always be present; this assurance may have to be paid for with some decrease in terms of compression efficiency, sometimes
Fig. 1. Shapes with different intrinsic shape complexity: (a) low complexity and (b) high complexity.

Fig. 2. CSS image formation [9].
not really necessary because the decoder is more "intelligent" than expected.

1) Spatial Shape Concealment Difficulty:

a) Defining SSCD: The SSCD'_n term in (1) expresses how difficult it is to recover the shape in question by only using spatial information (no video information from past time instants). In order to define an efficient SSCD metric, it is essential to consider the factors that have a major impact on the spatial shape concealment difficulty. This way, it is proposed that the SSCD_n parameter (i.e., the SSCD for VOP n) be defined as the following product:

SSCD_n = ISC_n × SGF_n   (7)

where, for VOP n, ISC_n is the Intrinsic Shape Complexity (ISC) and SGF_n is the Spatial Grid Factor (SGF).

ISC expresses the complexity of the shape associated to its spatial variation. For instance, if a given fraction of the shape has been lost, it is much easier to conceal if the shape to recover is smooth rather than edgy with very fast changes along its contour. This can be seen in Fig. 1, where two shapes with different intrinsic complexities are shown. As for the ISC computation, it is proposed to base it on the contour-based shape descriptor adopted by the MPEG-7 Visual standard [9]. This contour-based shape descriptor is defined using the Curvature Scale Space (CSS) representation of the contour [10]. This type of use of metadata information shows that the description tools provided by the MPEG-7 multimedia content description standard can also be useful for video encoding, and not only for multimedia retrieval and filtering.

The CSS data extraction process is illustrated in Fig. 2, which shows a contour at two different stages of the smoothing process (after 20 and 80 passes of the low-pass filter). Next to the contours is the CSS image obtained from the contour evolution, where the contour curvature zero-crossings (A through H) and the corresponding points on the CSS image are marked. Since the number of CSS peaks, as well as their heights, are related to the shape complexity, it is proposed that the ISC_n parameter be defined as

ISC_n = Σ_k a_{n,k}   (8)

where k is the index corresponding to the peak number of the CSS representation and a_{n,k} is the amplitude of peak k in the CSS description for the shape of VOP n. As in the MPEG-7 Visual standard [9], the amplitude of a given peak k in VOP n is determined based on the CSS image, as follows:

a_{n,k} = i_{n,k} / N   (9)

where i_{n,k} is the number of iterations that are necessary to create peak k in the CSS image and N is the
Fig. 3. (a) Video object with (6) holes and (b) its multiple contours (7).
number of equidistant samples that are used to represent the contour; in this case, the number of samples used to represent the contour is the same as the number of points in the original contour, which ensures an accurate representation of the contour. The above ISC definition is valid for objects with a single contour, i.e., with only one connected component and no holes. However, many objects have more than one contour;¹ this means alpha planes with nonconnected regions and/or holes, such as the Cyclamen object in Fig. 3. As can be seen, when the contour is extracted from the alpha plane, multiple contours (in this case, 7) appear due to the 6 holes (the seventh is the outer contour). Therefore, the CSS representation has to be computed for all the individual contours; in that case, the total ISC is obtained by adding the peak amplitudes for all the contours of the object in question.

The SGF factor relates to the way the shape fits in the coding grid imposed by the video codec. Although this factor is closely related to the shape size, the shape size by itself is not a good concealment difficulty measure; in fact, if a shape with half the resolution in both directions is encoded using an 8 × 8 grid, the SCD will be about the same as with the original shape being encoded with a 16 × 16 grid. This, of course, is true if the detail loss due to the resolution reduction of the contour is not taken into account, which appears to be a good approximation for shapes that are not extremely complex (i.e., less complex than the Cyclamen object shown in Fig. 1). For these extremely complex shapes, when the resolution decreases so does their complexity because a lot of the detail is removed and the shapes become smoother, which means that they are also easier to conceal.
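Given the CSS peaks of each contour, the ISC computation of (8) and (9), including the summation over multiple contours described above, can be sketched as follows; the function and argument names are illustrative, and extracting the CSS peaks themselves (iterative contour smoothing and zero-crossing tracking) is omitted.

```python
def isc_from_css_peaks(contours_peak_iterations, num_contour_samples):
    """ISC as in (8)-(9): sum of CSS peak amplitudes over all contours
    of the object. contours_peak_iterations has one entry per contour,
    each entry listing, per CSS peak, the number of smoothing iterations
    needed to create that peak; num_contour_samples is the number of
    equidistant samples representing each contour."""
    isc = 0.0
    for peak_iterations in contours_peak_iterations:
        for iterations in peak_iterations:
            # amplitude of one peak, as in (9)
            isc += iterations / num_contour_samples
    return isc
```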
¹A contour is here understood as a closed planar curve that defines the limits of one connected component or region of an object. If an object has any holes or more than one connected component, then several such contours are needed to define the boundaries of the object.

In order to define the SGF, it should be noticed that the use of a block-based coding grid implies the existence of three types of shape blocks: transparent blocks (all shapels in the block are transparent), opaque blocks (all shapels in the block are opaque), and border blocks (blocks with both opaque and transparent shapels inside). Of the three types of blocks, the border blocks correspond to the hard case in terms of shape concealment since an arbitrary opaque-transparent frontier has to be defined. Therefore, when considering two shapes with the same ISC values, the one with the larger percentage of border blocks is considered harder to conceal. Similarly, if two shapes with the same ISC values and the same percentage of border blocks are considered, the one with the higher ratio of opaque shapels inside border blocks should be considered harder to conceal. This is so because, if border blocks are lost in the object that has the larger percentage of opaque shapels in border blocks, a larger percentage of the whole shape will be affected. Therefore, the proposed measure for SGF_n is

SGF_n = √(B_n · O_n)   (10)

with

B_n = (number of border blocks) / (total number of border and opaque blocks)   (11)

O_n = (number of opaque shapels in border blocks) / (total number of opaque shapels)   (12)

The two factors, B_n and O_n, express the effects described above in terms of spatial shape concealment. Since these factors vary between 0 and 1, so will their product. However, these factors are not completely independent and, therefore, their simple multiplication would yield an artificially small value. To avoid this undesirable effect, the square root is used.

b) Evaluating the SSCD Performance: In order to illustrate the performance of the proposed spatial shape concealment difficulty metric, SSCD of (7), results are given in Table I for the shapes with different complexities shown in Fig. 4, which correspond to the first VOP of the object sequences Cyclamen, Tennis Player, Weather Girl, and Logo (from the News sequence). The results in Table I show that the proposed SSCD metric behaves rather well in terms of expressing the spatial concealment difficulty because it ranks the Weather Girl, Tennis Player, Cyclamen, and Logo objects as increasingly difficult to conceal with spatial concealment techniques. In addition, as expected, SSCD values go up as the resolution goes down, notably because the percentage of border blocks increases. The only exception is the Logo object, whose SSCD goes up with the resolution. This can be explained by considering two facts. First of all, the SGF is 1 for both resolutions because the object is so complex that no opaque blocks exist, only border blocks. Second, since this object is extremely complex, when the resolution decreases a lot of detail is lost, meaning that the object becomes smoother and, therefore, easier to conceal. Since the highest spatial shape concealment difficulty value, SSCD, for the Logo object (the hardest object to conceal with spatial techniques in the MPEG-4 test set) is 13.504, the practical limit of the SSCD values will be considered to be 14. This
TABLE I
SPATIAL SHAPE CONCEALMENT DIFFICULTY MEASURES FOR VARIOUS TEST SHAPES, IN ADDITION TO THEIR INTRINSIC SHAPE COMPLEXITY AND SPATIAL GRID FACTOR
way, (2) can be used to determine the scaled SSCD'_n metric by setting SSCD_max to 14.

Fig. 4. Shapes with different spatial concealment difficulty in their bounding boxes (QCIF) with the coding grid (16 × 16) overlaid: (a) Cyclamen, (b) Tennis Player, (c) Weather Girl, and (d) Logo.

2) Temporal Shape Concealment Difficulty:

c) Defining TSCD: The TSCD'_n term in (1) expresses how difficult it is to conceal the shape in question by using only temporal information. In order to define the TSCD metric, it is essential to consider the factors that have a major impact on the temporal shape concealment difficulty. This way, it is proposed that the TSCD_n parameter (i.e., the TSCD for VOP n) be defined as the following product:

TSCD_n = TSS_n × TGF_n   (13)

where, for VOP n, TSS_n is the Temporal Shape Stability (TSS) and TGF_n is the Temporal Grid Factor (TGF).

TSS expresses the amount of shape variation from one time instant to the next because a given fraction of the shape that has been lost will be much easier to conceal if the shape to recover has changed very little with respect to the previous time instant. To express the amount of change in the shape, a good metric seems to be the number of different shapels in the current and previous alpha planes for a given object. However, this metric measures only the absolute change, and the concealment difficulty is also related to the relative amount of change. Therefore, the number of changed shapels relative to the size of the object appears to be a good metric, resulting, for TSS_n, in

TSS_n = (number of different shapels in the current and previous VOPs) / (number of opaque shapels in the current VOP)   (14)

As for TGF, it expresses the way the shape fits in the coding grid imposed by the video codec, and the comments that were made in Section II-B1 regarding the SGF also apply here. In order to define TGF, it should be noticed that the use of a block-based coding grid implies the existence of two types of shape blocks in terms of temporal stability: stable blocks (blocks whose shape does not change) and unstable blocks (blocks whose shape changes). Of these two cases, the unstable blocks correspond to the hard case in terms of temporal shape concealment because the co-located block in the previous time instant cannot simply be copied. Therefore, when considering two shapes with the same TSS values, the one with the larger percentage of unstable blocks is considered harder to conceal. Similarly, if two shapes with the same TSS values and the same percentage of unstable blocks are considered, the one with the higher ratio of opaque shapels inside unstable blocks should be considered harder to conceal. This is so because, if unstable blocks are lost in the object that has the larger percentage of opaque shapels in unstable blocks, a larger percentage of the whole shape will be affected. Therefore, similarly to Section II-B1, the proposed definition for TGF_n is

TGF_n = √(U_n · Q_n)   (15)
Fig. 5. Shapes with different temporal concealment difficulties in their bounding boxes with coding grid (16 × 16) overlaid: (a) frames 0 and 3 of Cyclamen; (b) frames 180 and 182 of Tennis Player; (c) frames 0 and 3 of Weather Girl; (d) frames 0 and 3 of Logo; (e) frames 66 and 69 of Large Boat; and (f) frames 69 and 72 of Small Boat.

TABLE II
TEMPORAL SHAPE CONCEALMENT DIFFICULTY MEASURES FOR VARIOUS TEST SHAPES, IN ADDITION TO THEIR TEMPORAL SHAPE STABILITY AND TEMPORAL GRID FACTOR
with

U_n = (number of unstable blocks) / (total number of border and opaque blocks)   (16)

Q_n = (number of opaque shapels in unstable blocks) / (total number of opaque shapels)   (17)
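Both grid factors, as well as the TSS, operate directly on binary alpha planes, so they can be sketched together as follows; the function names, the symbol choices, and the handling of degenerate cases (empty objects) are illustrative assumptions.

```python
import numpy as np

BLOCK = 16  # shape coding grid size used throughout the paper

def _blocks(alpha):
    """Yield the BLOCK x BLOCK tiles of a binary alpha plane."""
    h, w = alpha.shape
    for y in range(0, h, BLOCK):
        for x in range(0, w, BLOCK):
            yield alpha[y:y + BLOCK, x:x + BLOCK]

def spatial_grid_factor(alpha):
    """SGF as in (10)-(12): sqrt of the border-block ratio times the
    ratio of opaque shapels lying inside border blocks."""
    border = opaque_blocks = 0
    opaque_in_border = total_opaque = 0
    for blk in _blocks(alpha):
        n_opaque = int(blk.sum())
        total_opaque += n_opaque
        if n_opaque == 0:
            continue                # transparent block
        elif n_opaque == blk.size:
            opaque_blocks += 1      # opaque block
        else:
            border += 1             # border block
            opaque_in_border += n_opaque
    if border + opaque_blocks == 0 or total_opaque == 0:
        return 0.0
    return ((border / (border + opaque_blocks))
            * (opaque_in_border / total_opaque)) ** 0.5

def temporal_metrics(curr, prev):
    """TSS as in (14) and TGF as in (15)-(17), computed from two
    aligned binary alpha planes (current and previous VOP)."""
    changed = (curr != prev)
    tss = changed.sum() / max(int(curr.sum()), 1)
    unstable = opaque_in_unstable = 0
    nontransparent = total_opaque = 0
    for c_blk, ch_blk in zip(_blocks(curr), _blocks(changed)):
        n_opaque = int(c_blk.sum())
        total_opaque += n_opaque
        if n_opaque > 0:
            nontransparent += 1     # border or opaque block
        if ch_blk.any():
            unstable += 1           # block whose shape changed
            opaque_in_unstable += n_opaque
    if nontransparent == 0 or total_opaque == 0:
        return tss, 0.0
    tgf = ((unstable / nontransparent)
           * (opaque_in_unstable / total_opaque)) ** 0.5
    return tss, tgf
```

As a design note, a fully stable object yields TGF = 0 (and hence TSCD = 0), matching the behavior reported for the Logo object below.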
d) Evaluating the TSCD Performance: To show the performance of the temporal shape concealment difficulty metric, TSCD of (13), results are given in Table II for the objects with different temporal stabilities shown in Fig. 5.
The results in Table II show that the proposed TSCD metric behaves rather well in terms of expressing the temporal shape concealment difficulty because it ranks the Logo, Weather Girl, Cyclamen, Large Boat, Tennis Player, and Small Boat objects as increasingly harder to conceal with temporal concealment techniques. As for the dependency of the results on the spatial resolution, the expected behavior is also confirmed. For Tennis Player and Small Boat, the TSCD values remain practically unchanged when the resolution increases because the TGF parameter remains equal to 1 (all the blocks are unstable). For Logo,
TABLE III
SHAPE REFRESHMENT NEED MEASURES FOR THE STEFAN SEQUENCE OBJECTS, IN ADDITION TO THEIR SHAPE ERROR VULNERABILITY AND SCALED TEMPORAL SHAPE CONCEALMENT DIFFICULTY (TSCD′)

TABLE IV
SHAPE REFRESHMENT NEED MEASURES FOR THE WEATHER SEQUENCE OBJECTS, IN ADDITION TO THEIR SHAPE ERROR VULNERABILITY AND SCALED TEMPORAL SHAPE CONCEALMENT DIFFICULTY (TSCD′)
the TGF parameter is 0 independently of the resolution (all the blocks are stable), as is the TSS parameter (the object does not change in time), which explains the obtained TSCD value. For the remaining objects (Cyclamen, Weather Girl, and Large Boat), the TGF parameter decreases when the resolution is increased, which explains the decrease in the TSCD values. Since the highest temporal shape concealment difficulty value, TSCD, for the Small Boat object (the test object with the highest TSCD value, occurring in frame 72) is 1.9368, the scaled TSCD'_n metric will be determined, as in Section II-B1, by using (2) with TSCD_max set to 2.

C. Evaluating the SRN Performance

To show the performance of the shape refreshment need metric proposed in this paper, SRN of (1), results are given in Table III for the object shown in Fig. 5(b) and the corresponding Background, which were extracted from the Stefan sequence (QCIF at 192 kbps), and in Table IV for the object shown in Fig. 5(c) and the corresponding background (i.e., Weather Map), which were extracted from the Weather sequence (QCIF at 56 kbps). These results were obtained for 2 and 4 VPs per VOP and for two different burst error patterns: one with an average BER of 10 (Burst 1) and the other with an average BER of 10 (Burst 2). In both cases, the BER inside the burst is 0.5 and the burst length is 10 ms. In addition, the encoder will consider that temporal concealment techniques are used at the decoder, which means that the SCD_n function is simply equal to TSCD'_n.

The results in Tables III and IV show that the proposed SRN metric behaves rather well in terms of expressing the shape re-
freshment need for the conditions tested. In fact, for the Stefan sequence, the Background object has a much lower SRN than Tennis Player, meaning that the encoder should refresh the latter much more often. For the Weather sequence, however, the two objects have much closer SRN values, meaning that the encoder should refresh the Weather Girl object only slightly more often than the Weather Map object. Additionally, when the amount of errors increases, the SEV, as well as the SRN, increases, implying that the objects need to be more refreshed. When less VPs are used per VOP, the objects also need to be more refreshed (SRN increases), which is understandable because VP’s become larger and, thus, more vulnerable to errors (e.g., less resynchronization available). III. DEFINING TEXTURE REFRESHMENT NEED To eliminate the texture dependency on the past it is not necessary to refresh the entire VOP as happens for MPEG-4 shape coding [1]. This occurs because the macroblock texture coding modes are always intra coded and the decodability of the prediction error does not depend on the correctness of the texture macroblock used as prediction. Therefore, the TRN metric can be defined at the macroblock level. This way, by considering pathe most important factors impacting the TRN, the rameter measuring the TRN of macroblock in VOP , can be defined as
TRN(m, n) = TEV(m, n) · TCD(m, n) = TEV(m, n) · f(STCD_s(m, n), TTCD_s(m, n))    (18)
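As a purely illustrative sketch of the structure of (18), the following fragment combines a texture error vulnerability with a concealment difficulty. The function names are ours, and the uniform-error expression simply encodes the discard rule stated in Section III-A (a VP whose bits contain at least one error loses its whole texture); it is not copied from (4), which is not reproduced here.

```python
def texture_error_vulnerability(n_vp_bits: int, ber: float) -> float:
    """TEV under uniform errors: the VP's texture is discarded as soon as
    any one of its bits is hit, so the discard probability is one minus
    the probability that all n_vp_bits bits arrive intact."""
    return 1.0 - (1.0 - ber) ** n_vp_bits


def texture_refreshment_need(tev: float, stcd_s: float, ttcd_s: float,
                             temporal_concealment: bool = True) -> float:
    """(18): TRN(m, n) = TEV(m, n) * f(STCD_s(m, n), TTCD_s(m, n)).
    When the encoder assumes the decoder uses temporal concealment only
    (as in the experiments of Section III-C), f(.) collapses to the
    scaled temporal difficulty TTCD_s."""
    tcd = ttcd_s if temporal_concealment else stcd_s
    return tev * tcd
```

Note how the multiplicative form makes the TRN vanish whenever either factor vanishes: error-free transmission (TEV = 0) or trivially concealable texture (TCD = 0) both remove the need to refresh.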
where TEV(m, n) is the Texture Error Vulnerability (TEV), TCD(m, n) is the Texture Concealment Difficulty (TCD), STCD_s(m, n) is the scaled Spatial Texture Concealment Difficulty (STCD), and TTCD_s(m, n) is the scaled Temporal Texture Concealment Difficulty (TTCD), all for macroblock m in VOP n. Additionally, f(·, ·) is a function that combines the values of STCD_s(m, n) and TTCD_s(m, n) (both with values between 0 and 1 for the reasons explained in Section II) into a single TCD value. Here, macroblocks are counted in raster scan order starting from the top-left corner of the bounding box. TRN(m, n) is defined as a product where the first factor (i.e., the TEV) measures the statistical exposure of the texture data to errors, already considering the used bitstream structure, and the second factor (i.e., the TCD) measures how hard the texture is to recover using error concealment techniques. A product is used because the dependency of the TRN on the TEV and the TCD is similar to the dependency of the SRN on the SEV and the SCD, which was explained in Section II.

A. Defining Texture Error Vulnerability

It is proposed here that the texture error vulnerability, TEV(m, n), be measured by the probability of the texture bits of macroblock m having to be discarded due to errors in VOP n. Therefore, TEV(m, n) becomes
TEV(m, n) = P{texture of macroblock m of VOP n is discarded}    (19)

where m is the macroblock number within VOP n, and P{·} is the probability of the event. Similarly to the SEV, in order to define TEV(m, n), it is necessary to consider the structure of one VP in (the data partitioning mode of) MPEG-4, which has been detailed in Section II-A. Here, similarly to what was done for shape, a conservative recovery strategy is considered, to assume that all decoders can perform it. This recovery strategy implies that, if a given VP has at least one error in the first partition (including the shape and motion data), all the data in the VP will be discarded; moreover, if a given VP has at least one error in the second partition (conveying the texture data), the whole texture data in the VP will be discarded. Since both cases correspond to the discarding of the complete texture data in the VP, one can say that, if a given VP has at least one error, the whole texture data in the VP will be discarded; otherwise, no texture data bits will be discarded. Therefore, the probability of discarding the texture bits of a given macroblock (MB) is the same as that of discarding the entire texture in the corresponding VP.

The probability of the texture data of one macroblock being discarded can be determined for the same relevant error patterns that were used in Section II-A. Since the probability expressions to be derived here for the texture are very similar to those derived in Section II-A for the shape, only the differences between them will be stated here. The probability of the texture data of a given macroblock being discarded, for uniform errors, is given by (4) by replacing the number of shape bits considered there with the number of bits in the corresponding VP, including the resync marker at the beginning of the VP, the VP header, the shape data, the motion data, the marker at the end of the first partition, and the texture data. For burst errors, it is also given by (4), with the corresponding replacements. For packet losses, the expression is given by (6), by replacing the number of multiplex packets considered there with the number of multiplex packets starting within the considered VP, assuming that the start of the first multiplex packet (starting inside the VP) coincides with the corresponding bit in the VP.

B. Defining Texture Concealment Difficulty

As for the second factor in (18), it measures the texture concealment difficulty, TCD, which depends on the concealment techniques used at the receiver. Since these techniques can be spatial and/or temporal, the TCD in (18) is expressed as a function f(·, ·) of STCD_s(m, n) and TTCD_s(m, n), which has to be defined according to the error concealment solutions used at the decoder. Similarly to what was said for the shape concealment difficulty in Section II-B, the possible texture concealment solutions are either static or dynamic.

1) Spatial Texture Concealment Difficulty:
e) Defining STCD: The STCD(m, n) term in (18) expresses how difficult it is to recover the texture of the macroblock in question by only using spatial information (and so, no video information from past time instants). It is proposed here that the definition of the STCD metric be based on the Spatial perceptual Information (SI) metric defined in ITU-T Recommendation P.910 [11], which is computed as

SI = max_time{std_space[Sobel(F_n)]}    (20)

where F_n is the luminance component of frame n, Sobel is the Sobel filter operator, std_space is the standard deviation over the pixels of one frame, and max_time is the maximum value over time. The SI metric was developed by ITU-T to evaluate the amount of spatial information in test scenes and corresponds to an edginess measure. This measure is computed by first determining an edginess value for each frame in the sequence (the standard deviation of the Sobel-filtered image) and then choosing its maximum value to represent the whole sequence. The major difference between STCD and SI is that STCD has to be an instantaneous value for a given macroblock and not the maximum over time as SI. Therefore, (20) becomes

STCD(m, n) = std_space[Sobel(MB(m, n))]    (21)

where MB(m, n) is the luminance component of macroblock m in VOP n. In this case, the luminance space is the set of pixels in each macroblock for which there is shape support (opaque shapels). Only the luminance is used here due to its primordial relevance in terms of the Human Visual System (HVS).

f) Evaluating the STCD Performance: In order to illustrate the performance of the proposed spatial texture concealment difficulty metric, STCD of (21), results are included below for the texture macroblocks with different complexities shown in Fig. 6. They correspond to decoded macroblocks extracted from the objects Weather Girl, Tennis Player and Background (from the Stefan sequence). These macroblocks were encoded with a quantization step of 2 (the lowest possible in MPEG-4
Fig. 6. Macroblocks with different complexities: (a) and (b) macroblocks 71 and 15 of Weather Girl (frame 0, CIF); (c) macroblock 14 of Background (frame 0, CIF); and (d) macroblock 14 of Tennis Player (frame 0, CIF).
TABLE V SPATIAL TEXTURE CONCEALMENT DIFFICULTY MEASURES FOR DIFFERENT TEXTURE MACROBLOCKS.
Visual and other video coding standards) in order to guarantee that the reconstructed texture is as close as possible to the original one, preserving most of the original spatial detail (i.e., a worst-case situation in terms of spatial detail is created). Just by looking at the texture macroblocks shown in Fig. 6, an error concealment expert can rank them according to their spatial concealment difficulty, starting with (c) as the most difficult, followed by (d), (b), and finally, (a). This order should be reflected by the STCD values presented in Table V, indicating that the proposed objective metric reaches the target of automatically expressing how difficult it is to conceal a given texture macroblock with only spatial concealment techniques. The analysis of the results in Table V leads to the conclusion that the proposed STCD metric behaves well in terms of expressing the spatial texture concealment difficulty, as needed. In spite of its theoretical maximum (i.e., 127.5), it can be shown that the typical STCD values are much lower. The most textured macroblock found (in VOP 7 of the QCIF version of Stefan's Background object) reaches an STCD value of 93.9206. Therefore, to determine the scaled STCD_s metric, as in Section II-B1, (2) will be used by setting the scaling factor to 94.

2) Temporal Texture Concealment Difficulty (TTCD):
a) Defining TTCD: The TTCD(m, n) term in (18) expresses how difficult it is to conceal the texture macroblock in question by using only temporal information. This difficulty can be simply measured by the SAD between the current and the previous time instants, defined as

TTCD(m, n) = Σ |MB(m, n) − MB(m′, n − 1)|    (22)
where m is the macroblock number within VOP n, MB(m, n) is the luminance component of macroblock m of VOP n, and MB(m′, n − 1) is the luminance component of the macroblock co-located with it in VOP n − 1. The texture concealment difficulty when temporal concealment techniques are used is basically dictated by the amount of change in the texture from one time instant to the next. Therefore, a good way to measure this is by means of the differences in the luminance values corresponding to opaque shapels in the current and previous alpha planes, which can be easily done with the SAD metric. This, of course, does not mean that other metrics, such as the Mean Square Error (MSE), are not adequate. The SAD could also be computed after motion compensating the prediction macroblocks from the previous time instant. However, this would mean assuming that the decoder can use motion compensation for temporal concealment, which is not always true. The decoder may not be able to do so either because it lacks the computational power (i.e., it is a very simple decoder) or because the motion vectors are not available (e.g., because they were corrupted by errors).

b) Evaluating the TTCD Performance: In order to illustrate the performance of the proposed temporal texture concealment difficulty metric, TTCD of (22), results are shown below for the texture macroblock pairs in Fig. 7 with different amounts of temporal change. They correspond to reconstructed macroblocks extracted from the Cyclamen, Weather Girl and Tennis Player objects with a frame rate of 10 fps (the original material is at 30 fps). These macroblocks were encoded with a quantization step of 2 for the reasons already highlighted. By looking at the texture macroblock pairs displayed in Fig. 7, an error concealment expert can rank them according to their temporal concealment difficulty, starting with (d) as the hardest to conceal with temporal techniques, followed by (c), (a), and finally, (b). This order should be clearly reflected by the TTCD values, shown in Table VI, indicating that the proposed objective metric reaches the target of automatically
Fig. 7. Texture macroblocks (top) with the co-located ones in the previous VOP (bottom): (a) Cyclamen — macroblock 104 of VOP 3 and VOP 0; (b) Weather Girl — macroblock 15 of VOP 3 and VOP 0; (c) Tennis Player — macroblock 19 of VOP 99 and VOP 96; and (d) Tennis Player — macroblock 20 of VOP 99 and VOP 96. TABLE VI TEMPORAL TEXTURE CONCEALMENT DIFFICULTY MEASURES FOR DIFFERENT TEXTURE MACROBLOCKS (CIF)
express how difficult it is to conceal a given texture macroblock with temporal concealment techniques. The analysis of the results in Table VI leads to the conclusion that the proposed TTCD metric behaves well in terms of expressing the temporal texture concealment difficulty, as needed. Once again, it can be shown that the typical TTCD values are much lower than the theoretical TTCD maximum (i.e., 65 280). The most unstable texture macroblock found (in frame 98 of the CIF version of Tennis Player) reaches a TTCD value of 56 055 and, therefore, to determine the scaled TTCD_s metric, as in Section II-B1, (2) will be used by setting the scaling factor to 57 000.

C. Evaluating the TRN Performance

To show the performance of the texture refreshment need metric proposed in this paper, TRN of (18), results are given in Table VII for two different macroblocks from VOP 3 of the Background object from the Stefan sequence (QCIF at 192 kbps), when 2 and 4 VPs are used per VOP and for the same two burst error patterns used in Section II-C. One macroblock corresponds to macroblock 22 in the VOP, which is located in the (very inhomogeneous) audience part (MB1), while the other corresponds to macroblock 71 in the VOP and is located in the (homogeneous) court part (MB2). In Table VIII, results are given for two different macroblocks from VOP 3 of the Weather Girl object from the Weather sequence (QCIF at 56 kbps), for the same conditions as above.
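To make the two concealment difficulty metrics concrete, the following sketch computes (21) and (22) for a full 16 × 16 luminance macroblock (i.e., assuming all shapels are opaque) and applies the scaling factors quoted above (94 for STCD, 57 000 for TTCD). The assumption that (2) is a simple clip-and-divide normalization to [0, 1] is ours, since (2) is not reproduced in this excerpt.

```python
import numpy as np

def sobel(y):
    """Apply the horizontal and vertical Sobel operators to a luminance
    block and return the gradient magnitude (an edginess map)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = y.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            win = y[i:i + 3, j:j + 3]
            gx[i, j] = (kx * win).sum()
            gy[i, j] = (ky * win).sum()
    return np.sqrt(gx ** 2 + gy ** 2)

def stcd(mb):
    """(21): standard deviation of the Sobel-filtered macroblock luminance."""
    return float(np.std(sobel(mb.astype(float))))

def ttcd(mb_cur, mb_prev):
    """(22): SAD between a macroblock and its co-located one in the
    previous VOP."""
    return float(np.abs(mb_cur.astype(float) - mb_prev.astype(float)).sum())

def scaled(value, factor):
    """Assumed form of (2): clip-normalize a difficulty value to [0, 1]."""
    return min(value / factor, 1.0)
```

As a sanity check on the behavior described in the text, a flat macroblock yields STCD = 0 (nothing to conceal spatially) and a macroblock identical to its co-located predecessor yields TTCD = 0 (trivially concealable temporally).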
One macroblock corresponds to macroblock 7 (MB3) in the VOP and the other to macroblock 11 (MB4), both being located in the face of the Weather Girl. In both cases, the encoder considers that temporal concealment techniques are used at the decoder, which means that the f(·) function is simply equal to TTCD_s(m, n).

The results in Tables VII and VIII show that the proposed TRN metric is able to adequately express the texture refreshment need for the conditions tested here. In fact, MB2 has a much lower TRN than MB1, meaning that the encoder should refresh the latter much more often, as could be expected. As for MB3 and MB4, their TRN values are much closer, meaning that the encoder should refresh the former only slightly more often. Additionally, when the amount of errors increases, the TEV, as well as the TRN, increases, implying that the macroblocks need to be refreshed more often. When fewer VPs are used per VOP, the macroblocks also need to be refreshed more often (the TRN increases), which is understandable because VPs become larger and, thus, more vulnerable to errors (there is less resynchronization and more bits are lost each time there is an error).

IV. FINAL REMARKS

This paper proposes both shape and texture refreshment need metrics, which express the necessity of refreshing the shape and the texture data while coding a given video object. While the VOP-based shape refreshment need metric, SRN, is defined as
TABLE VII TEXTURE REFRESHMENT NEED MEASURES FOR TWO DIFFERENT MACROBLOCKS IN THE STEFAN SEQUENCE, IN ADDITION TO THEIR TEXTURE ERROR VULNERABILITY AND SCALED TEMPORAL TEXTURE CONCEALMENT DIFFICULTY
TABLE VIII TEXTURE REFRESHMENT NEED MEASURES FOR TWO DIFFERENT MACROBLOCKS IN THE WEATHER SEQUENCE, IN ADDITION TO THEIR TEXTURE ERROR VULNERABILITY AND SCALED TEMPORAL TEXTURE CONCEALMENT DIFFICULTY
a product of the shape error vulnerability and the shape concealment difficulty, the macroblock-based texture refreshment need metric, TRN, is similarly defined as a product of the texture error vulnerability and the texture concealment difficulty. In both cases, the error vulnerability measures the statistical exposure of the shape/texture data to channel errors and the concealment difficulty expresses how difficult the corrupted shape/texture data is to recover when both spatial and temporal error concealment techniques are considered. Since the results have shown that the proposed metrics are able to correctly express the necessity of refreshment, it is anticipated that they can be extremely useful in allowing the encoder to efficiently decide which video objects should have their shapes and texture macroblocks intra coded to improve the subjective impact at the decoder side. For instance, in terms of shape, the encoder can decide to increase or decrease the shape refreshment periods of the various video objects in a scene, depending on their SRN values. Similarly, in terms of texture, the encoder can distribute the available texture refreshment resources among the various video objects in a scene, depending on the average TRN values of the macroblocks in each video object, and then distribute the assigned texture refreshment resources within each VOP based on the individual TRN values of the macroblocks. This objective is extremely important, especially for emerging mobile applications, where error conditions are challenging, bandwidth resources are scarce and, thus, maximizing the quality is a must.

REFERENCES

[1] Information Technology — Coding of Audio-Visual Objects, ISO/IEC 14496, 1999.
[2] L. Ducla-Soares and F. Pereira, "Error resilience and concealment performance for MPEG-4 frame-based video coding," Signal Process.: Image Commun., vol. 14, no. 6-8, pp. 447–472, May 1999.
[3] P. Haskell and D. Messerschmitt, "Resynchronization of motion compensated video affected by ATM cell loss," in Proc. ICASSP'92, vol. 3, San Francisco, CA, 1992, pp. 545–548.
[4] J. Y. Liao and J. D. Villasenor, "Adaptive intra update for video coding over noisy channels," in Proc. Int. Conf. Image Processing (ICIP'96), vol. 3, Lausanne, Switzerland, Sept. 1996, pp. 763–766.
[5] J. Y. Liao and J. D. Villasenor, "Adaptive intra block update for robust transmission of H.263," IEEE Trans. Circuits Syst. Video Technol., vol. 10, pp. 30–35, Feb. 2000.
[6] Report on Error Conditions for Error Resilience Core Experiments, ISO/IEC JTC1/SC29/WG11 N1223, 1996.
[7] E. N. Gilbert, "Capacity of a burst-noise channel," Bell Syst. Tech. J., vol. 39, pp. 1253–1265, Sept. 1960.
[8] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257–286, Feb. 1989.
[9] Text of 15938-3 (MPEG-7 Visual FDIS), ISO/IEC JTC1/SC29/WG11 N4203, 2001.
[10] F. Mokhtarian and S. Abbasi, "Shape-based indexing using curvature scale space with affine curvature," in Proc. CBMI'99, Toulouse, France, Oct. 1999, pp. 255–262.
[11] Subjective Video Quality Assessment Methods for Multimedia Applications, ITU-T Recommend. P.910, 1996.
Luis Ducla Soares (S’98) was born in Lisbon, Portugal, in 1973. In 1996, he graduated in electrical and computer engineering from Instituto Superior Técnico (IST), Universidade Técnica de Lisboa, Portugal, where he is currently pursuing the Ph.D. degree. He is currently a Teaching Assistant at the Information Science and Technology Department of Instituto Superior de Ciências do Trabalho e da Empresa (ISCTE), Portugal, and a Member of Research Staff of the Telecommunications Institute, Portugal. His research interests include object-based error resilience as well as video coding.
Fernando Pereira (M'90–SM'99) was born in Vermelha, Portugal, in 1962. He graduated in electrical and computer engineering from Instituto Superior Técnico (IST), Universidade Técnica de Lisboa, Portugal, in 1985. He received the M.Sc. and Ph.D. degrees in electrical and computer engineering from IST, in 1988 and 1991, respectively. He is currently a Professor in the Electrical and Computer Engineering Department of IST. He is responsible for the participation of IST in many national and international research projects. He is a member of the editorial board and Area Editor on image/video compression of Signal Processing: Image Communication. He is a member of the scientific committees of several international conferences. He has contributed more than 100 papers to journals and international conferences. His current areas of interest are video analysis, processing, coding and description, and multimedia interactive services. Dr. Pereira won the 1990 Portuguese IBM Award and an ISO Award for Outstanding Technical Contribution for his participation in the development of the MPEG-4 Visual standard, in October 1998. He is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE TRANSACTIONS ON IMAGE PROCESSING, and the IEEE TRANSACTIONS ON MULTIMEDIA. He has been participating in the work of ISO/MPEG for many years, notably as the head of the Portuguese delegation, and chairing many ad hoc groups related to the MPEG-4 and MPEG-7 standards.