Fast Motion Compensated Temporal Interpolation for Video

Chi-Kong Wong* and Oscar C. Au**
Department of Electrical and Electronic Engineering
The Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong
Email: [email protected]* and [email protected]**

Abstract

Recently, the MPEG-4 committee was formed to study very-low-bit-rate (VLBR) video coding for applications such as videotelephony. In this paper, we propose a possible postprocessing technique for VLBR coding. In videophone applications, temporal subsampling is a simple technique which can be combined with other compression schemes to achieve a very large compression ratio, so as to satisfy the VLBR requirement. As a result, however, object motions tend to be jerky and disturbing to the human eye. To smooth out object motions, we propose a postprocessing technique, motion compensated temporal interpolation (MCTI), to increase the instantaneous decoder frame rate. In MCTI, block-based exhaustive motion search is used to establish temporal association between two reconstructed frames. Both forward and backward searches are used to account properly for uncovered and newly covered areas. With MCTI, we show that one or more frames can be interpolated with acceptable visual quality. After showing the feasibility of MCTI, we propose a fast algorithm, FMCTI, with reduced computation requirement and negligible performance degradation.

1. INTRODUCTION

In video conferencing applications, the bit rate for sending video sequences must be kept low due to the limited channel bandwidth. This is especially stringent in videotelephony applications, for which the target bit rate is considerably less than 64 kbit/s. For this reason, the Moving Picture Experts Group has formed the MPEG-4 standard committee to study very-low-bit-rate (VLBR) video compression techniques for bit rates as low as 10 or 20 kbit/s. Many approaches are being explored by researchers worldwide, including model-based coding, fractal-based coding, segmentation-based coding, transform-based coding and postprocessing techniques. In this paper, we propose a postprocessing technique for VLBR coding which may be useful for MPEG-4.

To satisfy the VLBR requirement of telephone channels, temporal subsampling is a simple technique which may be combined with other compression schemes such as CCITT H.261 to achieve a very large compression ratio. In other cases, temporal subsampling can occur naturally. During a video conferencing session, large or frequent object motions usually result in a larger-than-average bit rate which the channel may not be able to handle. This can cause a large backlog in the transmission buffer, forcing the encoder to lower the instantaneous frame rate in order to clear the backlog. This is effectively instantaneous temporal subsampling. In either case, the skipped frames need to be reconstructed at the receiver.

It is well known that simple frame reconstruction techniques such as frame repetition or linear interpolation introduce disturbing artifacts [1]. Frame repetition generates jerky object motions because object movements are simply not accounted for. Linear interpolation by temporal filtering exhibits blurring in the moving areas: object motions are again not considered, so pixel values from different object regions are mixed in the interpolation, blurring the object region boundaries.
Object motions must be compensated in order to remove these artifacts. In this paper, we propose motion compensated temporal interpolation (MCTI) to reconstruct the skipped frames with considerably fewer artifacts. In MCTI, we compensate for object motions by tracking the objects between adjacent received frames. Knowing the trajectory of each object, we can place the object at the appropriate location in the interpolated frames. We use block-based motion estimation to establish a blockwise association between each pair of adjacent received frames. The criterion for block matching is the mean absolute difference (MAD). Both forward and backward motion estimation are performed to account for the uncovered regions and the newly covered regions; these regions can be found in only one, not both, of the received frames. The collection of all the motion vectors defines a motion field which is used to set up a database for each interpolated frame. The database is then used to construct the interpolated frames. The feasibility of MCTI is verified by simulation using two video conferencing test sequences.

One disadvantage of MCTI is the huge computation requirement of the forward and backward exhaustive motion searches, which makes it impractical. With the feasibility of MCTI verified, we propose fast motion compensated temporal interpolation (FMCTI), a modified MCTI with much reduced computation. The exhaustive motion search in MCTI is replaced by selective motion estimation. A simple scheme is used to determine whether each block is stationary. Computation is reduced by performing forward motion estimation on the nonstationary blocks only, and backward motion estimation only on the blocks with poor forward motion estimation. The computation of the unidirectional motion estimation is further reduced by pixel decimation and search area subsampling. We show by simulation that FMCTI has visual quality similar to MCTI.

2. ALGORITHMS

Here we assume that the object motions are translational and slow enough that the motions are approximately linear over time among temporally adjacent frames. We also assume that there is no camera zooming, which is usually the case in video conferencing applications. In addition, we assume that the N × N blocks used are small compared with the object sizes. With these assumptions, we can use block motion estimation to establish object motions between adjacent received frames, and with the object trajectories tracked, the objects can be placed at the appropriate locations. For simplicity, the criterion for block motion estimation is chosen to be the mean absolute difference (MAD). For the present k-th frame, we denote the intensity value of the pixel with coordinates (i, j) by f_k(i, j). We refer to a block of M × N pixels by the coordinate (x, y) of its upper left corner. The MAD between the block at (x, y) of the present received frame and the block at (x + m, y + n) of the previous received frame can then be calculated as

    MAD_{(x,y)}(m,n) = \frac{1}{N^2} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left| f_k(x+i, y+j) - f_{k-1}(x+m+i, y+n+j) \right|

The best-match block is defined as the one that minimizes MAD_{(x,y)}(m,n) over all locations (m, n) within a certain search area.
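As an illustration, the exhaustive MAD search just described can be sketched in a few lines of Python. This is our own sketch, not the authors' implementation; the function names, the (x, y) = (column, row) convention, and the row-major NumPy frame layout are assumptions:

```python
import numpy as np

def mad(block_a, block_b):
    """Mean absolute difference between two equal-sized blocks."""
    return np.mean(np.abs(block_a.astype(np.int32) - block_b.astype(np.int32)))

def full_search(prev, curr, x, y, N=16, W=16):
    """Exhaustively search a (2W+1)x(2W+1) area of `prev` for the best
    match to the NxN block of `curr` with upper-left corner (x, y).
    Returns the displacement (m, n) minimizing the MAD, and that MAD."""
    H, Wd = prev.shape
    ref = curr[y:y+N, x:x+N]
    best, best_mn = float("inf"), (0, 0)
    for n in range(-W, W + 1):        # vertical displacement
        for m in range(-W, W + 1):    # horizontal displacement
            xx, yy = x + m, y + n
            if 0 <= xx and xx + N <= Wd and 0 <= yy and yy + N <= H:
                d = mad(ref, prev[yy:yy+N, xx:xx+N])
                if d < best:
                    best, best_mn = d, (m, n)
    return best_mn, best
```

With the paper's parameters N = W = 16, each block costs (2W+1)² = 1089 MAD evaluations, which is the expense FMCTI later reduces.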

2.1 Motion-Compensated Temporal Interpolation (MCTI)

For any k, our goal is to generate a frame to be inserted between the (k-1)-th and k-th frames so that object motion appears smoother between adjacent frames. We divide the (k-1)-th frame, the k-th frame and the inserted frame into blocks of size N × N. Firstly, we perform forward motion estimation. For a block B1 located at (x, y) in the (k-1)-th frame, we define a search area of size (2W+1) × (2W+1) in the k-th frame and perform an exhaustive motion search with the MAD as the distortion measure. If the best match is the block B2 located at (x+dx, y+dy) in the k-th frame, then B2 should also appear at (x+dx/2, y+dy/2) in the inserted frame due to the assumed linear translational motion. However, the N × N block B3 at (x+dx/2, y+dy/2) in the inserted frame usually would not fit into the block grid exactly; instead it would usually cover four N × N blocks C1, C2, C3 and C4. To handle this, we set up a list of motion vector candidates for each N × N block in the inserted frame. For the case discussed here, the forward motion vector (dx/2, dy/2) is added to the candidate lists of all of C1, C2, C3 and C4, together with the area of overlap of B3 with each of these blocks. For any block in the inserted frame, the motion vector candidates with larger overlapping areas (or shorter distances) should be more reliable than those with smaller overlapping areas (longer distances). The idea is shown in figure 1.

Next, we perform backward motion estimation. For each block B1 located at (x, y) in the k-th frame, a search area is defined in the (k-1)-th frame and the same exhaustive search is performed. If the best-match block B2 is at (x+dx, y+dy) of the (k-1)-th frame, the backward motion vector (dx/2, dy/2) is added to the candidate lists of the blocks covered by the N × N block B3 located at (x+dx/2, y+dy/2) in the inserted frame.

Figure 1. Frame interpolation by forward motion estimation.

After the forward and backward motion estimation, each block C in the inserted frame has a list of candidate motion vectors. To choose the best motion vector for a block C located at (i, j) in the inserted frame, we pick the candidate with the maximum associated overlapping area; this is the motion vector whose associated block B3 is closest to block C. If the best candidate (vx, vy) is a forward motion vector, block C is estimated by averaging the blocks located at (i+vx, j+vy) in the k-th frame and at (i-vx, j-vy) in the (k-1)-th frame. We assume that the residue R (figure 1) of the block B3 associated with C shares the same motion information as B3, since the pixels inside the residue R are very close to the pixels inside B3. However, if the block at (i+vx, j+vy) in the k-th frame (or at (i-vx, j-vy) in the (k-1)-th frame) goes beyond the image boundary, the block located at (i-vx, j-vy) in the (k-1)-th frame (or at (i+vx, j+vy) in the k-th frame) is used instead. Similarly, if (vx, vy) is a backward motion vector, block C is estimated by averaging the blocks located at (i+vx, j+vy) in the (k-1)-th frame and at (i-vx, j-vy) in the k-th frame; and if the block at (i+vx, j+vy) in the (k-1)-th frame (or at (i-vx, j-vy) in the k-th frame) is not completely within the image boundary, the block at (i-vx, j-vy) in the k-th frame (or at (i+vx, j+vy) in the (k-1)-th frame) is used. Once we have the motion field (vx, vy) of the inserted frame, the frame can be interpolated.
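The candidate-selection and averaging rules can be illustrated with the following sketch. The helpers `grab` and `interpolate_block` and the candidate-list representation (tuples of vector, overlap area, and direction) are our own assumptions for illustration, not the authors' code:

```python
import numpy as np

def grab(frame, x, y, N):
    """Return the NxN block with upper-left (x, y), or None if it
    crosses the image boundary."""
    H, W = frame.shape
    if 0 <= x and x + N <= W and 0 <= y and y + N <= H:
        return frame[y:y+N, x:x+N]
    return None

def interpolate_block(prev, curr, i, j, cands, N=16):
    """Estimate the NxN inserted-frame block at upper-left (i, j) from its
    candidate list `cands` of tuples (vx, vy, overlap, dirn), with dirn
    'fwd' or 'bwd'.  The candidate with the largest overlap wins; a block
    falling outside the image is replaced by the block from the other
    frame, as described in section 2.1."""
    vx, vy, _, dirn = max(cands, key=lambda c: c[2])
    # forward vector: (i+v) indexes frame k and (i-v) indexes frame k-1;
    # for a backward vector the roles of the two frames are swapped
    a, b = (curr, prev) if dirn == 'fwd' else (prev, curr)
    blk_a = grab(a, i + vx, j + vy, N)
    blk_b = grab(b, i - vx, j - vy, N)
    if blk_a is None:
        return blk_b
    if blk_b is None:
        return blk_a
    return ((blk_a.astype(np.uint16) + blk_b) // 2).astype(np.uint8)
```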

2.2 Interpolation of M Successive Frames by MCTI

By the same argument as above, we can insert M frames between the (k-1)-th and k-th frames; each block of each of the M inserted frames will have a candidate list. Let the inserted frames be numbered such that the 1st and the M-th are next to the (k-1)-th and k-th frames respectively. The forward and backward motion estimation between the (k-1)-th frame and the k-th frame are performed as described in section 2.1. For forward motion estimation, if the block B1 located at (x, y) in the (k-1)-th frame is best matched to the block B2 located at (x+dx, y+dy) in the k-th frame, then B2 is mapped to the block B3 located at (x + i·dx/(M+1), y + i·dy/(M+1)) in the i-th inserted frame. The forward motion vector ((M-i+1)·dx/(M+1), (M-i+1)·dy/(M+1)) is added to the candidate lists of the four blocks covered by B3 in the i-th inserted frame. Similarly, for backward motion estimation, if the block B1 located at (x, y) in the k-th frame is best matched to the block B2 located at (x+dx, y+dy) in the (k-1)-th frame, then B2 is mapped to the block B3 located at (x + (M-i+1)·dx/(M+1), y + (M-i+1)·dy/(M+1)) in the i-th inserted frame, and the backward motion vector (i·dx/(M+1), i·dy/(M+1)) is added to the candidate lists of the four blocks covered by B3 in the i-th inserted frame. The procedure to find the best candidate for each block of each inserted frame is the same as in section 2.1.
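The position and vector-scaling rules for the forward case can be condensed into a small helper; the function name and the use of float positions are ours, for illustration only:

```python
def mapping_for_inserted_frame(x, y, dx, dy, i, M):
    """Section 2.2's forward mapping: given block B1 at (x, y) in frame
    k-1 matched to B2 at (x+dx, y+dy) in frame k, return B3's position in
    the i-th of M inserted frames and the forward motion vector added to
    the candidate lists of the blocks B3 covers."""
    s = i / (M + 1)                     # temporal position of inserted frame i
    pos = (x + s * dx, y + s * dy)      # where B3 lands
    t = (M - i + 1) / (M + 1)           # remaining motion toward frame k
    vec = (t * dx, t * dy)              # forward motion vector
    return pos, vec
```

For M = 3 and i = 1 (the inserted frame next to frame k-1), a displacement (dx, dy) = (8, 4) lands B3 at (x+2, y+1) with forward vector (6, 3), consistent with the 1/(M+1) spacing.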

2.3 Fast Motion Compensated Temporal Interpolation (FMCTI)

The motion search used in the algorithm described in section 2.1 is an exhaustive full search, which has a huge computation requirement; therefore it is not suitable for real-time applications. In this section, three modules are added to the algorithm which reduce the computation requirement of MCTI by more than two orders of magnitude with negligible performance degradation. They are: 1. classification of stationary and non-stationary blocks, 2. pixel decimation, and 3. search area subsampling.

2.3.1 Classification of Stationary and Non-stationary Blocks

Before we perform forward motion estimation, we check for stationary blocks. For any block B1 located at (x, y) in the (k-1)-th frame, we compute the MAD between B1 and the corresponding block B2 located at (x, y) in the k-th frame. If the MAD is less than a threshold T1, the block is declared stationary. By classifying blocks as stationary, we avoid performing the computationally intensive motion estimation on all stationary blocks. If the MAD is greater than or equal to T1, we perform forward motion estimation. After we have found the best-match block B3, located at (x+dx, y+dy) in the k-th frame, we check the MAD between the blocks B1 and B3. If this MAD is larger than or equal to a certain threshold T2, it may be a poor match, suggesting that B1 may be covered in the k-th frame. We thus perform backward motion estimation.
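The selective search logic of this subsection can be sketched as follows, assuming a `search(ref, tgt, x, y)` routine that returns the best displacement and its MAD (the naming and the returned tag strings are ours, not the paper's):

```python
import numpy as np

def mad(a, b):
    """Mean absolute difference between two equal-sized blocks."""
    return np.mean(np.abs(a.astype(np.int32) - b.astype(np.int32)))

def selective_estimate(prev, curr, x, y, N, T1, T2, search):
    """Selective motion estimation of section 2.3.1.  Stationary blocks
    (MAD < T1 at zero displacement) skip motion estimation entirely; a
    poor forward match (MAD >= T2) triggers backward estimation."""
    d0 = mad(prev[y:y+N, x:x+N], curr[y:y+N, x:x+N])
    if d0 < T1:
        return 'stationary', (0, 0)
    mv, d = search(prev, curr, x, y)     # forward motion estimation
    if d < T2:
        return 'forward', mv
    mv, _ = search(curr, prev, x, y)     # backward motion estimation
    return 'backward', mv
```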

2.3.2 Pixel Decimation

When matching a block from the present frame to a block from the previous frame, the matching criterion is usually evaluated using every pixel of the block. Since block matching is based on the assumption that all pixels in a block move by the same amount, a good estimate of the motion can, in principle, be obtained using only a fraction of the pixels in a block without reducing the accuracy of the motion estimation. Subsampling of pixel blocks as a means of reducing the computational complexity of block-matching algorithms has been previously reported by Liu and Zaccarin [2]. Figure 2 shows the pattern of a 16 × 16 block of pixels used in evaluating the MAD; only half of the pixels in each block are used. A 4-to-1 subsampling ratio was also studied, but it produces a blocky interpolated frame, probably due to the lack of pixel information in the motion estimation. Using the pattern shown in figure 2 to evaluate the MAD, we obtain a reduction in computation by a factor of 2.
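A sketch of decimated matching follows. The checkerboard (quincunx) mask is one plausible half-of-the-pixels pattern; the exact pattern of figure 2 is an assumption here:

```python
import numpy as np

N = 16
# Checkerboard mask keeping exactly half of the 16x16 pixels -- one
# plausible realization of the figure-2 pattern (an assumption).
MASK = (np.add.outer(np.arange(N), np.arange(N)) % 2) == 0

def mad_decimated(block_a, block_b):
    """MAD evaluated over only the masked pixels, halving the per-block
    matching cost relative to a full-block MAD."""
    diff = np.abs(block_a.astype(np.int32) - block_b.astype(np.int32))
    return diff[MASK].mean()
```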

Figure 2. Pattern of pixels used in pixel decimation.

2.3.3 Search Area Subsampling

To reduce the amount of computation needed for the motion search, search area subsampling is another way to save work [3]. Here we perform motion estimation with a 4-to-1 search area subsampling technique. Since not all locations within the search area are examined during motion estimation, the "best" match found may not be good enough: the true best location within the search area may lie somewhere we do not search, resulting in a loss of accuracy in the motion vectors. To alleviate this problem, we perform a neighbourhood search around the "best" location. Figure 3 shows the pattern of 4-to-1 search area subsampling: in a (2W + 1) × (2W + 1) search area, we compare only the shaded locations. After we find the "best" match location, we perform a neighbourhood search around it to refine the result. Using this 4-to-1 search area subsampling as shown in figure 3, we reduce the computation requirement by a factor of 4.

Figure 3. Pattern of 4 to 1 search area subsampling.
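The two-stage search can be sketched as below, taking `cost(m, n)` as the (possibly decimated) MAD at displacement (m, n). The regular every-other-location grid is our reading of figure 3, not a reproduction of it:

```python
def subsampled_search(cost, W):
    """Two-stage motion search over a (2W+1)x(2W+1) area: evaluate
    cost(m, n) only at every other location in each direction (4-to-1
    subsampling), then refine with a +/-1 neighbourhood search around
    the coarse winner."""
    best, best_mn = float('inf'), (0, 0)
    for n in range(-W, W + 1, 2):          # coarse grid, step 2
        for m in range(-W, W + 1, 2):
            c = cost(m, n)
            if c < best:
                best, best_mn = c, (m, n)
    cm, cn = best_mn                       # snapshot of the coarse winner
    for dn in (-1, 0, 1):                  # one-step local refinement
        for dm in (-1, 0, 1):
            m, n = cm + dm, cn + dn
            if abs(m) <= W and abs(n) <= W:
                c = cost(m, n)
                if c < best:
                    best, best_mn = c, (m, n)
    return best_mn, best
```

Note that the refinement makes the result locally, not globally, optimal over the full search area, which is the accuracy trade-off discussed above.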

3. SIMULATION RESULTS

The proposed algorithms are simulated on the luminance component of the "Miss America" and "Salesman" sequences, which are in CIF (288 × 352) format. The parameters W and N are both 16. Figures 4 and 5 show the results from MCTI and FMCTI respectively. For the "Miss America" sequence, three (M = 3) frames are inserted between the 25th and 29th frames. The 26th and 28th frames shown in figures 4 and 5 are found to be similar to the corresponding original frames.

In both sequences, the maximum blockwise MAD, evaluated using all the pixels in the (N × N) blocks, between any two consecutive frames is around 30. So we choose the thresholds T1 and T2 to be 14 and 9 respectively. If T1 is considerably larger than 14 (e.g. T1 = 17), there are many blocking artifacts and block mismatches in the reconstructed images for both sequences. The reason for choosing T2 less than T1 is to reduce the possibility of block mismatch after the forward search. Using these thresholds, unidirectional motion estimation in FMCTI is performed on an average of 21 and 18 blocks per frame for the "Salesman" and "Miss America" sequences respectively. We also obtain an additional computation reduction by a factor of 8 using pixel decimation and search area subsampling. The average computation requirements of FMCTI for interpolating a frame are shown in Table 1.

Table 1: Computation Requirements of FMCTI

    FMCTI                                                       Salesman    Miss America
    Average No. of Blocks Needed for Search                     21          18
    Average No. of Additions Needed for Interpolating a Frame   1.5 × 10^6  1.3 × 10^6

Figure 4. “Salesman” Sequence.

Figure 5. "Miss America" Sequence.

For MCTI, since the motion estimation is based on the exhaustive full search, all of the blocks need to be searched. The number of additions required for interpolating a frame is around 4.4 × 10^8 for both sequences. By comparing the number of additions of the two methods, we can see that FMCTI reduces the computation requirement of MCTI by more than two orders of magnitude with negligible performance degradation. Figures 6 and 7 show the Peak Signal-to-Noise Ratio (PSNR) of the first 100 frames of both the "Salesman" and "Miss America" sequences for MCTI and FMCTI. Assuming the range of the pixel values is from 0 to 255, PSNR is given by

    PSNR = -10 \log_{10} \frac{ \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} \left( f(x,y) - \tilde{f}(x,y) \right)^2 }{ M \times N \times 255^2 }

where f(x, y) and \tilde{f}(x, y) are the pixel intensities of the original frame and the estimated frame of size M × N respectively.
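The PSNR above is the negative log of the squared error normalized by M × N × 255², which is equivalent to the conventional 10·log10(255²/MSE) form computed below (a sketch, not the authors' code):

```python
import numpy as np

def psnr(orig, est):
    """PSNR in dB for 8-bit frames, equivalent to the formula above."""
    diff = orig.astype(np.float64) - est.astype(np.float64)
    mse = np.mean(diff ** 2)
    # identical frames give infinite PSNR (zero error)
    return float('inf') if mse == 0.0 else 10.0 * np.log10(255.0 ** 2 / mse)
```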

Figure 6. PSNR of the first 100 frames of “Salesman” Sequence.

Figure 7. PSNR of the first 100 frames of "Miss America" Sequence.

For both sequences, FMCTI performs as well as MCTI except for the frames around the 10th and 60th frames of the "Salesman" sequence. Since there is large motion and some parts of those images are blurred (the left hand of the salesman around the 10th frame), the block matching algorithm using MAD as a measure of the match between two blocks is not very applicable there. In FMCTI, we use pixel decimation and search area subsampling to reduce the computational complexity of the search, but we also lose some information during the search. Hence the PSNR of FMCTI is worse than that of MCTI around the 10th and 60th frames of the "Salesman" sequence. For small and moderate motion, FMCTI reduces the computational complexity of MCTI by more than two orders of magnitude with negligible performance degradation.

4. CONCLUSION

In this paper, we present a postprocessing method to interpolate the skipped frames of a video sequence at the receiver, together with a fast version of the original algorithm which reduces the computation requirement with negligible performance degradation. The major problem associated with temporal frame interpolation is handling moving objects: some areas are uncovered in the present frame while other areas are about to be covered in the next frame. By using both forward and backward motion searches to find the motion vector of each block, this problem can be solved. For a video sequence in CIF format, using 16 × 16 blocks and a 33 × 33 search area, we can interpolate a frame with around 1.5 × 10^6 additions using FMCTI, which is much less than the number of additions needed by MCTI. Using pixel decimation and search area subsampling, together with carefully chosen thresholds, we reduce the computation complexity of MCTI with negligible performance degradation.

5. REFERENCES

1. H. G. Musmann, P. Pirsch, and H.-J. Grallert, "Advances in picture coding," Proc. of the IEEE, vol. 73, no. 4, pp. 523-548, Apr. 1985.
2. B. Liu and A. Zaccarin, "New algorithms for the estimation of block motion vectors," IEEE Trans. on Circuits and Systems for Video Technology, vol. 3, no. 2, pp. 148-157, Apr. 1993.
3. Y. H. Fok and O. C. Au, "A fast block matching algorithm in feature domain," Proc. of IEEE Workshop on Visual Signal Processing and Communications, Melbourne, pp. 199-202, Sept. 1993.