MRF-BASED TRUE MOTION ESTIMATION USING H.264 DECODING INFORMATION

Yung-Lin Huang¹, Yi-Nung Liu², and Shao-Yi Chien²

Media IC and System Lab
Graduate Institute of Networking and Multimedia¹
Graduate Institute of Electronics Engineering and Department of Electrical Engineering²
National Taiwan University
MD-726, 1, Sec. 4, Roosevelt Rd., Taipei 106, Taiwan

ABSTRACT

Markov Random Field (MRF) models have been used successfully to formulate energy minimization problems in computer vision. However, a multi-label MRF model such as the one underlying conventional true motion estimation requires a significant amount of computation due to its large search space. We observe that decoding information obtained from H.264/AVC can be applied to reduce the computational complexity of true motion estimation. In this paper, a new true motion estimation scheme is proposed. We analyze the motion information and macroblock types from the H.264/AVC decoder. According to this decoding information, predictors are selected from the obtained motion vectors (MVs) for the MRF model. With these predictors, the search space of the MRF is reduced from O(n^2) to O(n) compared with the conventional full search scheme. Experimental results on the Middlebury optical flow benchmarks show that the proposed scheme is able to optimize the MV field from the H.264/AVC decoder to approximate the true motion field.

Fig. 1. Markov Random Field for the true motion estimation problem. Each y represents a pixel in the video frame, and each x represents the motion vector of the corresponding y.

Index Terms— Markov Random Field, belief propagation, optical flow, true motion estimation, H.264/AVC decoder

1. INTRODUCTION

Motion estimation (ME) is a key technique in video coding, and many ME frameworks have been proposed in recent years. Conventional ME in a coding system tends to find the corresponding area with the lowest residual rather than the true motion trajectory in the video sequence. However, in some video applications, e.g., tracking, de-interlacing, and frame interpolation, a true MV field is preferable. Therefore, true motion estimation (TME) schemes, which emphasize estimating accurate MVs for all of the objects in the video sequence, have been proposed [1][2][3]. In the frame rate up-conversion (FRUC) application, performance is affected significantly by the MV field. In order to integrate TME into current systems at lower cost, [4] and [5] re-estimate MVs derived from H.264/AVC according to the decoding information. Nevertheless, these methods are largely heuristic.

Instead of a heuristic approach, TME can be formulated as a pixel-labeling problem, as depicted in Fig. 1. The pixel-labeling problem, which assigns each pixel a label, can be justified in terms of maximum a-posteriori estimation of an MRF model. This model has been used in vision problems for several years [6]. The optimal labels {l_p} of the pixels are estimated as

$$
\{l_p\} = \arg\min_{\{l_p\}} \Big\{ \sum_{p \in P} E_d(l_p) + \sum_{(p,q) \in N} E_s(l_p, l_q) \Big\}
\tag{1}
$$

where E_d is the data term that measures the penalty between the labels and the data, E_s is the smoothness term that penalizes incoherence between neighboring labels, P is the set of all pixels, and N is the neighborhood relation, e.g., the 4-nearest-neighbor pixels.

In this paper, an MRF model for TME with H.264 decoding information is proposed. To minimize the energy of the MRF, belief propagation [7] is adopted because of its potential for hardware implementation [8]. The rest of this paper is organized as follows. The proposed algorithm is described in Sec. 2, and experimental results are presented in Sec. 3. Finally, a short conclusion is given in Sec. 4.
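To make Eq. (1) concrete, the following toy sketch (our illustration, not code from the paper; the random data costs and the squared-difference smoothness term are placeholders, while the paper's actual terms are defined in Sec. 2.4) evaluates the energy of a candidate label field over a 4-neighborhood grid.

```python
import numpy as np

def mrf_energy(labels, data_cost, smooth_weight=1.0):
    """Evaluate the Eq. (1) energy of a label field.

    labels:    (H, W) integer label per pixel.
    data_cost: (H, W, L) penalty of assigning each of L labels to each pixel.
    The smoothness term here is a squared label difference, one common choice.
    """
    H, W = labels.shape
    ys, xs = np.mgrid[0:H, 0:W]
    energy = data_cost[ys, xs, labels].sum()          # sum of E_d(l_p)
    # 4-neighborhood smoothness: right and down pairs cover every edge once.
    dh = (labels[:, 1:] - labels[:, :-1]) ** 2        # horizontal pairs
    dv = (labels[1:, :] - labels[:-1, :]) ** 2        # vertical pairs
    energy += smooth_weight * (dh.sum() + dv.sum())   # sum of E_s(l_p, l_q)
    return energy

# Toy usage: 3 labels on a 4x4 grid with random data costs.
rng = np.random.default_rng(0)
cost = rng.random((4, 4, 3))
labels = cost.argmin(axis=2)   # greedy start: best data term per pixel
print(mrf_energy(labels, cost))
```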


Fig. 2. Flow diagram of the proposed algorithm. The input video passes through the H.264 encoder and decoder; then, starting at N = 16 (N×N means both the width and height of a block are N, with different strategies at different N), each scale performs N×N MV pre-processing and N×N predictor selection, computes the N×N message for each predictor, and runs N×N simplified belief propagation while iter < MAX_ITER. The N×N MV field is then assigned and, while N > 4, N is halved and the loop repeats, finally producing the 4×4 MV field.

Fig. 3. (a) The correspondence between color and MV: each color represents the MV of one pixel, and white at the center represents the zero MV. (b) MV analysis of the H.264-coded MV field: a similarity check tests whether MV_GT(x, y) of the ground-truth MV field exists among the JM-coded MVs MV_JM(x', y') within ±PSR of (x, y).

2. PROPOSED ALGORITHM

The framework of the proposed TME is illustrated in Fig. 2. First, decoding information from H.264 is used to determine the candidate MVs. Second, predictors are selected from the candidate MV map. Finally, these predictors are taken as the initial inputs of the MRF model and iteratively optimized by belief propagation. Multi-scale block-based operations are used to make the iterative optimization converge faster.
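The control flow of Fig. 2 can be summarized by the following skeleton (ours, not the paper's code; the three stage routines are passed in as callables and sketched in Secs. 2.2 to 2.4):

```python
def true_motion_estimation(mv_init, preprocess, select_predictors,
                           bp_optimize, max_iter=5):
    """Multi-scale loop of Fig. 2, from N = 16 down to N = 4.

    mv_init is the block MV field decoded from H.264; the three stage
    functions implement pre-processing, predictor selection, and one
    simplified belief-propagation pass, respectively.
    """
    n = 16
    mvf = mv_init
    while True:
        mvf = preprocess(mvf, n)              # NxN MV pre-processing
        preds = select_predictors(mvf, n)     # NxN predictor selection
        for _ in range(max_iter):             # iterate while iter < MAX_ITER
            mvf = bp_optimize(mvf, preds, n)  # messages + simplified BP
        if n <= 4:                            # finest scale reached
            return mvf                        # final 4x4 MV field
        n //= 2                               # descend one scale
```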

2.1. Motion Vector Analysis

Before describing the proposed algorithm, the MVs derived from the H.264 decoder are analyzed. The experimental environment is built on the H.264/AVC reference software, Joint Model (JM). The optical flow datasets provided by the Middlebury website (http://vision.middlebury.edu/flow/) [9] are used because ground truth (GT) MV maps are provided for them. Given the GT MVs, similar MVs in the H.264-coded MV map are searched within a predictive search range (PSR), as in Fig. 3. To determine whether a GT MV and an H.264-coded MV are similar, the similarity check is performed as follows:

$$
SimilarMV(x, y) =
\begin{cases}
1, & \text{if } MVD_x(MV_{JM}(x', y'), MV_{GT}(x, y)) < TH_x \text{ and } MVD_y(MV_{JM}(x', y'), MV_{GT}(x, y)) < TH_y \\
   & \text{for some } x - PSR < x' < x + PSR,\ y - PSR < y' < y + PSR, \\
0, & \text{otherwise}
\end{cases}
\tag{2}
$$

where MVD_x and MVD_y are the differences of the MVs in the x and y directions, respectively. After the similarity check, the existence of true MV can be calculated as

$$
\frac{\sum_{(x,y)} SimilarMV(x, y)}{W \times H} \times 100\%
$$

where W and H are the width and height of the test sequence, respectively.

The experimental results are shown in Fig. 4. Both TH_x and TH_y are set to 1, and the PSR ranges from 0 to 64. For each sequence, the existence of true MV is calculated on three H.264-coded MV maps obtained with different ME strategies: fast full search (FastFS), full search (FS), and enhanced predictive zonal search (EPZS). Fig. 4(a) shows the MV analysis of the Urban3 sequence; this sequence has the lowest existence of true MV among the eight test sequences, yet the difference between FastFS and EPZS is at most 5.4%. The MV analysis of the eight optical flow sequences using the FastFS strategy is shown in Fig. 4(b). With higher PSR, the existence of true MV becomes higher, and most sequences reach 100% at a small PSR.

From these results we make two observations. First, with respect to the existence of true MVs, the ME strategy chosen in the H.264 encoder (FastFS, FS, or EPZS) has little effect. Second, although the MV field of the H.264 decoder is generated by conventional ME, which focuses on removing temporal redundancy, MVs following the true motion trajectory still exist in the H.264-coded MV field. Consequently, the proposed algorithm aims to re-estimate the true MV field from the H.264 decoding information, and because of these observations it can approximate the true MV field regardless of the ME strategy used in the encoder.
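The analysis above transcribes directly into code. The sketch below (ours; it assumes both MV maps are dense per-pixel (H, W, 2) arrays, whereas the JM-coded map is in practice piecewise constant over blocks, and it includes the ±PSR window boundaries, a detail Eq. (2) leaves open) computes Eq. (2) and the existence percentage.

```python
import numpy as np

def similar_mv(mv_jm, mv_gt, x, y, psr, thx=1.0, thy=1.0):
    """Eq. (2): 1 if any JM-coded MV within +/-PSR of (x, y) differs from
    the ground-truth MV at (x, y) by less than (TH_x, TH_y), else 0."""
    H, W, _ = mv_jm.shape
    x0, x1 = max(0, x - psr), min(W, x + psr + 1)   # clamp window to frame
    y0, y1 = max(0, y - psr), min(H, y + psr + 1)
    window = mv_jm[y0:y1, x0:x1]
    dx = np.abs(window[..., 0] - mv_gt[y, x, 0])    # MVD_x over the window
    dy = np.abs(window[..., 1] - mv_gt[y, x, 1])    # MVD_y over the window
    return int(np.any((dx < thx) & (dy < thy)))

def true_mv_existence(mv_jm, mv_gt, psr):
    """Percentage of pixels whose true MV exists in the coded MV map."""
    H, W, _ = mv_gt.shape
    hits = sum(similar_mv(mv_jm, mv_gt, x, y, psr)
               for y in range(H) for x in range(W))
    return 100.0 * hits / (W * H)
```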

Fig. 4. Experimental results of MV analysis; the x-axis is the PSR and the y-axis is the existence of true MV (%). (a) MV analysis of three ME strategies (FastFS, FS, EPZS) on Urban3. (b) MV analysis of FastFS on the Middlebury optical flow datasets (Dimetrodon, Hydrangea, RubberWhale, Venus, Urban2, Urban3, Grove2, Grove3).

2.2. Motion Vector Pre-processing

The proposed multi-scale scheme requires MVs at different block sizes, while state-of-the-art video coding standards support variable block sizes; for example, H.264/AVC allows motion partitions from 4×4 to 16×16. In the proposed algorithm the block size is fixed within each scale, so the MVs of variable block sizes must be split and merged for each block size: 16×16 MVs are assigned to all the 16×16 blocks at the first scale, 8×8 MVs to all the 8×8 blocks at the second scale, and 4×4 MVs to all the 4×4 blocks at the final scale. The block splitting and merging methods are based on the macroblock types obtained from the H.264/AVC decoder. To avoid outlier MVs, the block merging method takes not only the macroblock types but also the neighboring MVs into consideration; that is, the chosen MV is the smoothest one in the local area. Although the later global optimization might correct these bad MVs, the pre-processing costs less effort. A sketch of the split and merge operations follows.
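In the sketch below (ours), splitting is plain replication, and merging picks one representative MV per larger block; the median-distance criterion is our stand-in for the paper's "smoothest MV in the local area" rule, which additionally consults macroblock types, and the function names are ours.

```python
import numpy as np

def split_to_blocks(mv16, scale):
    """Split 16x16-block MVs down by replication; decoded sub-macroblock
    MVs would overwrite these entries where the macroblock type indicates
    a finer partition. scale=1 gives the 8x8 grid, scale=2 the 4x4 grid."""
    return np.repeat(np.repeat(mv16, 2 ** scale, axis=0), 2 ** scale, axis=1)

def merge_to_blocks(mv4):
    """Merge 4x4-block MVs up one scale (to 8x8 blocks), keeping in each
    2x2 group the MV closest to the group's component-wise median, so
    outlier MVs in the group are suppressed."""
    H, W, _ = mv4.shape
    out = np.zeros((H // 2, W // 2, 2), dtype=mv4.dtype)
    for by in range(H // 2):
        for bx in range(W // 2):
            group = mv4[2*by:2*by+2, 2*bx:2*bx+2].reshape(4, 2)
            med = np.median(group, axis=0)
            out[by, bx] = group[np.argmin(np.abs(group - med).sum(axis=1))]
    return out
```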

2.3. Predictor Selection

As shown in Fig. 2, after the block size is determined and all the initial MVs are assigned by the MV pre-processing stage, predictors are selected from the initial MV field. According to the MV analysis in Fig. 4, the probability that the true MV exists is high given a large enough PSR. We therefore choose PSR = 32, meaning MVs in the range of ±32 pixels are selected as predictors; when the block size is 16, ±32 pixels corresponds to ±2 blocks in both the x and y directions. The predictor selection strategy and the MRF model of the proposed algorithm are shown in Fig. 5(a). Nine predictors are selected: one from the MV map at the upper scale and eight from the MV map at the current scale. The selected predictors form the search space of each node in the MRF model, as depicted in Fig. 5(b), and the optimization operates only on these predictors instead of the candidates of a conventional full search space.

Fig. 5. (a) The strategy of predictor selection. (b) The search space of the proposed MRF model; each block has a set of predictors for its MV labeling.
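Under these choices, predictor gathering might look like the following sketch (ours); the eight current-scale sampling positions at the corners and edge midpoints of the ±2-block window are our assumption, since Fig. 5(a) defines the actual pattern.

```python
import numpy as np

# Offsets (in blocks) of the eight current-scale predictors: an assumed
# sampling of the +/-2-block PSR window, not the paper's exact layout.
CURRENT_SCALE_OFFSETS = [(-2, -2), (-2, 0), (-2, 2), (0, -2),
                         (0, 2), (2, -2), (2, 0), (2, 2)]

def select_predictors(mv_cur, mv_upper, by, bx):
    """Gather the 9 candidate MVs for block (by, bx): one inherited from
    the co-located upper-scale block, eight sampled from the current
    scale within the predictive search range."""
    H, W, _ = mv_cur.shape
    preds = [mv_upper[by // 2, bx // 2]]   # upper-scale predictor
    for dy, dx in CURRENT_SCALE_OFFSETS:
        y = np.clip(by + dy, 0, H - 1)     # clamp at frame borders
        x = np.clip(bx + dx, 0, W - 1)
        preds.append(mv_cur[y, x])
    return np.stack(preds)                 # (9, 2) search space of the node
```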

Fig. 6. Multi-scale BP concept (a 16×16 block at the highest scale covers a 4×4 grid of 4×4 blocks at the lowest scale).

2.4. Simplified Belief Propagation

Belief propagation is chosen to minimize the energy of the MRF model, as shown in Fig. 2. The basic idea of belief propagation is to perform message passing iteratively and to approximate the global minimum using local messages. In the conventional approach, each pixel requires O(n^2) computation due to the full-search candidates, whereas the proposed algorithm requires only O(n) computation after the search-space reduction by predictor selection. In addition, we adopt the multi-scale concept from [7]. Instead of pixel-based operation, the 4×4 block is taken as the smallest unit; as depicted in Fig. 6, belief propagation operates from the highest scale (16×16 blocks) down to the lowest scale (4×4 blocks). The sum of absolute differences (SAD) and the MV difference are used as the data term E_d and smoothness term E_s of the energy function:

$$
E_d(MV) = \sum_{(x,y)} \big| f_t(x, y) - f_{t+1}(x + MV_x,\, y + MV_y) \big|
\tag{3}
$$

$$
E_s(MV_p, MV_q) = MVD_x^2 + MVD_y^2
\tag{4}
$$

where f_t is the video frame at time t, f_{t+1} is the video frame at time t + 1, and the sum in Eq. (3) runs over the pixels of the current block.
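The following sketch (ours) shows Eqs. (3) and (4) and one min-sum message update restricted to the predictor sets, which is where the O(n) candidate count enters; the smoothness weight `lam`, integer rounding of MVs, and border clamping are our simplifications.

```python
import numpy as np

def data_term(frame_t, frame_t1, by, bx, mv, B=4):
    """Eq. (3): SAD between the BxB block at (by, bx) in frame t and its
    MV-displaced counterpart in frame t+1 (clamped to the frame)."""
    y, x = by * B, bx * B
    dy, dx = int(round(mv[1])), int(round(mv[0]))
    H, W = frame_t1.shape
    y2, x2 = np.clip(y + dy, 0, H - B), np.clip(x + dx, 0, W - B)
    cur = frame_t[y:y+B, x:x+B].astype(np.int32)
    ref = frame_t1[y2:y2+B, x2:x2+B].astype(np.int32)
    return int(np.abs(cur - ref).sum())

def smooth_term(mv_p, mv_q):
    """Eq. (4): squared MV difference between neighboring blocks."""
    d = mv_p - mv_q
    return float(d[0] ** 2 + d[1] ** 2)

def message(preds_p, preds_q, d_costs_p, in_msgs_p, lam=1.0):
    """One min-sum message from block p to neighbor q over the predictor
    sets: O(|preds|) candidates instead of a full search window.
    in_msgs_p: summed messages into p from its other neighbors, per label."""
    m = np.empty(len(preds_q))
    for j, mv_q in enumerate(preds_q):
        m[j] = min(d_costs_p[i] + lam * smooth_term(mv_p, mv_q) + in_msgs_p[i]
                   for i, mv_p in enumerate(preds_p))
    return m - m.min()   # normalize to keep message values bounded
```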

3. EXPERIMENTAL RESULTS

To demonstrate the performance of the proposed TME scheme, we integrate the MRF models into the H.264/AVC reference software, JM version 14.0. The well-known Middlebury optical flow datasets [9] and several other video sequences are used to evaluate the performance of the proposed algorithm.

3.1. Frame Rate Up-Conversion

To verify that the MV fields estimated by the proposed algorithm are suitable for video processing applications, we implement a FRUC framework based on [4]. This framework includes bidirectional overlapped block motion estimation (OBME), MV field smoothing with a median filter, and adaptive overlapped block motion compensation (OBMC). For comparison, we substitute the MV fields estimated by the proposed algorithm to interpolate the intermediate frames. Fig. 7 shows the peak signal-to-noise ratio (PSNR) evaluation of the FRUC. The two algorithms perform similarly on the slow-motion video (Akiyo). On the other hand, the proposed algorithm achieves higher PSNR on the camera-motion video (mobile calendar) because of the global MV field optimization. On the high-speed-motion video (table tennis) and the complex-motion video (foreman), there is little difference between the two algorithms. However, OBME requires a full search with an enlarged search range, whereas the proposed algorithm has relatively lower computational complexity, O(n), compared with OBME and the conventional full-search TME scheme, while maintaining good performance.

Fig. 7. PSNR evaluation of FRUC with OBME + median filter and with the proposed algorithm: (a) Akiyo.yuv (slow motion), (b) table_tennis.yuv (fast motion), (c) mobile_calendar.yuv (camera motion), (d) foreman_cif.yuv (complex motion).
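To make the FRUC use of the MV field concrete, here is a minimal bidirectional motion-compensated interpolation sketch (ours); it stands in for the OBME/OBMC pipeline of [4], with block-truncated offsets and simple two-sample averaging as our simplifications.

```python
import numpy as np

def interpolate_midframe(frame_t, frame_t1, mvf, B=4):
    """Interpolate the frame halfway between t and t+1: each block
    averages the two samples reached by following -MV/2 into frame t
    and +MV/2 into frame t+1 along its motion trajectory."""
    H, W = frame_t.shape
    out = np.zeros_like(frame_t, dtype=np.float32)
    for by in range(H // B):
        for bx in range(W // B):
            mvx, mvy = mvf[by, bx]
            y, x = by * B, bx * B
            yb = int(np.clip(y - mvy / 2, 0, H - B))   # backward sample
            xb = int(np.clip(x - mvx / 2, 0, W - B))
            yf = int(np.clip(y + mvy / 2, 0, H - B))   # forward sample
            xf = int(np.clip(x + mvx / 2, 0, W - B))
            out[y:y+B, x:x+B] = (frame_t[yb:yb+B, xb:xb+B].astype(np.float32)
                                 + frame_t1[yf:yf+B, xf:xf+B]) / 2
    return out.astype(frame_t.dtype)
```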

3.2. Motion Vector Field

To show that the proposed algorithm can approximate the true MV field, video sequences both with and without GT MV fields are evaluated. The results are shown in Fig. 8 and Fig. 9, respectively. Although the estimated MV fields do not completely match the GT, most true MVs in the videos are reconstructed.

4. CONCLUSION

In this paper, an MRF-based TME scheme is proposed. With the decoding information obtained from H.264/AVC, the computational complexity of the MRF model is reduced; in turn, the MV field of H.264/AVC, modeled by the MRF in the proposed algorithm, is efficiently optimized using belief propagation. The experimental results show that the optimized MV field is plausible for the FRUC application. Future work will involve reusing more decoding information and a hardware implementation. Furthermore, because we rely more on global smoothness constraints than on low-level cues, the local motion is less accurate than that estimated by optical flow methods; this is an important issue when an application requires more accurate MVs.

Acknowledgment

Part of this project is supported by Himax Technologies, Inc.

5. REFERENCES



[1] G. de Haan, P. W. A. C. Biezen, H. Huijgen, and O. A. Ojo, “True-motion estimation with 3-D recursive search block matching,” IEEE Trans. Circuits Syst. Video Technol., vol. 3, no. 5, pp. 368–379, Oct. 1993.

[2] J. Wang, D. Wang, and W. Zhang, “Temporal compensated motion estimation with simple block-based prediction,” IEEE Trans. Broadcast., vol. 49, no. 3, pp. 241–248, Sept. 2003.

[3] Shen-Chuan Tai, Ying-Ru Chen, Zheng-Bin Huang, and Chuen-Ching Wang, “A multi-pass true motion estimation scheme with motion vector propagation for frame rate up-conversion applications,” Journal of Display Technol., vol. 4, no. 2, pp. 188–197, June 2008.

[4] Ya-Ting Yang, Yi-Shin Tung, and Ja-Ling Wu, “Quality enhancement of frame rate up-converted video by adaptive frame skip and reliable motion extraction,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 12, pp. 1700–1713, Dec. 2007.

[5] Ai-Mei Huang and Truong Nguyen, “Correlation-based motion vector processing with adaptive interpolation scheme for motion-compensated frame interpolation,” IEEE Trans. Image Processing, vol. 18, no. 4, pp. 740–752, Apr. 2009.

[6] William T. Freeman, Egon C. Pasztor, and Owen T. Carmichael, “Learning low-level vision,” Int. J. Comput. Vision, vol. 40, no. 1, pp. 25–47, Oct. 2000.

[7] Pedro F. Felzenszwalb and Daniel P. Huttenlocher, “Efficient belief propagation for early vision,” Int. J. Comput. Vision, vol. 70, no. 1, pp. 41–54, 2006.

[8] Chia-Kai Liang, Chao-Chung Cheng, Yen-Chieh Lai, Liang-Gee Chen, and Homer H. Chen, “Hardware-efficient belief propagation,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009, pp. 80–87.

[9] S. Baker, D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski, “A database and evaluation methodology for optical flow,” in Proc. 11th Int. Conf. on Computer Vision (ICCV), Oct. 2007, pp. 1–8.

Fig. 8. (a) The 4th frame of Urban2, and the MV fields of Urban2 from the 4th frame to the 5th frame: (b) ground truth, (c) using the proposed MRF-based TME, (d) from the H.264 decoder. (e) The 1st frame of Venus, and the MV fields of Venus from the 1st frame to the 2nd frame: (f) ground truth, (g) using the proposed MRF-based TME, (h) from the H.264 decoder.

Fig. 9. (a)(b) The 4th and 5th frames of Backyard, and the MV fields of Backyard from the 4th frame to the 5th frame: (c) using the proposed MRF-based TME, (d) from the H.264 decoder. (e)(f) The 4th and 5th frames of Backyard, and the MV fields from the 4th frame to the 5th frame: (g) using the proposed MRF-based TME, (h) from the H.264 decoder.
