Fast Adaptive Block Based Motion Estimation for Video Compression

A dissertation presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Doctor of Philosophy

Yi Luo June 2009 © 2009 Yi Luo. All Rights Reserved

This dissertation titled Fast Adaptive Block Based Motion Estimation for Video Compression

by YI LUO

has been approved for the Department of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by

Mehmet Celenk Associate Professor of Electrical Engineering and Computer Science

Dennis Irwin Dean, Russ College of Engineering and Technology

ABSTRACT

LUO, YI, Ph.D., June 2009, Electrical Engineering

Fast Adaptive Block Based Motion Estimation for Video Compression (200 pp.)

Director of Dissertation: Mehmet Celenk

In this dissertation, a new block-based motion estimation (ME) method is proposed which uses Kalman filtering (KF) with adaptive block partitioning (ABP) to improve the motion estimates resulting from conventional block-matching algorithms (BMAs). In our work, a first-order autoregressive (AR) model is applied to the motion vectors (MVs) obtained by BMAs. A new approach is developed for adaptively adjusting the state parameters of the Kalman filter, and the motion correlations between neighboring blocks are used to predict motion information. According to the statistics of the frame MVs, 16x16 macro-blocks (MBs) are adaptively split into 8x8 blocks or 4x4 sub-blocks for fine-grain operation of the Kalman filtering. To further improve the performance of MV prediction, we adopt a zigzag scanning of blocks or sub-blocks, and the state parameters of the Kalman filter are updated successively during each iteration in accordance with the outcome of the zigzag-based block or sub-block scanning. The experimental results indicate that the proposed method can effectively improve the ME performance in terms of the peak-signal-to-noise-ratio (PSNR) of the motion compensated images, with smoother motion vector fields, as compared to existing approaches in the literature. The scheme described herein is also tested on high-resolution video samples, yielding the least motion artifacts in the reconstructed image frames. Such robust performance makes it an ideal temporal redundancy extraction engine for a wide variety of video transmissions and new digital TV applications. From the standpoint of micro- to nano-scale media development, however, block-based KF motion prediction is not the most cost-effective or fastest approach for mobile video communications and computing devices, including the third (3G) and fourth (4G) generation technology standards. To this end, as a second part of the research, we focused on developing a fast binary partition tree based variable-size video coding system. The new adaptive algorithms proposed herein are applied to a video encoder with binary partition trees. First, to reduce the computation of block-matching, an adaptive search area method is described which adjusts the search region according to the size of each block. Second, an early termination method is introduced which terminates the binary partitioning process adaptively according to the statistics of the peak-signal-to-noise-ratio values during each step of block splitting. Third, we put forward a new model for fast rate-distortion (R-D) estimation to decrease the computation of matching pursuit (MP) coding for residual images. Simulation results show that the proposed techniques provide a significant gain in computation speed with little or no sacrifice of R-D performance when compared with the non-adaptive binary partitioning scheme.

Approved: _____________________________________________________________ Mehmet Celenk Associate Professor of Electrical Engineering and Computer Science

ACKNOWLEDGMENTS

First, I would like to express my most sincere gratitude to my advisor, Dr. Mehmet Celenk, for his constant encouragement and guidance throughout my graduate study and research at Ohio University. Without his support and help, this thesis work would have been impossible. I am also grateful to my doctoral advisory committee members and college representatives, Drs. Jeffrey Dill, Maarten Uijt de Haag, Jundong Liu, Vardges Melkonian, and Hans Kruse, for reviewing my dissertation and providing many corrections and helpful comments. I would like to take this opportunity to thank many of my friends for their support and encouragement, especially Yan Qin, Yinyin Liu, Shuisheng Xie, Limin Ma, Xin Jin, Yufeng Zhai, Yong Liu, and Li Li. I am grateful to my parents and my sister for their understanding and patience through the entire period of my study. Without their support, this thesis work would have been impossible in many ways.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ACRONYMS AND ABBREVIATIONS

CHAPTER 1: INTRODUCTION
  1.1 Basics of a Video Coding System
  1.2 Challenges
  1.3 Major Contributions
  1.4 Dissertation Organization

CHAPTER 2: A NEW SPATIO-TEMPORAL CORRELATION BASED ROOD SEARCH FOR MOTION ESTIMATION
  2.1 Representation of Transformations
  2.2 Motion Estimation in Video Processing
    2.2.1 Region Matching Methods
    2.2.2 Methods Based on Spatio-Temporal Constraint
  2.3 Block-Matching for Motion Estimation and Compensation
    2.3.1 Block-Matching and Motion Estimation
    2.3.2 Block-Based Motion Compensation
    2.3.3 Block-Matching Algorithms
      2.3.3.1 Full Search
      2.3.3.2 Three-Step Search
      2.3.3.3 Four-Step Search
      2.3.3.4 Diamond Search
      2.3.3.5 Adaptive Rood Pattern Search
  2.4 A New Block-Matching Method Using Rood Pattern and Spatio-Temporal Correlation
    2.4.1 Introduction
    2.4.2 Candidate Neighbouring Block Selection and MV Prediction
    2.4.3 Search Pattern and Strategy
    2.4.4 Summary of ARS-ST Method
    2.4.5 Experimental Results

CHAPTER 3: BLOCK BASED MOTION ESTIMATION USING ADAPTIVE KALMAN FILTERING
  3.1 Introduction of the Kalman Filter
    3.1.1 Background Introduction
    3.1.2 Signal Estimation and Discrete Wiener Filter
    3.1.3 Discrete Kalman Filter
  3.2 Block Based Motion Estimation Using Adaptive Kalman Filtering
    3.2.1 Background Overview
    3.2.2 Introduction
    3.2.3 Motion Modeling and State-Space Representation
    3.2.4 A New Model for Adaptive Kalman Filtering (New-AKF)
    3.2.5 Summary of the Proposed Method
    3.2.6 Experimental Results
      3.2.6.1 Experimental Results Based on 1-D AR Model
      3.2.6.2 Experimental Results Based on 2-D AR Model

CHAPTER 4: KALMAN FILTERING BASED MOTION ESTIMATION WITH ADAPTIVE BLOCK PARTITIONING
  4.1 Introduction
  4.2 Algorithm Overview
    4.2.1 Adaptive Block Partition Algorithm
    4.2.2 Zigzag Scan for Blocks and Sub-blocks
    4.2.3 Motion Modeling and State-Space Representation
    4.2.4 Kalman Filtering with Adaptive State Parameters Adjustment
    4.2.5 Summary of AKF2-ABP Method
  4.3 Experimental Results

CHAPTER 5: AN ADAPTIVE VIDEO CODING SYSTEM BASED ON BINARY TREE BLOCK PARTITION AND MOTION ESTIMATION
  5.1 Overview of the Binary Tree Video Coding System
    5.1.1 Frame Type and Frame Encoding
    5.1.2 Binary Tree Block Partition and Motion Estimation
    5.1.3 Rate-Distortion Control
    5.1.4 Entropy Coding and Transform Coding
      5.1.4.1 Arithmetic Coding
      5.1.4.2 Matching Pursuit (MP)
  5.2 Binary Tree Block Partition with Adaptive Search Area Adjustment
    5.2.1 Method Overview
    5.2.2 Simulation Results
  5.3 An Adaptive Early Termination Method for Binary Tree Block Partition
    5.3.1 Algorithm Overview
    5.3.2 Simulation Results
  5.4 A Fast Rate-Distortion Cost Estimation Model for Matching Pursuit Video Coding
    5.4.1 Algorithm Overview
    5.4.2 Simulation Results
  5.5 Summary

CHAPTER 6: CONCLUSIONS AND FUTURE WORK
  6.1 Summary and Conclusions
  6.2 Future Work

REFERENCES
APPENDIX A: AN EXAMPLE OF ARITHMETIC CODING
APPENDIX B: SAMPLE FRAMES OF TEST VIDEO SEQUENCES
APPENDIX C: MATLAB CODE FOR SIMULATION

LIST OF TABLES

Table 2.1 Average number of search points for each macro-block
Table 2.2 Computational gain of ARS-ST over FS, DS, and ARPS
Table 2.3 Average PSNR (dB) performance
Table 3.1 Average PSNR (dB) of the first 30 P frames of the test video sequences with TSS as the BMA
Table 3.2 Average PSNR (dB) of the first 30 P frames of the test video sequences with DS as the BMA
Table 3.3 Computation required for each macro-block
Table 3.4 Average PSNR (dB) of the first 30 frames of the test video sequences with DS as the BMA
Table 3.5 Average PSNR (dB) of the first 30 frames of the test video sequences with ARPS as the BMA
Table 4.1 Partition of macro-blocks based on the values of std_mv and ave_SAD
Table 4.2 Average PSNR (dB) of the first 50 P frames of the test video sequences using DS and adaptive Kalman filtering
Table 4.3 Average PSNR (dB) of the first 50 P frames of the test video sequences using ARPS and adaptive Kalman filtering
Table 4.4 Average PSNR (dB) of the first 30 P frames of the test video sequences using DS and adaptive Kalman filtering
Table 5.1 Number of search points using a fixed search parameter and adaptive method 1 for the first P frame
Table 5.2 Number of search points using a fixed search parameter and adaptive method 1 for the first B frame
Table 5.3 Number of search points using a fixed search parameter and adaptive method 2 for the first P frame
Table 5.4 Number of search points using a fixed search parameter and adaptive method 2 for the first B frame
Table 5.5 Average computation reduction and speedup ratio for P and B frames
Table 5.6 Average number of blocks for block partition at different QP values for “Carphone” (QCIF)
Table 5.7 Average number of blocks for block partition at different QP values for “Mobile” (QCIF)
Table 5.8 MP encoding computation reduced by applying the MP optimization method for the P frames in the first 30 frames of “Carphone”
Table 5.9 MP encoding computation reduced by applying the MP optimization method for the B frames in the first 30 frames of “Carphone”
Table 5.10 Overall encoding time reduced by New-BPT over BPT for “Carphone”
Table 5.11 Overall encoding time reduced by New-BPT over BPT for “Mobile”
Table 7.1 Alphabet table and probabilities for each symbol
Table 7.2 Probabilities for the symbols (a, b, c) to be encoded
Table 7.3 Intervals for each symbol in the encoding process
Table 7.4 Intervals for the decoding process for Z = 0.595

LIST OF FIGURES

Figure 1.1 An example of a digital video system.
Figure 1.2 Block diagram of a typical video encoder.
Figure 2.1 An object with a motion of translation of (dx, dy) from t-1 to t0.
Figure 2.2 An object with a motion of scaling from t-1 to t0.
Figure 2.3 An object with a motion of rotation by an angle φ about the t axis from t-1 to t0.
Figure 2.4 An object with a motion of translation, scaling, and rotation from t-1 to t0.
Figure 2.5 Block-matching with search parameter p = P.
Figure 2.6 Examples of matching error surface.
Figure 2.7 An example sequence of MPEG frames and the inter-frame dependencies.
Figure 2.8 An illustration of motion compensation.
Figure 2.9 Three-step search procedure for the motion vector (1, -7).
Figure 2.10 Four-step search procedure for the motion vector (3, -5).
Figure 2.11 Diamond search procedure for the motion vector (4, -2).
Figure 2.12 Adaptive rood pattern search: the predicted motion vector is (3, -1) and the rood arm length is 4.
Figure 2.13 The current block i and its neighboring blocks.
Figure 2.14 The current block “A” and its neighbouring blocks “B”, “C”, “D”, and “E” in the spatial domain; “a”, “b”, “c”, “d”, and “e” are the corresponding blocks in the reference frame.
Figure 2.15 Block diagram for motion vector prediction.
Figure 2.16 New adaptive rood pattern search: the predicted motion vector is (-2, -1) and the rood arm lengths are Rh = 3, Rv = 2.
Figure 2.17 PSNR performance of “Foreman.”
Figure 2.18 PSNR performance of “Bus.”
Figure 3.1 Block diagram for signal estimation.
Figure 3.2 Block diagram for the Kalman filter state model.
Figure 3.3 Predictor-corrector cycle in a discrete Kalman filter.
Figure 3.4 Operation of the Kalman filter.
Figure 3.5 PSNR performance of the first 30 frames of “Foreman.”
Figure 3.6 PSNR performance of the first 30 frames of “Bus.”
Figure 3.7 PSNR performance of the first 30 frames of “Carphone” with DS as the BMA.
Figure 3.8 PSNR performance of the first 30 frames of “Carphone” with ARPS as the BMA.
Figure 4.1 Partition of a 16x16 MB into 8x8 blocks or 4x4 sub-blocks.
Figure 4.2 An illustration of the two features that affect the performance of block partition.
Figure 4.3 Different scan orders for 8x8 blocks and 4x4 sub-blocks.
Figure 4.4 MVs of the 25th frame of the “Carphone” sequence generated by DS and AKF2 on different block partitions.
Figure 4.5 PSNR performance of the first 50 P frames of “Football” with DS as the BMA and search parameter p = 15.
Figure 4.6 PSNR performance of the first 50 P frames of “Bus” with ARPS as the BMA and search parameter p = 15.
Figure 4.7 PSNR performance of the first 30 P frames of “Ice” with DS as the BMA and search parameter p = 15.
Figure 4.8 PSNR performance of the first 30 P frames of “Shields” with DS as the BMA and search parameter p = 15.
Figure 5.1 Block diagram of the video encoder.
Figure 5.2 Partition of a block using a binary tree.
Figure 5.3 Block partition of the first P frame of “Carphone” (QCIF).
Figure 5.4 Block partition of the first B frame of “Carphone” (QCIF).
Figure 5.5 PSNR performance of “Carphone” (QCIF) at different bit rates.
Figure 5.6 PSNR performance of “Mobile Calendar” (QCIF) at different bit rates.
Figure 5.7 Block partition and binary tree refining for the first P frame of “Carphone” with QCIF format and QP = 36.
Figure 5.8 Block partition and binary tree refining for the first B frame of “Carphone” with QCIF format and QP = 36.
Figure 5.9 Plots of the standard deviation of PSNR and F for the first P frame of “Carphone” with QP = 36.
Figure 5.10 Plots of the standard deviation of PSNR and F for the first B frame of “Carphone” with QP = 36.
Figure 5.11 Block partition for the P1, B1, and B2 frames of “Carphone” with QP = 36.
Figure 5.12 Plots of cost J for “Carphone” after the refining procedure with or without early termination.
Figure 5.13 Optimal number of blocks for “Carphone” after the refining procedure with or without early termination.
Figure 5.14 PSNR performance of “Carphone” (QCIF) at different bit rates with the first 49 frames.
Figure 5.15 PSNR performance of “Mobile Calendar” (QCIF) at different bit rates with the first 49 frames.
Figure 5.16 Matching pursuit encoding of the first tile of the 4th frame (P) of “Carphone” with QP = 32.
Figure 5.17 Distortion deduction curve after an average filtering with the window size of 5 for the 4th P frame with QP = 32.
Figure 5.18 PSNR performance of “Carphone” with or without MP optimization.
Figure 5.19 Plots of cost J for “Carphone” with or without MP optimization.
Figure 5.20 R-D performance of “Carphone” with or without MP optimization.
Figure 5.21 Rate-distortion performance of different encoders for “Carphone.”
Figure 5.22 Rate-distortion performance of different encoders for “Mobile Calendar.”
Figure 7.1 An example of arithmetic coding for the symbol stream “baca.”

ACRONYMS AND ABBREVIATIONS

ARPS    Adaptive Rood Pattern Search
AVC     Advanced Video Coding
BPT     Binary Tree Partition
BM      Block-Matching
BMA     Block-Matching Algorithm
CABAC   Context Adaptive Binary Arithmetic Coding
CCITT   International Telegraph and Telephone Consultative Committee
CIF     Common Intermediate Format (resolution 352×288)
DCT     Discrete Cosine Transform
DPCM    Differential Pulse Code Modulation
DS      Diamond Search
DV      Digital Video
DWT     Discrete Wavelet Transform
FS      Full Search
HVS     Human Visual System
IEC     International Electrotechnical Commission
ISO     International Organization for Standardization
ITU     International Telecommunication Union
JPEG    Joint Photographic Experts Group
JVT     Joint Video Team
KF      Kalman Filtering
LQE     Linear Quadratic Estimation
M-JPEG  Motion JPEG
MAD     Mean Absolute Difference
MB      Macro-Block
MC      Motion Compensation
ME      Motion Estimation
MOS     Mean Opinion Score
MSE     Mean Squared Error
MV      Motion Vector
MP      Matching Pursuit
MPEG    Moving Picture Experts Group
NTSS    New Three-Step Search
PSNR    Peak-Signal-to-Noise-Ratio
QCIF    Quarter Common Intermediate Format (resolution 176×144)
QP      Quantization Parameter
R-D     Rate-Distortion
SAD     Sum of Absolute Differences
SIF     Source Interchange Format (resolution 352×240)
SQCIF   Sub-Quarter CIF (resolution 128×96)
SSD     Sum of Squared Differences
TSS     Three-Step Search
VLC     Variable Length Coding
WSS     Wide-Sense Stationary

CHAPTER 1: INTRODUCTION

Nowadays, visual information plays an increasingly important role in our daily life and affects the way we communicate and live in many aspects. Digital image and video applications usually involve the storage or transmission of vast amounts of data. With the advances of digital technologies, this has become more feasible than before. The research on digital image and video coding dates back to the 1950s and 1960s, when spatial differential pulse code modulation (DPCM) [1, 2] was utilized to encode still images. Transform coding techniques were studied in the 1970s, and the well-known block-based discrete cosine transform (DCT) [3, 4] was proposed by Ahmed et al. In the 1980s, motion estimation and compensation techniques were applied to video coding, providing a significant compression gain over DCT-based pure intra-frame coding (i.e., JPEG [5-7]) for video compression. The discrete wavelet transform (DWT) [8-12] has also been studied since the 1980s for image and video coding and has become the core technology of the JPEG2000 [13-16] still image coding standard. Today, modern image and video processing techniques are employed in many fields such as digital TV broadcasting, surveillance systems, medical imaging for disease diagnosis, image/video-based web search, etc.

1.1 Basics of a Video Coding System

An illustration of a typical digital video system is given in Figure 1.1. Usually, the source video is captured by a camera or a video recorder in analog or digital format. Video in analog format needs to be converted to digital format before further processing. After digitization, the bit rate produced by the raw video can be very high. For example, the typical bit rate of NTSC video is about 150 Mbps, which is far too high for today's typical network bandwidth. Therefore, compression techniques must be applied to digital video before it can be further manipulated. Video compression is the key technology that makes the wide application of digital video systems possible. As illustrated in Figure 1.1, after compression the video is converted to a format appropriate for network transmission or for storage on hard drives, disks, cassettes, etc. To date, various video compression techniques have been developed, and we will discuss some of them in more detail in later chapters.
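As a rough check on that figure, assume 720×480 pixels per frame at about 30 frames per second with 4:2:2 chroma subsampling (an average of 16 bits per pixel): the raw bit rate is then 720 × 480 × 30 × 16 ≈ 166 Mbps, the same order of magnitude as the 150 Mbps quoted above.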

Figure 1.1 An example of a digital video system. (Encoder side: A/D conversion, data pre-processing, data compression, and format conversion; decoder side: de-formatting, data decompression, data post-processing, and D/A conversion; the two sides are linked by storage or network transmission.)

Encoder and decoder

As can be seen from Figure 1.1, a digital video system usually includes two parts: the encoder and the decoder. The quality of a video is determined during the process of encoding. When decoding the video, the decoder reconstructs it according to the data in the compressed video stream, which dictates the quality of the video. A powerful encoder can employ smarter strategies and more complicated algorithms to achieve high video quality while maintaining a good compression ratio. In the existing video coding standards, such as MPEG-4 and H.264 [17-20], a lot of freedom is allowed in the encoding process; that is, the standards do not specify or define the encoding process in a rigid way. This allows tradeoffs among video compression ratio, frame quality, and encoder complexity. Depending on the application, different systems providing different video quality are developed. Some systems include only an encoder, such as online video broadcasting systems, while others may have only a decoder, for example a DVD player. Most systems require both an encoder and a decoder; a typical example is a video conference system. Figure 1.2 illustrates a block diagram of a video encoder based on MPEG-2 [21-23].

Figure 1.2 Block diagram of a typical video encoder. (The original picture sequence passes through format conversion, motion estimation/compensation, DCT, quantization, and entropy encoding under coding control; de-quantization, IDCT, and a buffer of coded frames form the reconstruction loop, and the output is the bit stream of the compressed video.)

Motivation for video compression

Although a digital video signal is more immune to noise than an analog signal, it usually requires a large space for storage or a wide bandwidth for transmission. With current network capacities, this makes real-time video communication or transmission impractical. To reduce the space and bandwidth requirements, video data need to be compressed to a fraction of their original size before storage or transmission. One way to achieve good image or video compression performance is to discard some of the relatively unimportant data that is not perceivable by the human visual system (HVS). A detailed introduction to the characteristics of the HVS can be found in [24, 25].

1.2 Challenges

In a video sequence, three types of redundancy usually exist: spatial redundancy, statistical redundancy, and temporal redundancy. Video can be compressed by removing or reducing these redundancies. The redundancy in the spatial domain is usually removed by transform coding, such as the DCT or matching pursuit (MP). Entropy coding, e.g., variable length coding (VLC), is an effective approach to removing statistical redundancy. Temporal redundancy is caused by still or moving objects and background that persist across successive frames of a video sequence. Motion estimation (ME) [26, 27] and motion compensation (MC) [28, 29] have been the most widely used methods in video compression and have become the standard approach to reducing the temporal redundancy between frames. Motion estimation is the process of determining the motion between two or more frames. Motion compensation uses the motion information and a given reference frame to reconstruct video frames; it can only be performed when an estimate of the motion is available.

Because of the intensive computation and the large amount of resources required by motion estimation, it has been an active research field over the past two decades, and various algorithms have been developed. Among them, the group of methods most widely used today are block-based techniques, called block-matching algorithms (BMAs) [30-36]. In this technique, a frame is divided into blocks, and the pixels in each block are assumed to have uniform motion. Each block is compared with candidate blocks in the reference frames within a search area to obtain a motion vector (MV). To date, much research has been devoted to developing fast, efficient block-matching algorithms. These algorithms aim to reduce the computation of motion estimation as much as possible while keeping the compensated frame quality as high as possible, thus improving the performance of the video codec as a whole.

Besides the techniques mentioned above, which use fixed-size blocks (16x16, 8x8, or 4x4) for motion estimation, another research direction in video compression is to use variable-size block partitions [37-42]. Recent research has shown that the compression performance of variable-size blocks is better than that of fixed-size blocks, because a more accurate motion description is possible. This makes these techniques a promising candidate for future video applications. However, their computational and implementation complexity is very high, which keeps them from real-time applications. How to reduce the computational complexity or encoding time thus remains a main problem to be solved.

1.3 Major Contributions

The research described in this dissertation provides the following contributions.

First, a new fast and efficient block-matching algorithm (BMA) for motion estimation and compensation is proposed. The new method uses a rood pattern with flexible rood arms for block-matching, and the spatio-temporal correlation between neighboring blocks is utilized to estimate the motion vector of each block. Simulation results demonstrate that the search speed of the proposed algorithm is faster than that of the well-known diamond search (DS) [34] and adaptive rood pattern search (ARPS) [35] algorithms, with close or even better motion compensated image quality in terms of peak-signal-to-noise-ratio (PSNR).

To enhance the motion estimation performance of block-matching methods, we propose a new scheme that applies the Kalman filter to the motion estimates. In our method, a first-order autoregressive model is applied to the motion vectors (MVs) obtained by BMAs, and the motion correlations between neighboring blocks are utilized to predict motion information. A new model is established by introducing a new error function, and a novel scheme is proposed that adaptively adjusts the state parameters of the Kalman filter according to the statistics of the measured MVs. The experimental results show that the proposed method produces more accurate motion vectors and better compensated image quality than the well-known Kalman filtering methods used for comparison.

From the observation that, under certain conditions, splitting macro-blocks into smaller blocks may bring about better motion estimation, we propose a new adaptive block partition algorithm that further improves the performance of Kalman filtering based block-matching. In the new algorithm, a new model is established to evaluate the motion complexity and magnitude. According to an empirical rule we developed, 16x16 macro-blocks are split adaptively into smaller (8x8 or 4x4) blocks, and a zigzag scan order is applied to the split blocks for the Kalman filtering (KF). Our experimental results show that the proposed method, in combination with the KF method proposed earlier, can effectively improve the ME performance in terms of the PSNR of the motion compensated images, with smoother motion vector fields.

Motion estimation and compensation based on variable-size blocks has better compression performance than that based on fixed-size blocks, and thus has the potential to be applied in future applications. In this work, by applying our new fast algorithms to a binary-tree block partition video encoder, we develop a fast adaptive binary-tree video coding system. In this system, to reduce the computation of the binary-tree block motion estimation, an adaptive search area method is proposed that adjusts the size of the search area according to the size of the block: large blocks are assigned relatively large search areas, whereas small blocks use relatively small search areas. An early termination algorithm is adopted in the encoder for binary-tree block partitioning. By monitoring the statistical characteristics of the peak-signal-to-noise-ratio (PSNR) values during each step of block splitting, the algorithm adaptively chooses the total number of blocks used for each frame and stops the binary tree growing process early. To reduce the computation of the transform coding, which uses matching pursuit (MP) to encode the residual images resulting from motion compensation, we put forward a new model for fast rate-distortion (R-D) cost estimation. By establishing an atom-rate model, the cost is estimated during each step of the MP encoding, thus avoiding redundant atoms in the MP encoding. Our simulation results show that, by adopting the proposed fast algorithms, the encoding time of the new video encoder is reduced greatly with little or no sacrifice of frame quality. This makes the new adaptive coding system more appropriate for real-time applications.

1.4 Dissertation Organization

The remainder of the dissertation is organized into the following chapters. Chapter 2 gives a theoretical analysis of motion estimation and describes how block-matching is carried out; a literature review of existing block-matching algorithms is given in this chapter, and a fast and efficient block-matching algorithm based on rood pattern search and spatio-temporal correlation is introduced in detail. Chapter 3 gives a brief introduction to Kalman filtering and describes a new way of applying adaptive Kalman filtering to block based motion estimation for performance improvement. Chapter 4 describes a Kalman filtering based block-matching algorithm with adaptive block partitioning. Chapter 5 describes a fast adaptive binary tree block video coding system with fast optimization algorithms applied to it. Chapter 6 gives the final conclusions and the future work.

CHAPTER 2: A NEW SPATIO-TEMPORAL CORRELATION BASED ROOD SEARCH FOR MOTION ESTIMATION

2.1 Representation of Transformations

A video is a sequence of still images, or frames, displayed in rapid succession. In most cases, the change of intensity from frame to frame is caused by the motion of objects within the frame. Motion estimation [26, 27] is the process of finding the movement of objects in a video sequence. Usually the motion of an object between frames can be represented by transformations such as translation, rotation, and scaling. We first give a brief description of the mathematical representations of these transformations [43]. All transformations discussed here are expressed in a Cartesian coordinate system with three dimensions (3-D): x, y, and t.

Figure 2.1 An object with a motion of translation of (dx, dy) from t-1 to t0.

Translation

To translate a point from position $(x_{-1}, y_{-1}, t_{-1})$ to $(x_0, y_0, t_0)$ using a displacement $(d_x, d_y, d_t)$, the translation can be expressed as:

$$x_0 = x_{-1} + d_x, \qquad y_0 = y_{-1} + d_y, \qquad t_0 = t_{-1} + d_t \tag{2.1}$$

Here, $(x_0, y_0, t_0)$ is a point at position $(x_0, y_0)$ of a frame at time index $t_0$, and $(x_{-1}, y_{-1}, t_{-1})$ is a point at $(x_{-1}, y_{-1})$ of a frame at time index $t_{-1}$; $d_x$, $d_y$, and $d_t$ are the displacements along the x, y, and t axes, respectively. Figure 2.1 illustrates an object with a translation of $(d_x, d_y, d_t)$ from time $t_{-1}$ to $t_0$. Rewriting equation (2.1) in matrix form, we get

$$\begin{bmatrix} x_0 \\ y_0 \\ t_0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & d_x \\ 0 & 1 & 0 & d_y \\ 0 & 0 & 1 & d_t \end{bmatrix} \cdot \begin{bmatrix} x_{-1} \\ y_{-1} \\ t_{-1} \\ 1 \end{bmatrix} \tag{2.2}$$

In order to develop a general form of transformation that includes translation, rotation, and scaling, square matrices are used to simplify the problem, and equation (2.2) becomes

$$\begin{bmatrix} x_0 \\ y_0 \\ t_0 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & d_x \\ 0 & 1 & 0 & d_y \\ 0 & 0 & 1 & d_t \\ 0 & 0 & 0 & 1 \end{bmatrix} \cdot \begin{bmatrix} x_{-1} \\ y_{-1} \\ t_{-1} \\ 1 \end{bmatrix} \;\;\Rightarrow\;\; v_0 = A \cdot v_{-1} \tag{2.3}$$

where $v_{-1} = [x_{-1}, y_{-1}, t_{-1}, 1]^T$, $v_0 = [x_0, y_0, t_0, 1]^T$, and the $4 \times 4$ matrix $A$ are the original coordinates, the transformed coordinates, and the transformation matrix, respectively.
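For a concrete instance of (2.3), take an assumed displacement $(d_x, d_y, d_t) = (1, -1, 1)$ applied to the point $(2, 3, 0)$:

$$\begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \\ 1 \\ 1 \end{bmatrix}$$

that is, $(x_0, y_0, t_0) = (3, 2, 1)$.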

Figure 2.2 An object with a motion of scaling from t-1 to t0.

Scaling

In video processing, each frame is a two-dimensional (2-D) image and scaling is only considered along the x and y axes. Figure 2.2 illustrates an object with a motion of scaling from t-1 to t0. Scaling by factors $s_x$ and $s_y$ along the x and y axes is given by the transformation matrix

$$S = \begin{bmatrix} s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{2.4}$$

Rotation

The transformation of rotation in three dimensions is more complex than translation and scaling. For example, rotating a point A about another point R in space requires the following steps: First, translate point R to the origin of the coordinate system and apply the same translation to point A. Second, rotate point A about the origin. Finally, translate the rotated point A back to its original coordinate system and obtain the translated position.

Figure 2.3 An object with a motion of rotation by an angle φ about the t axis from t-1 to t0.

In video processing, since each frame is a two-dimensional (2-D) image, rotation is only considered about the t axis. Figure 2.3 illustrates an object with a motion of rotation by an angle φ about the t axis from t-1 to t0. To rotate a point about the t axis by an angle φ, the corresponding transformation can be realized by the matrix in (2.5):

$$R_\varphi = \begin{bmatrix} \cos\varphi & \sin\varphi & 0 & 0 \\ -\sin\varphi & \cos\varphi & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{2.5}$$

Combining the three transformations, a point that undergoes translation, scaling, and rotation from t-1 to t0 (as illustrated in Figure 2.4) is mapped by the composite transform

$$v_0 = R_\varphi \cdot \big( S \cdot (A \cdot v_{-1}) \big) = T \cdot v_{-1} \tag{2.6}$$

Figure 2.4 An object with a motion of translation, scaling, and rotation from t-1 to t0.

Applying the composite transform $T$ to $n$ points at once gives

$$V_0 = T \cdot V_{-1} \tag{2.7}$$
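As a concrete illustration of (2.6), the following minimal MATLAB sketch (MATLAB being the language of the simulation code in Appendix C) builds the composite matrix T and applies it to one homogeneous point; the displacement, scale factors, and rotation angle are arbitrary values assumed only for this example.

% Minimal sketch of the composite transform T = R_phi * S * A, eq. (2.6).
dx = 2; dy = -1; dt = 1;                 % assumed translation displacements
sx = 1.5; sy = 0.5;                      % assumed scaling factors
phi = pi/6;                              % assumed rotation angle about the t axis
A = [1 0 0 dx; 0 1 0 dy; 0 0 1 dt; 0 0 0 1];        % translation, eq. (2.3)
S = diag([sx, sy, 1, 1]);                            % scaling, eq. (2.4)
R = [cos(phi) sin(phi) 0 0; -sin(phi) cos(phi) 0 0;  % rotation, eq. (2.5)
     0 0 1 0; 0 0 0 1];
T = R * S * A;                           % composite transform, eq. (2.6)
v_prev = [2; 3; 0; 1];                   % a point (x, y, t) in homogeneous form
v_curr = T * v_prev                      % transformed coordinates, v0 = T * v_{-1}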

⎡1 ⎢0 A −1 = ⎢ ⎢0 ⎢ ⎣0

2.2

⎡ cos(−ϕ ) sin(−ϕ ) ⎢− sin(−ϕ ) cos(−ϕ ) Rϕ−1 = ⎢ 0 0 ⎢ ⎢ 0 0 ⎣

0 0⎤ 0 0⎥ ⎥ 1 0⎥ ⎥ 0 1⎦

(2.8)

Motion Estimation in Video Processing

In a video sequence, the general form of the intensity function of a frame at time t0 can be expressed as ⎡ R( x, y, t 0 ) ⎤ f ( x, y, t0 ) = ⎢G ( x, y , t 0 )⎥ ⎢ ⎥ ⎣⎢ B( x, y, t0 ) ⎦⎥



or

⎡Y ( x , y , t 0 ) ⎤ f ( x, y, t 0 ) = ⎢U ( x, y, t 0 )⎥ ⎢ ⎥ ⎣⎢V ( x, y, t0 ) ⎦⎥



(2.9)



Here f ( x0 , y 0 , t0 ) is a function which has three component functions of RGB color planes or YUV color planes. The transform between RGB and YUV color space can be found in →

[24]. At position ( x0 , y 0 ) the intensity value is f ( x0 , y 0 , t0 ) , which is a vector of three components R, G, and B or Y, U, and V. According to the discussion in the section →



before, the intensity of f ( x, y, t0 ) at position ( x0 , y 0 ) can be transformed from f ( x, y, t −1 ) by

30 →



f ( x 0 , y 0 , t0 ) = f ( x −1 , y −1 , t −1 )

(2.10)





Where f ( x, y, t0 ) and f ( x, y, t −1 ) denote the image intensity at time index t0 and t-1, →



respectively. f ( x, y, t0 ) and f ( x, y, t −1 ) are the intensity function for the current and past frame. From equation (2.6), the relation between position ( x0 , y 0 , t0 ) and ( x −1 , y −1 , t −1 ) is defined by ⎡ x0 ⎤ ⎡ x −1 ⎤ ⎡ x −1 ⎤ ⎢y ⎥ ⎢y ⎥ ⎢y ⎥ ⎢ 0 ⎥ = Rϕ ( S ( A ⋅ ⎢ −1 ⎥ )) = T ⋅ ⎢ −1 ⎥ ⎢ t0 ⎥ ⎢ t −1 ⎥ ⎢ t −1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎣1⎦ ⎣ 1 ⎦ ⎣ 1 ⎦

(2.11)

If we consider a video stream as a 3-D volume in the (x, y, t) space, each frame can be regarded as a sample slice at discrete time points (…, t-1, t0, t1, t2, …). The sample rate, or time step Δt = ti − ti−1, depends on the frame rate of the video. For continuous motion, the following constraint condition must be satisfied:

$$\vec f^{\,-}(x_{-1}, y_{-1}, t_0) = \vec f^{\,+}(x_{-1}, y_{-1}, t_{-1}), \qquad \Delta t \to 0 \tag{2.12}$$

Here, the superscripts “−” and “+” denote just before and just after the corresponding time points. For the sake of simplicity, most video processing research and applications consider only the translational motion of objects, and in most cases only one color plane, for example the luminance Y in the YUV color space, is used for motion estimation. The motion estimation results are then applied to the other color planes. Thus we assume that

$$f(x, y, t_0) = f(x - d_x,\; y - d_y,\; t_{-1}) \tag{2.13}$$

Here f(x, y, t) denotes the intensity function of the Y color plane at time t. If we assume uniform motion between the frames, equation (2.13) can be rewritten as

$$f(x, y, t) = f\big(x - v_x (t - t_{-1}),\; y - v_y (t - t_{-1}),\; t_{-1}\big), \qquad t_{-1} \le t \le t_0 \tag{2.14}$$

where $v_x$ and $v_y$ are the speeds of the motion in the horizontal and vertical directions. Under the assumption that the motion is uniform translational within the spatio-temporal region, $v_x$ and $v_y$ can be related to $\partial f/\partial x$, $\partial f/\partial y$, and $\partial f/\partial t$ by differentiating equation (2.14) on both sides with respect to x, y, and t:

$$\frac{\partial f(x,y,t)}{\partial x} = \frac{\partial f}{\partial (x - v_x(t-t_{-1}))}\cdot\frac{\partial (x - v_x(t-t_{-1}))}{\partial x} + \frac{\partial f}{\partial (y - v_y(t-t_{-1}))}\cdot\frac{\partial (y - v_y(t-t_{-1}))}{\partial x} = \frac{\partial f}{\partial (x - v_x(t-t_{-1}))} \tag{2.15}$$

$$\frac{\partial f(x,y,t)}{\partial y} = \frac{\partial f}{\partial (x - v_x(t-t_{-1}))}\cdot\frac{\partial (x - v_x(t-t_{-1}))}{\partial y} + \frac{\partial f}{\partial (y - v_y(t-t_{-1}))}\cdot\frac{\partial (y - v_y(t-t_{-1}))}{\partial y} = \frac{\partial f}{\partial (y - v_y(t-t_{-1}))} \tag{2.16}$$

$$\frac{\partial f(x,y,t)}{\partial t} = \frac{\partial f}{\partial (x - v_x(t-t_{-1}))}\cdot(-v_x) + \frac{\partial f}{\partial (y - v_y(t-t_{-1}))}\cdot(-v_y) \tag{2.17}$$

From equations (2.15) to (2.17), we obtain the following equation:

$$v_x \cdot \frac{\partial f(x,y,t)}{\partial x} + v_y \cdot \frac{\partial f(x,y,t)}{\partial y} + \frac{\partial f(x,y,t)}{\partial t} = 0 \tag{2.18}$$
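As a quick sanity check of (2.18), consider a translating linear ramp $f(x, y, t) = a\,(x - v_x t) + b\,(y - v_y t)$. Then $\partial f/\partial x = a$, $\partial f/\partial y = b$, and $\partial f/\partial t = -a v_x - b v_y$, so that $v_x a + v_y b + (-a v_x - b v_y) = 0$, exactly as the constraint requires.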

Equation (2.18) is called a spatio-temporal constraint equation. It can be generalized to handle motions of other types, such as scaling/zooming. In the derivation of the equations above, we assumed that motion is only translational for equation (2.13) and has uniform speed for equation (2.14). Thus we do not consider motions such as rotation, scaling/zooming, regions covered or uncovered by object motion, or multiple objects with different speeds (vx, vy). However, if we assume uniform translational motion only for small local regions and calculate the motion components (vx, vy) at each small region or at each pixel, equation (2.18) can be applied to object regions that do have uniform motion or to background regions that are not affected by object motion. Such regions take up a considerable percentage of typical video frames. Based on equations (2.14) and (2.18), there are two mainstream techniques for motion estimation: region-matching algorithms based on (2.14) and spatio-temporal constraint approaches based on (2.18).

2.2.1 Region Matching Methods

Region matching methods assume that all the pixels within a region have the same motion activity. For a given region in the current frame, the methods search for the displacement of the best-matching region in the reference frame. The methods estimate motion on the basis of regions and produce one displacement vector (dx, dy) for each region. The displacement vector is also called the motion vector (vx, vy), which is obtained by minimizing an error function defined as

$$\mathrm{Error} = \iint_{(x,y)\in R} E\big(f(x, y, t_0),\; f(x - d_x,\, y - d_y,\, t_{-1})\big)\, dx\, dy \tag{2.19}$$

The region that minimizes the error function (2.19) is the best-matching region, and the corresponding displacement vector (dx, dy) is the motion vector (vx, vy) of the region. Here, E(·) denotes a metric for calculating the difference between the two regions. The two most widely used metrics are the absolute difference and the squared difference, as shown in equations (2.20) and (2.21):

$$\mathrm{Error} = \iint_{(x,y)\in R} \big| f(x, y, t_0) - f(x - d_x,\, y - d_y,\, t_{-1}) \big|\, dx\, dy \tag{2.20}$$

$$\mathrm{Error} = \iint_{(x,y)\in R} \big[ f(x, y, t_0) - f(x - d_x,\, y - d_y,\, t_{-1}) \big]^2\, dx\, dy \tag{2.21}$$

The problem of minimizing equation (2.20) or (2.21) is nonlinear. The approaches for solving this nonlinear problem can be grouped into two main classes: block-matching methods and recursive methods.

Block-matching algorithms (BMAs)

Block-based motion estimation is a typical type of region-matching method that uses rectangular blocks for motion estimation. It is widely used for motion estimation in current video processing applications. BMAs assume that all the pixels within a block have the same motion activity. The methods estimate motion on the basis of blocks and produce one motion vector for each block. In general, BMAs are more suitable for simple hardware realization because of their regularity and simplicity. We will discuss BMAs in more detail in later sections.

Recursive algorithms (RAs)

Recursive methods [44, 45] iteratively refine the motion estimates of individual pixels by gradient methods. We use $(\tilde d_x(n), \tilde d_y(n))$ to denote the estimate of $(d_x, d_y)$ at the nth step. The estimate of $(d_x, d_y)$ at the (n+1)th step is obtained by

$$\tilde d_x(n+1) = \tilde d_x(n) + \Delta_x(n) \tag{2.22}$$

$$\tilde d_y(n+1) = \tilde d_y(n) + \Delta_y(n) \tag{2.23}$$

where $\Delta_x(n)$ and $\Delta_y(n)$ are the update terms. Depending on the descent method used, the update terms may differ. Here we use the steepest descent method as an illustration; equations (2.22) and (2.23) then become

$$\tilde d_x(n+1) = \tilde d_x(n) - s \cdot \left.\frac{\partial E(d_x, d_y)}{\partial d_x}\right|_{(\tilde d_x(n),\, \tilde d_y(n))} \tag{2.24}$$

$$\tilde d_y(n+1) = \tilde d_y(n) - s \cdot \left.\frac{\partial E(d_x, d_y)}{\partial d_y}\right|_{(\tilde d_x(n),\, \tilde d_y(n))} \tag{2.25}$$

where $s$ is an adjustable step size and $E(d_x, d_y)$ is the error function as in equation (2.19); it is a function of the variables $(d_x, d_y)$ for the current region. In recursive methods, $(d_x, d_y)$ can be estimated at an arbitrary location $(x, y)$ by using spatial interpolation techniques. A typical group of methods that estimate $(d_x, d_y)$ at each pixel are called pel-recursive algorithms (PRAs). Compared with BMAs, PRAs involve more computational complexity and less regularity, and are thus more difficult to realize in hardware.

2.2.2 Methods Based on Spatio-Temporal Constraint

This group of algorithms [46, 47] is developed from the spatio-temporal constraint equation (2.18). Assuming that $\partial f/\partial x$, $\partial f/\partial y$, and $\partial f/\partial t$ are given, equation (2.18) can be regarded as a linear equation in the two unknown variables $v_x$ and $v_y$. An over-determined set of linear equations can be obtained by calculating $\partial f/\partial x$, $\partial f/\partial y$, and $\partial f/\partial t$ at positions $(x_i, y_i, t_i)$ for $1 \le i \le n$, at which $v_x$ and $v_y$ are assumed constant:

$$v_x \cdot \left.\frac{\partial f(x,y,t)}{\partial x}\right|_{(x_i, y_i, t_i)} + v_y \cdot \left.\frac{\partial f(x,y,t)}{\partial y}\right|_{(x_i, y_i, t_i)} + \left.\frac{\partial f(x,y,t)}{\partial t}\right|_{(x_i, y_i, t_i)} \approx 0, \qquad 1 \le i \le n \tag{2.26}$$

The error function is

$$\mathrm{Error} = \sum_{i=1}^{N} \left[ v_x \cdot \left.\frac{\partial f(x,y,t)}{\partial x}\right|_{(x_i, y_i, t_i)} + v_y \cdot \left.\frac{\partial f(x,y,t)}{\partial y}\right|_{(x_i, y_i, t_i)} + \left.\frac{\partial f(x,y,t)}{\partial t}\right|_{(x_i, y_i, t_i)} \right]^2 \tag{2.27}$$

To generalize it to a local spatio-temporal region $\Re$, the error function becomes

$$\mathrm{Error} = \iiint_{(x,y,t)\in\Re} \left( v_x \cdot \frac{\partial f(x,y,t)}{\partial x} + v_y \cdot \frac{\partial f(x,y,t)}{\partial y} + \frac{\partial f(x,y,t)}{\partial t} \right)^2 dx\, dy\, dt \tag{2.28}$$

Since the error function (2.28) is a quadratic form in the two unknown variables, minimizing it amounts to solving two linear equations:

$$\begin{cases} v_x \displaystyle\iiint_{\Re} \left(\frac{\partial f}{\partial x}\right)^{2} dx\,dy\,dt \;+\; v_y \iiint_{\Re} \frac{\partial f}{\partial y}\,\frac{\partial f}{\partial x}\, dx\,dy\,dt \;=\; -\iiint_{\Re} \frac{\partial f}{\partial t}\,\frac{\partial f}{\partial x}\, dx\,dy\,dt \\[1.5ex] v_x \displaystyle\iiint_{\Re} \frac{\partial f}{\partial x}\,\frac{\partial f}{\partial y}\, dx\,dy\,dt \;+\; v_y \iiint_{\Re} \left(\frac{\partial f}{\partial y}\right)^{2} dx\,dy\,dt \;=\; -\iiint_{\Re} \frac{\partial f}{\partial t}\,\frac{\partial f}{\partial y}\, dx\,dy\,dt \end{cases} \tag{2.29}$$
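In discrete form, (2.29) is just a 2×2 linear system accumulated over the region. The minimal MATLAB sketch below uses synthetic stand-in frames, simple forward differences as one of many possible derivative approximations, and recovers the velocity only approximately, since the underlying linearization assumes small displacements.

% Minimal sketch: least-squares solution of eq. (2.29) over one region.
f0 = conv2(rand(64), ones(5)/25, 'same');   % smoothed random stand-in frame
f1 = circshift(f0, [1, 2]);                 % f0 moved by (vx, vy) = (2, 1)
fx = f0(:, [2:end, end]) - f0;              % forward difference, ~ df/dx
fy = f0([2:end, end], :) - f0;              % forward difference, ~ df/dy
ft = f1 - f0;                               % temporal difference, ~ df/dt
r = 16:48;                                  % interior region, away from borders
gxx = sum(sum(fx(r,r).^2));   gxy = sum(sum(fx(r,r).*fy(r,r)));
gyy = sum(sum(fy(r,r).^2));
bx = -sum(sum(fx(r,r).*ft(r,r)));
by = -sum(sum(fy(r,r).*ft(r,r)));
v = [gxx gxy; gxy gyy] \ [bx; by]           % rough estimate of [vx; vy]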

Notice that there may be more than one solution to the equations in (2.29). In the case when f(x,y,t) is constant within the region $\Re$, the partial derivatives $\partial f/\partial x$, $\partial f/\partial y$, and $\partial f/\partial t$ are zero; $v_x$ and $v_y$ can then take any values, and the true motion cannot be estimated. In the case when f(x,y,t) is a perfect edge, motion parallel to the edge does not affect the value of f(x,y,t) and cannot be estimated. To solve the linear equations in (2.29), the partial derivatives $\partial f/\partial x$, $\partial f/\partial y$, and $\partial f/\partial t$ need to be calculated at arbitrary spatio-temporal locations, which makes the computation mathematically more expensive than block-matching methods.

2.3 Block-Matching for Motion Estimation and Compensation

2.3.1 Block-Matching and Motion Estimation

Block-based algorithms are the most popular methods for motion estimation and have been applied in most video applications. In block-matching, a frame is divided into an array of "macro-blocks" (MBs). Each macro-block has a size of 16x16 pixels and is then compared with candidate blocks in the reference frame. The candidate macro-block that results in the least cost is the one that matches the current block most closely. Typically, two measurements, the sum of absolute differences (SAD) and the sum of squared differences (SSD), are adopted to evaluate how closely a candidate macro-block matches the current one:

$$\mathrm{SAD} = \sum_{(i,j)\in \text{current block}} \left| C_{ij} - R_{(i+u,\, j+v)} \right| \qquad (2.30)$$

$$\mathrm{SSD} = \sum_{(i,j)\in \text{current block}} \left( C_{ij} - R_{(i+u,\, j+v)} \right)^2 \qquad (2.31)$$

where $C_{ij}$ and $R_{(i+u,\,j+v)}$ are the pixels being compared in the current macro-block and the macro-block in the reference frame, respectively. Usually only the luminance plane is used when computing SAD or SSD.

Some video compression standards limit the maximum number of bits used to encode each motion vector, thus restricting a motion vector's magnitude and the maximum values of its horizontal and vertical components. In such cases, the maximum distance between a macro-block and its candidate reference blocks is also limited. Usually, motion estimation is carried out only within a region of the reference frame called the "search area", which also reduces the amount of computation for motion estimation. The size of the search area is decided by the search parameter p, and the search range extends up to p pixels on all four sides of the corresponding macro-block in the reference frame. Figure 2.5 illustrates block-matching with search parameter p = P; the gray square is the search area. Usually, faster motions require a larger p value, and the larger the search parameter, the more computationally intensive the process of motion estimation becomes.
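A minimal sketch of the SAD and SSD costs in (2.30) and (2.31) is given below, assuming 8-bit numpy frames and a candidate displacement (dx, dy) that keeps the displaced block inside the reference frame; the helper names are ours and are reused by the later sketches.

```python
import numpy as np

def sad(cur_block, ref_frame, top, left, dx, dy):
    """SAD between the current block and the candidate displaced by (dx, dy)."""
    h, w = cur_block.shape
    cand = ref_frame[top + dy:top + dy + h, left + dx:left + dx + w]
    return int(np.sum(np.abs(cur_block.astype(np.int32) - cand.astype(np.int32))))

def ssd(cur_block, ref_frame, top, left, dx, dy):
    """SSD for the same candidate displacement."""
    h, w = cur_block.shape
    cand = ref_frame[top + dy:top + dy + h, left + dx:left + dx + w]
    d = cur_block.astype(np.int32) - cand.astype(np.int32)
    return int(np.sum(d * d))
```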

Figure 2.5 Block-matching with search parameter p = P. (C: current block; R: best-matching block in the reference frame; MV: motion vector; the gray (2P+1)x(2P+1) square is the search area.)

The most direct way to perform motion estimation is to exhaustively check every possible candidate macro-block within the search area of the reference frame and choose the best matching one. This method is called the full search block-matching algorithm (FSBMA). Figure 2.6 shows two different types of error surface resulting from full search. The ideal error surface of block-matching is a unimodal function with one global minimum, as shown in Figure 2.6(a). However, such a surface rarely exists in the real world. In most cases the error surface is a non-unimodal function with multiple local minima, as shown in Figure 2.6(b).

Figure 2.6 Examples of matching error surface: (a) unimodal error surface with a global minimum; (b) non-unimodal error surface with multiple local minima. (Vertical axis: sum of absolute differences (SAD); horizontal axes: displacements x and y.)

After block-matching, a motion vector (MV) is obtained for each macro-block. The motion vector is the displacement from the location of the current macro-block to the location of the best matching macro-block in the reference frame. Variable length coding (VLC) techniques [48, 49] are usually used to encode the MVs and generate bits for the video bit stream. MVs are used in motion compensation to construct the motion compensated images. The difference between the original macro-block and the best-matching block is the prediction error, which is usually encoded using the techniques employed for compressing still images.

Notice that the reference frame is not necessarily the frame displayed before the current frame. Sometimes multiple reference frames are used; for example, two reference frames may be used: one frame before the current frame and one frame after it in display order, but encoded previously. Block-matching is then performed on both reference frames, and the best matching block is the one that has the least error among the candidate blocks on both reference frames.

However, if a frame is decoded with error, all the frames that use it as a reference frame will also be decoded incorrectly; thus the error propagates. To avoid this problem, one kind of video frame, the "I" frame, is used. This type of frame does not use reference frames for encoding and is encoded by itself as a still image. When a frame is decoded incorrectly, the error propagation stops at the next I frame, and the frames after that I frame in the encoding order are not affected. Besides I frames, there are two other types of frames, "P frames" and "B frames." P frames use only a previously displayed frame as the reference frame. B frames use frames in both future and previous positions in the display order as reference frames. Figure 2.7 gives an example sequence of video frames.

Figure 2.7 An example sequence of MPEG frames and the inter-frame dependencies (bidirectional prediction). Frame display order: I1 B2 B3 P4 B5 B6 P7. Frame coding order: I1 P4 B2 B3 P7 B5 B6. Frame types: I (intra), P (predicted), B (bidirectional).

In modern video compression standards, motion estimation with non-integer accuracy is allowed. Using interpolation, the motion of a macro-block between the current frame and the reference frame can be estimated with an accuracy of 1/2 or 1/4 of a pixel.

2.3.2 Block-Based Motion Compensation

When decoding a video, motion compensation is carried out. The process uses the reference frames and the motion vectors to reconstruct each macro-block of the current frame. For motion vectors whose components have integer values, the predicted macro-block is a simple copy of the matching block in the reference frame. For motion vectors whose components have non-integer values, interpolation is used to estimate the macro-block at non-integer locations. After the prediction of each macro-block is obtained, the prediction of the whole frame is also obtained. The prediction error is then decoded and added to the frame, and the final motion compensated frame is reconstructed.

To evaluate the quality of a reconstructed image, a popular metric is the mean squared error (MSE) [50], which is the mean of the squared error between the motion compensated image and the original one, as given by

$$\mathrm{MSE} = \frac{1}{MN} \sum_{y=1}^{N} \sum_{x=1}^{M} \left( I(x,y) - I'(x,y) \right)^2 \qquad (2.32)$$

Here N and M are the numbers of rows and columns of pixels in the frame, respectively. I(x, y) and I'(x, y) are the intensity values of the pixel at position (x, y) in the original image and the motion compensated picture, respectively.

Another widely used metric for comparing various image compression techniques is the peak-signal-to-noise-ratio (PSNR), which evaluates the image quality based on the MSE of the reconstructed frame. The mathematical formula for PSNR is

$$\mathrm{PSNR} = 10 \log_{10} \left( \frac{(\text{peak-to-peak value of original data})^2}{\text{mean squared error}} \right) = 10 \log_{10} \left( \frac{I_{\max}^2}{\mathrm{MSE}} \right) \qquad (2.33)$$

where Imax is the maximum possible value of the pixels in the image. When 8-bit sample precision is used, the value of Imax is 255. The higher the value of PSNR, the better the quality of the compensated image.
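A short sketch of the MSE and PSNR computations of (2.32) and (2.33) for 8-bit images follows; the function names are ours, and identical images are reported as infinite PSNR.

```python
import numpy as np

def mse(original, compensated):
    """Mean squared error between two images of equal shape."""
    diff = original.astype(np.float64) - compensated.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(original, compensated, i_max=255.0):
    """Peak-signal-to-noise ratio in dB, per equation (2.33)."""
    m = mse(original, compensated)
    if m == 0:
        return float('inf')   # identical images
    return 10.0 * np.log10(i_max ** 2 / m)
```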

Figure 2.8 An illustration of motion compensation: the numbered 16x16 macro-blocks of the reference frame are displaced by their motion vectors to form the motion compensated frame.

Figure 2.8 illustrates the typical procedure of motion compensation. The computation requirement for motion compensation is much less than that of motion estimation. For each macro-block, motion estimation must calculate SAD or SSD on a number of 16x16-pixel blocks, whereas motion compensation just performs a simple copy or interpolation of the selected matching block. This difference is critical and makes video decoding a computationally much simpler process than video encoding. The ratio of the computation time taken by motion compensation to that of the whole decoding process can vary greatly with the content of the video sequence, the standard used for video compression, and the implementation of the decoder; the value usually ranges from 5% to 40%.

2.3.3 Block-Matching Algorithms

From the discussion in the previous sections, it is easy to see that if we reduce the number of locations that need to be checked for ME, the computation can be reduced greatly; for example, by checking only a small number of selected locations instead of every possible location within the search area. However, the computation speed is improved at the cost of the quality of the motion compensated image and the compression performance. With more available candidate blocks, it is more likely to find a 16x16-pixel region in the reference frame that matches the current macro-block better, thus reducing the prediction error. Therefore, on average, increasing the number of checked locations decreases the prediction error, which means that fewer bits are needed to encode it, or that with the same number of bits it can be encoded more precisely.

Most video compression standards define only the format of the final compressed video stream and the steps needed to decode it; the process of encoding is unspecified. Thus a variety of approaches can be employed in encoders for motion estimation, and how motion is estimated becomes the largest differentiator among various video encoding systems. The choice of motion estimation technique can significantly affect the computational performance and the quality of the final compressed video.

Some video compression standards, such as H.264 [17], allow each macro-block to be further split into sub-blocks with sizes of 4x4, 4x8, 8x8, or 8x16. In such cases, a separate motion vector is computed for each sub-block and the quality of the final compensated image is usually improved. However, this also incurs more computation for ME, since instead of one MV, several motion vectors are computed for each macro-block. Accordingly, more bits need to be transmitted for MVs. Therefore, there is a trade-off between the bit rate and motion estimation performance.

In the succeeding sections, we give a brief review of some well-known block-matching algorithms, along with a brief discussion of the computational complexity of some of them.

2.3.3.1 Full Search

The simplest motion estimation method is full search (FS), which checks every possible 16x16-pixel region within the search area and selects the best matching block with the least error. FSBMA produces the optimal block-matching and motion estimation results among the existing block-matching algorithms. However, it suffers from a very high computation requirement in its implementation. Let us use an example to illustrate the computational load of ME for full search. Consider a typical search area of 33x33 pixels, which contains 324 possible candidate macro-block positions (we compute on the luminance plane only). Calculating SAD for one candidate requires 256 subtractions, 256 absolute-value operations, and 255 additions. Thus the ME computation for a single macro-block using full search is (256+256+255)x324 = 248,508 arithmetic operations. For a video with a common frame rate of 30 and CIF resolution, the arithmetic operations for ME exceed 2.95 billion per second, which is a very high value. The high computational load of full search makes it a bottleneck in practical implementations of motion estimation. Instead, fast motion estimation algorithms were developed which select only a small number of candidate blocks for block-matching.
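A direct, unoptimized sketch of full search is shown below, reusing the hypothetical sad() helper from Section 2.3.1 and assuming all candidate displacements in [-p, p]^2 keep the block inside the reference frame.

```python
def full_search(cur_block, ref_frame, top, left, p=16):
    """Check every displacement (dx, dy) in [-p, p]^2 and return the best MV."""
    best_cost, best_mv = None, (0, 0)
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            cost = sad(cur_block, ref_frame, top, left, dx, dy)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost
```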

There are various fast block-matching algorithms [30-36, 51-53], among which some well-known ones are three-step search (TSS) [30], four-step search (4SS) [33], diamond search (DS) [34], and adaptive rood pattern search (ARPS) [35]. Here, we give a brief introduction to some of them.

2.3.3.2 Three-Step Search

Three-step search (TSS) [30] is a fast search algorithm which consists of three steps. Each step uses a fixed search pattern of nine uniformly spaced search points. In the first step, the step size is set to 4 and nine points are checked; the point giving the least cost is chosen as the new search center for the next step. The size of the search pattern is reduced by half after every step, so the search points get closer, and the algorithm halts after three steps. Hence, TSS requires a fixed (9+8+8) = 25 search points per block, which reduces computation by a factor of 10 or more compared with FS, depending on the value of p. Figure 2.9 depicts a general process of TSS.


Figure 2.9 Three-step search procedure for the motion vector (1,-7).
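A compact sketch of the TSS procedure just described is given below, again reusing the hypothetical sad() helper and omitting search-area bounds checking.

```python
def three_step_search(cur_block, ref_frame, top, left):
    """TSS: nine points per step, step size 4 -> 2 -> 1."""
    cx = cy = 0                                  # current search center
    for step in (4, 2, 1):
        best = (sad(cur_block, ref_frame, top, left, cx, cy), cx, cy)
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                cost = sad(cur_block, ref_frame, top, left, cx + dx, cy + dy)
                if cost < best[0]:
                    best = (cost, cx + dx, cy + dy)
        cx, cy = best[1], best[2]                # move center to least cost
    return cx, cy
```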

2.3.3.3 Four-Step Search

Four-step search (4SS) [33], as its name denotes, is carried out in four steps. In the first step, the search step size is set to 2 and 9 locations are checked. If the best-matching location is at the center of the search pattern, the search procedure jumps to the fourth step. Otherwise, the search center is moved to the least-cost location and the search is repeated for up to two more steps. In the last (fourth) step, the step size is reduced to 1 and the location with the least cost indicates the best matching macro-block. Figure 2.10 illustrates a general search procedure of 4SS.


Figure 2.10 Four-step search procedure for the motion vector (3,-5).

2.3.3.4 Diamond Search

The diamond search (DS) [34] algorithm proceeds in a similar way to 4SS except that it uses a diamond search pattern instead of a square one. Moreover, the number of steps the algorithm can take is not limited. Two different fixed-size patterns are utilized in DS, namely the large diamond search pattern (LDSP) and the small diamond search pattern (SDSP). The LDSP is utilized first, until the least cost is found at the center of the pattern; then the SDSP is employed in the last step. The DS algorithm requires fewer search points and less computation than the TSS method, and it has been incorporated into the MPEG-4 verification model [54]. An illustration of the DS search procedure is given in Figure 2.11.


Figure 2.11 Diamond search procedure for the motion vector (4,-2).
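The sketch below outlines the LDSP/SDSP iteration, reusing the hypothetical sad() helper; the termination rule (stop when the large-diamond center wins) follows the description above, and bounds checking is omitted.

```python
LDSP = [(2, 0), (-2, 0), (0, 2), (0, -2), (1, 1), (1, -1), (-1, 1), (-1, -1)]
SDSP = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def diamond_search(cur_block, ref_frame, top, left):
    cx = cy = 0
    cost_c = sad(cur_block, ref_frame, top, left, cx, cy)
    while True:                                   # LDSP until the center wins
        best = min((sad(cur_block, ref_frame, top, left, cx + dx, cy + dy), dx, dy)
                   for dx, dy in LDSP)
        if best[0] >= cost_c:
            break
        cost_c, cx, cy = best[0], cx + best[1], cy + best[2]
    best = min((sad(cur_block, ref_frame, top, left, cx + dx, cy + dy), dx, dy)
               for dx, dy in SDSP)                # final SDSP refinement
    if best[0] < cost_c:
        cx, cy = cx + best[1], cy + best[2]
    return cx, cy
```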

2.3.3.5 Adaptive Rood Pattern Search

Adaptive rood pattern search (ARPS) [35] adopts a rood-shaped search pattern whose arm size can be adaptively adjusted during the search procedure. The motion vector of the immediate left neighboring block is used to predict the motion vector of the current block. Two sequential search stages are carried out for block-matching. In the first search stage, the size of the rood is decided according to the predicted motion vector, and the rood search in this stage is performed only once. The least-cost point is found and made the center of the search in the next stage. In the second search stage, a unit-size rood pattern is exploited repeatedly until the point with minimum error is found at the center of the search pattern. The main advantage of ARPS over DS is its saving in computation by directly placing the search in an area where there is a high probability of finding a good matching block. According to [35], ARPS achieves a computational gain over DS by a factor ranging from 1.9 to 3.4 with little reduction in frame quality.


Figure 2.12 Adaptive rood pattern search: The predicted motion vector is (3, -1) and the rood arm length is 4.

2.4 A New Block-Matching Method Using Rood Pattern and Spatio-Temporal Correlation

2.4.1 Introduction

In the previous sections, we have reviewed a few popular block-matching

algorithms. Each of them achieves a different trade-off between algorithm complexity and block-matching performance. In this section, we describe a fast and efficient block-matching algorithm [55] for motion estimation (ME). The performance of motion estimation is improved by using an adaptive rood search pattern which utilizes the spatio-temporal correlation between the macro-blocks (MBs) for motion prediction. The block-matching is carried out in two stages. In the first stage, the initial search stage, a new scheme is proposed that utilizes the temporal and spatial motion correlation between the blocks to choose the candidate neighbouring block for motion prediction. In addition, based on the available motion vectors (MVs) of the neighbouring macro-blocks, a more flexible strategy is devised which adjusts the lengths of the horizontal and vertical rood arms adaptively and separately. After the initial search, a fixed-size rood pattern is employed repeatedly until the best matching block is found. Experimental results show that the proposed method outperforms the well-known diamond search (DS) and adaptive rood pattern search (ARPS) algorithms in terms of search speed and the peak-signal-to-noise-ratio (PSNR) of the motion compensated images. For simplicity, in the remaining parts of the thesis, the algorithm developed in this chapter is denoted ARS-ST (adaptive rood search with spatio-temporal correlation).

2.4.2 Candidate Neighbouring Block Selection and MV Prediction

To fully exploit inter-block spatio-temporal correlation for motion estimation, a more flexible and efficient solution to suboptimal block-based inter-frame motion estimation is developed here. Moreover, simplicity and feasibility are also major issues we need to consider for algorithm development. From testing a variety of video sequences, we observed the following facts. For slow motions between video frames, a small-size search pattern is more effective for motion estimation than a large one. This is because in this case the magnitudes of the MV components are small and only the small region around the search window center needs to be checked; a large search pattern incurs unnecessary computation and is more likely to be trapped in local minima along the search path. On the other hand, for fast motions, a large search pattern performs better than a small one at finding MVs with large magnitudes. In conclusion, a search strategy which can adaptively adjust the search pattern size according to the scenario can be expected to have good block-matching performance.

Figure 2.13 The current block i and its neighboring blocks: (a) the previously processed blocks i-1, i-2, ..., i-M in raster-scan order; (b) the spatial support region formed by blocks (m, n-1), (m-1, n-1), (m-1, n), and (m-1, n+1); (c) the same neighbors denoted iB, iC, iD, and iE.

Temporal and spatial motion correlations between the blocks are used in our work for MV prediction of the current block. In the spatial domain, because the blocks in a frame are scanned and processed in raster order, the available neighbouring blocks for the current block are the blocks to its left and the blocks on the rows above the current row, as shown in Figure 2.13(a). A linear predictor [56] is used here, and the MV of the current block is predicted from the MV measurements of the previous blocks. The predicted motion vector is calculated by

$$\tilde{\vec{v}} = \sum_{i=1}^{M} a_i \vec{v}_{n-i} \qquad (2.34)$$

Here $\tilde{\vec{v}}$ is the estimate of the current block motion $\vec{v}$, and $\vec{v}_{n-i}$ is the motion vector obtained by block-matching at the i-th previous index in the raster scan order. The $a_i$, $i = 1, 2, \ldots, M$, are the coefficients of the linear predictor of order M. The prediction error is

$$e_v = \left| \vec{v} - \tilde{\vec{v}} \right| \qquad (2.35)$$

Here, $\vec{v}$ is the motion vector obtained by the block-matching algorithm, which is assumed to be the real motion of the block. The optimal values of the coefficients $a_i$ can be obtained by solving the Yule-Walker equations [57]:

$$\sum_{i=1}^{M} a_{i,\mathrm{opt}}\, R(j-i) = R(j) \qquad (2.36)$$

where R(j) denotes the autocorrelation function of the motion vectors. In this case, $\vec{v}_i$ is a vector with two components $(v_{i,x}, v_{i,y})$. Since in most cases the characteristics of the two components are independent, they are computed separately. For example, for the x component, $R(j) = \sum_{m=1}^{N} v_{m,x} \cdot v_{m+j,x}$. By solving equation (2.36), the resulting coefficients $a_{i,\mathrm{opt}}$ are obtained and used for calculating the x component of $\tilde{\vec{v}}$; the y component of $\tilde{\vec{v}}$ is obtained in the same way, giving the whole predicted vector.

From our tests, we also observed that adjacent blocks usually have the highest motion correlation. This is because they have the highest probability of belonging to the same object and thus having the same or similar motion. In order to reduce computation, in our work only the blocks adjacent to the current block are selected for motion prediction. Since the blocks in a frame are scanned and processed in raster order in the spatial domain, the MVs of blocks whose indices come after the current block in the raster scan order are not yet available and cannot be used for MV prediction. The available neighboring blocks for the current block are the immediate left, immediate top, top-left corner, and top-right corner blocks. The support region for the prediction is indicated by the gray areas in Figure 2.13(b). Thus equation (2.34) becomes

$$\tilde{\vec{v}} = a_{m,n-1}\vec{v}_{m,n-1} + a_{m-1,n-1}\vec{v}_{m-1,n-1} + a_{m-1,n}\vec{v}_{m-1,n} + a_{m-1,n+1}\vec{v}_{m-1,n+1} \qquad (2.37)$$

If we use iB, iC, iD, and iE, as shown in Figure 2.13(c), to denote the corresponding blocks in Figure 2.13(b), equation (2.37) can be written in one-dimensional form as

$$\tilde{\vec{v}} = a_B \vec{v}_B + a_C \vec{v}_C + a_D \vec{v}_D + a_E \vec{v}_E \qquad (2.38)$$

To obtain the optimal values of the coefficients $a_B$, $a_C$, $a_D$, and $a_E$, we would need to solve the Yule-Walker equations in (2.36), which involves O(M^3) mathematical operations with classical algorithms or O(M^2) operations with Levinson's algorithm [58]. The computation is very intensive and complex considering the number of blocks in each frame, and thus not feasible for real applications. In our work, an alternative simple and effective method is adopted: instead of solving the Yule-Walker equations, we select the most promising motion vector from the candidate motion vectors $\vec{v}_B$, $\vec{v}_C$, $\vec{v}_D$, and $\vec{v}_E$, and use it as the prediction for the MV of the current block.

Figure 2.14 The current block "A" and its neighbouring blocks "B", "C", "D", and "E" in the spatial domain; "a", "b", "c", "d", and "e" are the corresponding blocks in the reference frame.

Now the problem becomes how to select the most promising motion vector, i.e., the closest description of the motion of the current block. Temporal and spatial motion correlations between the blocks are used in our work for choosing the best candidate MV. In Figure 2.14, the current block is marked "A" and its neighbouring blocks are marked "B", "C", "D", and "E", respectively. Hence the problem is to choose, from the four candidate blocks, the one that gives the best MV prediction for the current block. It is easy to see that the block which has the highest motion correlation with the current block is the best one for MV prediction. In video processing we assume that the motion between frames is continuous, and temporal inter-frame correlation is utilized in our work to solve the problem. From our observations and test results, for adjacent blocks in the current frame that have high motion correlation, their corresponding blocks in the reference image usually also have high motion correlation, and vice versa. This is due to motion continuity and the temporal correlation between frames. For block "A" and its neighbouring blocks "B", "C", "D", and "E" in the current frame, the corresponding blocks are marked "a", "b", "c", "d", and "e" in the reference frame, respectively, as shown in Figure 2.14. Thus we can use the corresponding blocks in the reference frame to estimate the motion correlation between the current block "A" and its neighbouring blocks in the current frame. In our work, the Euclidean distance between the MVs of two blocks is selected as a measure of the motion correlation between them. The Euclidean distance between the MVs of two blocks 1 and 2 is defined as

$$d_{12} = \sqrt{(MV_{1x} - MV_{2x})^2 + (MV_{1y} - MV_{2y})^2} \qquad (2.39)$$

where $MV_{1x}$, $MV_{2x}$, $MV_{1y}$, and $MV_{2y}$ are the horizontal and vertical components of the MVs of blocks 1 and 2, respectively. The measured value $d_{12}$ is used to evaluate the motion correlation between the two blocks. In our work, the block whose corresponding block in the reference frame has the least Euclidean MV distance to block "a" is selected as the best candidate block, and its MV is used as the predicted MV for the current block "A" in the present frame. The procedure for predicting the MV of the current block can be summarized by the following pseudo code:

Step 1: Calculate dab, dac, dad, and dae using equation (2.39).
Step 2: if dab = Min{dab, dac, dad, dae}, then aB = 1, aC = aD = aE = 0
        elseif dac = Min{dab, dac, dad, dae}, then aC = 1, aB = aD = aE = 0
        elseif dad = Min{dab, dac, dad, dae}, then aD = 1, aB = aC = aE = 0
        else aE = 1, aB = aC = aD = 0
        end
        $MV_{predicted} = a_B \vec{v}_B + a_C \vec{v}_C + a_D \vec{v}_D + a_E \vec{v}_E$

where dab, dac, dad, and dae are the Euclidean distances between block "a" and its neighbouring blocks "b", "c", "d", and "e", respectively. Because the Euclidean distance calculation involves square and square-root operations, to reduce the computational complexity, a new variable defined in (2.40) is used instead in our work.

$$\tilde{d}_{12} = \left| MV_{1x} - MV_{2x} \right| + \left| MV_{1y} - MV_{2y} \right| \qquad (2.40)$$

Here $\tilde{d}_{12}$ is the sum of the absolute differences between the components of the MVs of blocks 1 and 2, and it is used in this research for correlation comparison. The computation of $\tilde{d}_{12}$ requires only two absolute-value operations and three plus/minus operations.

To further reduce the computation and simplify the algorithm, fewer candidate blocks are preferred for the correlation comparison. From tests on various video sequences, among the four candidate blocks "B", "C", "D", and "E", blocks "B" and "D" usually have higher motion correlation with block "A" than the other two and are more likely to be selected as the best candidate neighbouring blocks for MV prediction. This is probably because horizontal and vertical motions are the most common motions in video frames. Thus the algorithm can be further simplified as

if (|MVax - MVbx| + |MVay - MVby|) <= (|MVax - MVdx| + |MVay - MVdy|), then
    MVpredicted = [MVBx, MVBy]^T
else
    MVpredicted = [MVDx, MVDy]^T
end

where MVax, MVay, MVbx, MVby, MVdx, and MVdy are the horizontal and vertical components of the MVs of blocks "a", "b", and "d", respectively. Notice that for the first row and the first column of macro-blocks in the frame, since blocks "B" and/or "D" are not available, the algorithm above is not applicable. For the macro-block located at the top-left corner of the frame, since it is the first block in the raster scan order for the current frame, no MV prediction is made for it.

For each block on the first row, the MV is predicted from its immediate left neighbouring block "B", as defined in equation (2.41):

$$MV_{predicted} = [MV_{Bx}, MV_{By}]^T \qquad (2.41)$$

For each block on the first column, the MV is estimated from its immediate top neighbouring block "D", as defined in equation (2.42):

$$MV_{predicted} = [MV_{Dx}, MV_{Dy}]^T \qquad (2.42)$$

The prediction of the MV is illustrated in Figure 2.15: the MVs of blocks "b" and "d" in the previous frame are taken as input; $\tilde{d}_{ab}$ and $\tilde{d}_{ad}$ are calculated; the coefficient ($a_B$ or $a_D$) corresponding to the minimum distance is set to 1 and the other to 0; and the predicted MV $\tilde{\vec{v}} = a_B \vec{v}_B + a_D \vec{v}_D$ is output.

Figure 2.15 Block diagram for motion vector prediction.
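A small sketch of this simplified predictor follows; mv_b and mv_d are assumed to be the MVs of the current frame's left and top neighbours, and mv_a_ref, mv_b_ref, and mv_d_ref the MVs of the co-located blocks "a", "b", and "d" in the reference frame (all (dx, dy) tuples; the names are ours).

```python
def predict_mv(mv_b, mv_d, mv_a_ref, mv_b_ref, mv_d_ref):
    """Pick the neighbour whose reference-frame block moves most like block a."""
    d_ab = abs(mv_a_ref[0] - mv_b_ref[0]) + abs(mv_a_ref[1] - mv_b_ref[1])
    d_ad = abs(mv_a_ref[0] - mv_d_ref[0]) + abs(mv_a_ref[1] - mv_d_ref[1])
    return mv_b if d_ab <= d_ad else mv_d   # per equations (2.40)-(2.42)
```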

2.4.3 Search Pattern and Strategy

From the discussion at the beginning of this section, we can see that to decide the size of the search pattern, we need an estimate of the magnitude of the MV of the current block before block-matching. Since adjacent blocks belonging to the same object in the frame usually have similar motions, the size of the pattern can be decided by exploiting the spatial inter-block correlation within the current frame. In our work, the block-matching is carried out in two stages, and a different search pattern is adopted in each stage. A simple and efficient rood pattern is proposed for the first (initial search) stage, and the lengths of the horizontal and vertical rood arms are determined adaptively and separately. This search pattern aims to place the search origin for the following steps close to the global minimum of the error surface, thus reducing unnecessary intermediate search and the risk of being trapped in local minima. In the second stage, a fixed-size diamond search is adopted and utilized repeatedly until the least cost is found at the center of the search pattern.

A. Adaptive rood pattern initial search with predicted MV

A rood-shaped pattern similar to the one used in ARPS [35] is selected for block-matching in our work. This is based on the observation that the MV distribution in the horizontal and vertical directions is usually higher than in other directions. The search points of the rood pattern are depicted in Figure 2.16 by the circle marks. The rood has four arms between the center point and four vertex points: two horizontal arms and two vertical arms. In ARPS [35], the rood is symmetric and the lengths of the four arms are equal. To make the search pattern more flexible, in our work the horizontal and vertical arms are adjusted separately, as defined in equations (2.43) and (2.44):

$$R_h = \max\{\left|MV_{Bx}\right|, \left|MV_{Dx}\right|\} \qquad (2.43)$$

$$R_v = \max\{\left|MV_{By}\right|, \left|MV_{Dy}\right|\} \qquad (2.44)$$

Here, Rh is the length of the two horizontal arms and Rv is the length of the two vertical arms. MVBx, MVDx, MVBy, and MVDy are the horizontal and vertical components of the MVs of blocks "B" and "D", respectively. Since blocks "B" and/or "D" are not available for the blocks on the first row or first column, those blocks are processed in the following way. For the first block on the first row, a fixed size of 2 pixels is used for both Rh and Rv. For the other blocks on the first row, Rh and Rv have the same value and are decided from the MV components of their immediate left neighbouring blocks using

$$R_h = R_v = \max\{\left|MV_{Bx}\right|, \left|MV_{By}\right|\} \qquad (2.45)$$

For the blocks on the first column, Rh and Rv have the same value and are decided from the MV components of their immediate top neighbouring blocks using

$$R_h = R_v = \max\{\left|MV_{Dx}\right|, \left|MV_{Dy}\right|\} \qquad (2.46)$$

Besides the five points at the center and vertexes of the rood, the point indicated by the predicted MV also needs to be examined, since it has a high probability of being close to the real MV. As illustrated in Figure 2.16, a total of six points need to be checked during the initial search stage. The one giving the least cost is set as the center for the search in the next stage.

B. Fixed-size rood pattern local search

A small diamond search pattern (SDSP) [34] is chosen in our method for the local search. The four arms of the rood are equal, with unit size, as depicted in Figure 2.16 by the triangle marks. This fixed pattern is employed repetitively in an unrestricted way. After each search, the point with the least matching error is made the center of the rood for the next search. If the least error is found at the center of the rood, the search stops and the MV is given by the location of the least-error point.


Figure 2.16 New adaptive rood pattern search: the predicted motion vector is (-2, -1) and the rood arm lengths are Rh = 3 and Rv = 2.

2.4.4 Summary of ARS-ST Method

For each MB, the proposed ARS-ST method can be summarized in the following sequential steps:

Step 1: Select the best candidate block using the temporal and spatial correlation and predict the MV for the current macro-block.

Step 2: Calculate the sizes of the rood arms Rh and Rv and perform the initial search. Move the search center to the least-error location.

Step 3: Perform the local search using the fixed-size rood pattern. After each search, move the search center to the least-error point and repeat this step. The search procedure stops when the minimum-error point is found at the center of the rood, which indicates the MV of the current block.
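A condensed sketch of the two-stage search follows, combining the arm lengths of (2.43)-(2.44), the six-point initial search, and the unit-rood refinement; it reuses the hypothetical sad() helper and a precomputed predicted MV, and omits boundary handling.

```python
def ars_st_search(cur_block, ref_frame, top, left, mv_pred, mv_b, mv_d):
    rh = max(abs(mv_b[0]), abs(mv_d[0]))         # horizontal arms, eq. (2.43)
    rv = max(abs(mv_b[1]), abs(mv_d[1]))         # vertical arms, eq. (2.44)
    # Stage 1: adaptive rood (5 points) plus the predicted-MV point.
    candidates = [(0, 0), (rh, 0), (-rh, 0), (0, rv), (0, -rv), tuple(mv_pred)]
    cx, cy = min(candidates,
                 key=lambda mv: sad(cur_block, ref_frame, top, left, mv[0], mv[1]))
    # Stage 2: repeat the unit-size rood (SDSP) until the center is best.
    while True:
        pts = [(cx, cy), (cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]
        bx, by = min(pts,
                     key=lambda mv: sad(cur_block, ref_frame, top, left, mv[0], mv[1]))
        if (bx, by) == (cx, cy):
            return cx, cy
        cx, cy = bx, by
```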

2.4.5 Experimental Results

In the experiments, we compare the search speed and the PSNR performance of the proposed method against two well-known fast BMAs, namely DS and ARPS. A set of standard image sequences is used for performance evaluation. The video sequences and their formats are listed in Table 2.1 and Table 2.2; sample frames of the test sequences can be found in Appendix B. Only the Y component of each sequence is processed to demonstrate the effectiveness of the proposed method. SAD, defined in equation (2.30), is selected as the cost function for block-matching, and the peak-signal-to-noise-ratio (PSNR) measure defined in (2.47) is chosen to evaluate the compensated image quality.

$$\mathrm{PSNR} = 10 \log_{10} \left[ \frac{255^2}{\mathrm{MSE}} \right] \qquad (2.47)$$

In the experiments, we assume all the frames can be exactly reconstructed from the compensated and residual images, i.e., frame compression is lossless. Video frames are coded using an "IPPPP..." structure, and each P frame uses its immediately previous frame as the reference image. Motion vectors and the respective motion compensation are calculated with pixel accuracy. The search parameter p is set to p = 16.

The computation performance of block-matching is evaluated by the number of search locations per macro-block. Four algorithms, FS, DS, ARPS, and ARS-ST, are compared in the experimental test, and the average search points over the first 50 frames of each video sequence are listed in Table 2.1. During block-matching, for blocks on the boundaries of a frame, the points within the search window but outside the frame area are not checked; thus the number of search points for FS is smaller than the theoretical value of (2p+1)^2 = (2x16+1)^2 = 1089. From Table 2.1 we can see that the proposed algorithm consistently requires the fewest search points, and thus has the fastest search speed among the compared algorithms.

Table 2.1 Average number of search points for each macro-block

Video                    FS        DS       ARPS     ARS-ST
Carphone (CIF)           984.92    15.2095  8.0919   7.4075
Mobile (CIF)             984.92    13.5085  7.2651   6.123
Foreman (CIF)            984.92    15.7467  8.3007   7.4947
Bus (CIF)                984.92    21.0612  10.3185  8.937
Mother&daughter (CIF)    984.92    13.4134  6.3099   6.0849
Akiyo (CIF)              984.92    12.2686  5.049    4.872
Flower Garden (CIF)      984.92    15.7711  8.4215   7.2365
News (CIF)               984.92    12.523   5.3679   5.1781
Tennis (QCIF)            886.0101  13.4697  7.1794   6.7521
Football (SIF)           973.703   17.657   10.1951  9.6081

Table 2.2 Computational gain of ARS-ST over FS, DS, and ARPS

Video                    ARS-ST to FS   ARS-ST to DS   ARS-ST to ARPS
Carphone (CIF)           132.96         2.05           1.09
Mobile (CIF)             160.86         2.21           1.19
Foreman (CIF)            131.42         2.10           1.11
Bus (CIF)                110.21         2.36           1.15
Mother&daughter (CIF)    161.86         2.20           1.04
Akiyo (CIF)              202.16         2.52           1.04
Flower Garden (CIF)      136.10         2.18           1.16
News (CIF)               190.21         2.42           1.04
Tennis (QCIF)            131.22         3.0            1.06
Football (SIF)           101.34         1.84           1.06

The computation gains of the proposed algorithm over FS, DS, and ARPS are listed in Table 2.2. As can be deduced from Table 2.2, the proposed algorithm has the fastest search speed for block-matching. The average speed-up of ARS-ST over DS is about 2.2 with a maximum of 3, and the average speed-up of ARS-ST over ARPS is about 9.4% with a maximum of 19%, which is very impressive considering that both DS and ARPS are already very efficient search methods. From Table 2.2 we can also see that the developed method is more effective for video sequences containing fast motions, such as "Mobile" and "Flower Garden", than for sequences containing slow motions, such as "Akiyo" and "News".

The quality of the compensated images is evaluated in terms of PSNR for the first 50 P frames of the test video sequences, and the results are listed in Table 2.3. As indicated in Table 2.3, the proposed method performs very close to DS and ARPS when the sequence contains only small motions, such as "Akiyo", and performs better than the other two fast algorithms when the sequence has large motions, such as "Bus" and "Football". For the overall average PSNR performance of all the test video sequences, the proposed method has a 0.23 dB improvement over DS and is 0.18 dB better than ARPS.

Table 2.3 Average PSNR (dB) performance

Video                    FS        DS        ARPS      ARS-ST
Carphone (CIF)           34.0105   33.5552   33.5028   33.5783
Mobile (CIF)             24.0055   23.8275   23.869    23.8592
Foreman (CIF)            32.627    32.3275   32.383    32.4018
Bus (CIF)                24.0055   21.2803   22.3718   23.7071
Mother&daughter (CIF)    41.6916   41.6457   41.6072   41.6009
Akiyo (CIF)              43.0544   43.0379   43.0124   43.0151
Flower Garden (CIF)      26.3678   26.3233   26.3321   26.3311
News (CIF)               38.5737   38.4463   38.383    38.3943
Tennis (QCIF)            31.0136   29.7339   29.1686   29.4686
Football (SIF)           22.8253   21.7538   21.7716   21.868

Figure 2.17 and Figure 2.18 show the PSNR plots of the first 50 P frames of the video sequences "Foreman" and "Bus" with search parameter p = 16. As shown in the figures, for the sequence "Foreman", ARS-ST performs close to DS and ARPS, while for the sequence "Bus" the algorithm developed herein has better PSNR performance than DS and ARPS. The PSNR plots in Figure 2.17 and Figure 2.18 confirm the results listed in Table 2.2 and Table 2.3.

Figure 2.17 PSNR performance of "Foreman" (CIF) over the first 50 P frames for full search, diamond search, adaptive rood pattern search, and the proposed method.

Figure 2.18 PSNR performance of "Bus" (CIF) over the first 50 P frames for full search, diamond search, adaptive rood pattern search, and the proposed method.

CHAPTER 3: BLOCK BASED MOTION ESTIMATION USING ADAPTIVE KALMAN FILTERING

3.1 Introduction to the Kalman Filter

3.1.1 Background Introduction

In 1960, Rudolf E. Kalman published his work [59] describing a recursive method for solving the discrete linear filtering problem; the filter is named after him and called the Kalman filter. Since then, aided by the fast development of digital techniques, a great deal of study and research has been done on the Kalman filter. The original formulation is now called the simple Kalman filter. The Kalman-Schmidt filter, or extended Kalman filter [60, 61], was later developed by Stanley Schmidt, who applied the linear Kalman filter to the problem of trajectory estimation. The information filter and various types of square-root filters were also developed later [62, 63]. Nowadays, the Kalman filter is applied in many fields and applications, such as satellite navigation systems, control systems, localization/mapping, and speech enhancement. A typical example of the use of the Kalman filter is the phase-locked loop, which is indispensable in much telecommunications equipment.

In Kalman filtering, a set of mathematical equations is computed recursively to estimate a process's state by minimizing the mean squared error. Thus, the Kalman filter can be regarded as an optimal recursive realization of the least-squares method. Kalman filtering is an efficient and robust method for estimating a signal; it can estimate states in the past, the present, and even the future. For modeled systems whose exact characteristics are unknown, the filter often still works quite well. In this section, we provide a brief introduction to the discrete Kalman filter and a brief derivation of its mathematical form.

3.1.2 Signal Estimation and Discrete Wiener Filter

Signal estimation

Estimating the value of an unknown parameter from a group of observations or measurements is required in many applications. Due to the existence of noise, such as additive noise, multiplicative noise, sensor imperfections, and model inaccuracies, the measurement data is usually noise-corrupted. The problem of signal estimation can be defined with:

x[n]: the measurement at sample time index n;
x = (x[0], x[1], x[2], ..., x[N-1])^T: the vector of N measurements;
p(x; θ): the mathematical form of the model, characterized by the parameter θ.

The problem of signal estimation is to construct a function/filter of the vector x which produces an estimate of θ:

$$\hat{\theta} = g\left( x = (x[0], x[1], x[2], \ldots, x[N-1])^T \right) \qquad (3.1)$$

where $\hat{\theta}$ is the estimate of θ and g(x) is called the estimation function.

A natural optimality criterion for the estimator g(x) is to minimize the mean squared error (MSE), given by

$$\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = E\{[(\hat{\theta} - E(\hat{\theta})) + (E(\hat{\theta}) - \theta)]^2\} = \mathrm{var}(\hat{\theta}) + [E(\hat{\theta}) - \theta]^2 \qquad (3.2)$$

The problem of signal estimation is to design the estimator/filter with the least value of $\mathrm{MSE}(\hat{\theta})$.

Figure 3.1 Block diagram for signal estimation: the input x(t) = s(t) + n(t) passes through a linear filter to produce y(t) = ŝ(t+α); the error is e(t) = s(t+α) - y(t).

The block diagram for signal estimation is illustrated in Figure 3.1. In Figure 3.1, x(t) is the input of the linear filter. s(t) and n(t) are the signal and noise, respectively. The output of the filter is the estimation of the input signal s(t):

$$y(t) = \hat{s}(t + \alpha) \qquad (3.3)$$

The estimation error is

$$e(t) = s(t + \alpha) - y(t) \qquad (3.4)$$

The design of the linear filter is to select the filter h(t)/H(ω) which minimizes the mean squared error of estimation. The functions of the linear filter can be classified according to the delay α; for α > 0 (prediction), the ideal output of the linear filter, based on the current and previous signal input, is the actual signal at a time point after the current time. Such signal estimation is called prediction.

Discrete Wiener filter

The conditions and requirements for Wiener filtering [64] are:

(1) The input x(t) is the sum of the signal s(t) and additive noise n(t). It is assumed that s(t) and n(t) are jointly wide-sense stationary (WSS) with known auto-correlation and cross-correlation functions (or the corresponding spectral density functions).

(2) The filter is time invariant, with transfer function h(t) / H(ω).

(3) The output is a wide-sense stationary random process, which implies steady-state filtering. Theoretically, it can be assumed that the input signal x(t) is applied to the filter at time t = -∞; thus at any time t, the output y(t) is wide-sense stationary.

(4) The filter h(t)/H(ω) is selected to minimize the mean squared error of estimation.

The fundamental concept of the Wiener filter can be summarized as how to weight the past signal inputs to obtain the best estimate of the signal at the current time. Sampling the input signal x(t) = s(t) + n(t), the value of x at time $t_i$ is

$$x_i = x(t_i) = s(t_i) + n(t_i) = s_i + n_i, \quad i = 1, \ldots, N \qquad (3.5)$$

The output at time $t_N$ is $y_N$, the weighted linear combination of the input samples up to the current time:

$$y_N = k_1 x_1 + k_2 x_2 + \cdots + k_N x_N \qquad (3.6)$$

where $k_1, k_2, \ldots, k_N$ are the weights. The estimation error without time delay is

$$e_N = s_N - y_N = s_N - (k_1 x_1 + k_2 x_2 + \cdots + k_N x_N) \qquad (3.7)$$

The mean squared error is

$$\begin{aligned}
E[e_N^2] &= E\{[s_N - (k_1 x_1 + k_2 x_2 + \cdots + k_N x_N)]^2\} \\
&= E[s_N^2] + k_1^2 E[x_1^2] + k_2^2 E[x_2^2] + \cdots + k_N^2 E[x_N^2] \\
&\quad + 2k_1 k_2 E[x_1 x_2] + 2k_1 k_3 E[x_1 x_3] + \cdots + 2k_1 k_N E[x_1 x_N] \\
&\quad + 2k_2 k_3 E[x_2 x_3] + 2k_2 k_4 E[x_2 x_4] + \cdots + 2k_2 k_N E[x_2 x_N] \\
&\quad + \cdots + 2k_{N-1} k_N E[x_{N-1} x_N] \\
&\quad - \{2k_1 E[x_1 s_N] + 2k_2 E[x_2 s_N] + \cdots + 2k_N E[x_N s_N]\}
\end{aligned} \qquad (3.8)$$

The aim of the Wiener filter is to obtain the weight sequence $k_1, k_2, \ldots, k_N$ which minimizes $E[e_N^2]$. Setting the partial derivatives of $E[e_N^2]$ with respect to $k_1, k_2, \ldots, k_N$ equal to zero yields a group of N joint equations in $k_1, k_2, \ldots, k_N$, as shown in equation (3.9). We assume that both the auto-correlation and cross-correlation functions of the signal and noise are known; then the expectations in equation (3.9) are also known, and the system in (3.9) can be solved to obtain the values of $k_1, k_2, \ldots, k_N$.

$$\begin{bmatrix} E[x_1^2] & E[x_1 x_2] & \cdots & E[x_1 x_N] \\ E[x_2 x_1] & E[x_2^2] & \cdots & E[x_2 x_N] \\ \vdots & \vdots & \ddots & \vdots \\ E[x_N x_1] & E[x_N x_2] & \cdots & E[x_N^2] \end{bmatrix} \begin{bmatrix} k_1 \\ k_2 \\ \vdots \\ k_N \end{bmatrix} = \begin{bmatrix} E[x_1 s_N] \\ E[x_2 s_N] \\ \vdots \\ E[x_N s_N] \end{bmatrix} \qquad (3.9)$$
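As a numerical illustration only, the sketch below estimates the correlations in (3.9) from sample realizations and solves the resulting linear system; the function name and interface are ours.

```python
import numpy as np

def wiener_weights(x_samples, s_samples):
    """x_samples: (num_trials, N) noisy input vectors; s_samples: (num_trials,) s_N."""
    n = len(x_samples)
    Rxx = x_samples.T @ x_samples / n   # sample estimates of E[x_i x_j]
    rxs = x_samples.T @ s_samples / n   # sample estimates of E[x_i s_N]
    return np.linalg.solve(Rxx, rxs)    # weights k_1 ... k_N
```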

When the sample number N is small, the Wiener filter is an effective filtering method for non-real-time applications. However, for real-time applications, as the number of samples N increases, the computation increases considerably. For example, when N -> N+1, the covariance matrix grows from NxN to (N+1)x(N+1), meaning N+1 joint equations need to be solved. At the same time, both old and new input data are equally important and need to be stored. Thus the data cannot be processed in real time. In addition, the Wiener filter is difficult to apply to multiple-variable estimation.

3.1.3 Discrete Kalman Filter

The Wiener filter design obtains the filter's impulse response which minimizes the mean squared error of estimation; in other words, how to weight the past input in the best way to generate the current output, the weights being the filter's impulse response. The Kalman filter is a breakthrough with respect to the Wiener filter: it uses a state-space model in place of correlation functions and uses time-domain difference/differential equations to represent the filtering problem. The Kalman filter is a recursive algorithm which is suitable for real-time processing. Kalman filtering (KF) has two main features:

- It introduces the state-space model for stochastic processes. The model is represented in matrix form, thus allowing multiple variables to be estimated at the same time.

- It adopts a recursive method to process the measured data, which favors real-time processing.

In this subsection, we describe the filter in its original form, in which the measurements are taken at discrete time points and the state is estimated accordingly. The mathematical derivation can be found in various resources [59, 65].

System state model

Here, we give a brief description of the model of the process that the Kalman filter is supposed to deal with. We use $x = [x_1, x_2, \ldots, x_N]^T$ to denote the process state vector. The Kalman filter is used to estimate the state vector $x \in \Gamma^N$ of a process given by the linear equation

$$\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \\ \vdots \\ \dot{x}_N \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1M} \\ b_{21} & b_{22} & \cdots & b_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ b_{N1} & b_{N2} & \cdots & b_{NM} \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_M \end{bmatrix} \qquad (3.10)$$

The output function is

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_L \end{bmatrix} = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1N} \\ c_{21} & c_{22} & \cdots & c_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ c_{L1} & c_{L2} & \cdots & c_{LN} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \qquad (3.11)$$

The above equations can be written in compact form as

$$\dot{x} = A \cdot x + B \cdot u, \qquad y = C \cdot x \qquad (3.12)$$

where x is N x 1, u is M x 1, y is L x 1, A is N x N, B is N x M, and C is L x N. In the equations above, the state transition matrix is denoted by A. The state variable x at the previous time index k-1 is related to the state at the current time index k through the matrix A. The vector $u \in \Gamma^M$ is the optional input at time k-1; it is related to the state x by the N x M matrix B.

Figure 3.2 Block diagram for the Kalman filter state model: the driving noise u(t) passes through a shaping filter H(p) = S_y(p) to produce y(t).

In Figure 3.2, u is the driving noise vector with M components, each of which is random white noise. In this model, y can be regarded as the output when white noise is the input, and the state variable x can be considered an intermediate variable. Here, we use a simple one-dimensional state equation to derive the solution of the state function in the time domain:

$$\dot{x} = ax + bu \qquad (3.13)$$

$$\frac{dx(t)}{dt} = ax(t) + bu(t) \qquad (3.14)$$

Multiplying both sides of equation (3.14) by $e^{-at}$, we get

$$e^{-at}\frac{dx(t)}{dt} - ae^{-at}x(t) = be^{-at}u(t) \qquad (3.15)$$

$$\frac{d}{dt}\left[ e^{-at} x(t) \right] = be^{-at}u(t) \qquad (3.16)$$

Integrating both sides of the above equation gives

$$\left. e^{-at} x(t) \right|_{t_0}^{t} = \int_{t_0}^{t} be^{-a\tau} u(\tau)\, d\tau \qquad (3.17)$$

$$e^{-at} x(t) - e^{-at_0} x(t_0) = \int_{t_0}^{t} be^{-a\tau} u(\tau)\, d\tau \qquad (3.18)$$

$$x(t) = e^{a(t-t_0)} x(t_0) + e^{at}\int_{t_0}^{t} be^{-a\tau} u(\tau)\, d\tau = e^{a(t-t_0)} x(t_0) + \int_{t_0}^{t} be^{a(t-\tau)} u(\tau)\, d\tau \qquad (3.19)$$

Rewriting the result derived in (3.19) in matrix form, we get

$$\dot{x} = A \cdot x + B \cdot u \qquad (3.20)$$

and

$$x(t) = e^{A(t-t_0)} x(t_0) + \int_{t_0}^{t} e^{A(t-\tau)} B(\tau) u(\tau)\, d\tau \qquad (3.21)$$

Equations (3.20) and (3.21) are called the state transition equations. The matrix $e^{At}$, which has the same size as A, is called the state transition matrix and is denoted here by Φ(t). Thus we have

$$x(t) = \Phi(t - t_0)\, x(t_0) + \int_{t_0}^{t} \Phi(t - \tau) B(\tau) u(\tau)\, d\tau \qquad (3.22)$$

In discrete form, equation (3.12) can be rewritten as

$$\begin{cases} x(k+1) = \Phi(k+1, k)\, x(k) + w(k) \\ y(k) = C(k)\, x(k) \end{cases}, \quad k \ge 0 \qquad (3.23)$$

where

$$x(k+1) = x(t_{K+1}), \quad x(k) = x(t_K), \quad \Phi(k+1, k) = \Phi(t_{K+1} - t_K), \quad w(k) = \int_{t_K}^{t_{K+1}} \Phi(t_{K+1} - \tau) B(\tau) u(\tau)\, d\tau$$

The recursive Kalman filter was first put forward in the discrete time domain; the continuous form of the Kalman filter was derived later.

The procedures of the Kalman filter

The discrete Kalman filter includes three procedures: the prediction procedure, the observation procedure, and the estimation procedure.

a. The prediction model is the state model discussed before:

$$x(k) = \Phi \cdot x(k-1) + w(k) \qquad (3.24)$$

Here, $x(k) \in \Gamma^N$ is the process state vector at time step k, and the random variable w(k) represents the process noise vector at time step k.

b. The observation model at time step k is given by

$$z(k) = H \cdot x(k) + v(k) \qquad (3.25)$$

where z(k) is the observation at time $t_k$ (M x 1), v(k) is the observation noise vector (M x 1), and H is the observation matrix (M x N), i.e., the transform for which $z(k) \propto x(k)$ when v(k) = 0.

Here, $z(k) \in \Gamma^M$ is the measurement/observation at time k. The process state vector x(k) is related to the measurement z(k) by the M x N measurement matrix H. The random variable v(k) represents the measurement noise vector. Notice that both A and H are assumed constant here, although in the real world they might vary at each step. The random variables w(k) and v(k) in equations (3.24) and (3.25) are assumed to have white Gaussian distributions and to be independent of each other. The covariance matrix of the process noise is denoted by Q and the covariance matrix of the measurement noise by R; both Q and R are assumed known. Notice that Q and R are assumed constant here, although they might change from time to time during the process. Thus, we have

$$p(w) \sim N(0, Q) \qquad (3.26)$$

$$p(v) \sim N(0, R) \qquad (3.27)$$

$$E[w(k)w(i)^T] = \begin{cases} Q, & i = k \\ 0, & i \ne k \end{cases}, \qquad E[v(k)v(i)^T] = \begin{cases} R, & i = k \\ 0, & i \ne k \end{cases}, \qquad E[w(k)v(i)^T] = 0 \quad \forall\, i, k \qquad (3.28)$$

c. The estimation model

$\hat{x}^-(k)$: the prediction of x(k) before time $t_K$.
$e^-(k) = x(k) - \hat{x}^-(k)$: the prediction error (with zero mean).

The superscripts "-" and "+" here denote "before" and "after" the measurement, respectively. The covariance matrix of the prediction error is

$$P^-(k) = E[e^-(k)\, e^-(k)^T] = E[(x(k) - \hat{x}^-(k))(x(k) - \hat{x}^-(k))^T] \qquad (3.29)$$

For Kalman filtering, in general we assume that the initial values of $\hat{x}^-(k)$ and $P^-(k)$ are known. However, if in practice we cannot calculate the value of $\hat{x}^-(k)$ and its mean is zero, we can let $\hat{x}^-(k) = 0$ and $P^-(k) = E[x^-(k)\, x^-(k)^T]$. Once $\hat{x}^-(k)$ is available, we can use the measurement vector z(k) at time $t_K$ to improve the estimate of x(k). As in Wiener filtering, Kalman filtering also uses the mean squared error metric, and the Kalman gain matrix is defined as

$$K(k) = P^-(k) H^T(k) \left[ H(k) P^-(k) H^T(k) + R(k) \right]^{-1} \qquad (3.30)$$

The physical meaning of the Kalman gain can be understood from the relative changes of each variable: when the measurement noise increases, i.e., $E[v(k)v(k)^T] = R(k)$ increases, the gain K(k) decreases.

The estimate at time step k is $\hat{x}^+(k)$, also called the updated estimate:

$$\hat{x}^+(k) = \hat{x}^-(k) + K(k)\left[ z(k) - H(k)\hat{x}^-(k) \right] \qquad (3.31)$$

where the first term on the right side of the equation is the prediction at time index k, and the second term is the new information obtained from the measurement z(k) at time index k. The updated error covariance matrix is

$$P^+(k) = [I - K(k)H(k)]\, P^-(k) \qquad (3.32)$$

Operation of the Kalman filtering

From the discussion above, it can be seen that a feedback control mechanism is used to estimate the process state: the state is estimated at a time point, and feedback is then obtained by means of measurement. Accordingly, the Kalman filter equations fall into two sub-groups: the equations used for calculating the predictions for the next time point by projecting forward the state and error covariance of the current time point, and the feedback equations which refine the estimate and obtain a posterior estimate by incorporating the new measurement. This process can be considered a prediction-correction process. Figure 3.3 illustrates the operation of the predictor-corrector process in the Kalman filter.

Figure 3.3 Predictor-corrector cycle in a discrete Kalman filter: the predict/time-update step feeds the correct/measurement-update step, which incorporates the measurement z_k and feeds back to the next prediction.

The equations for the time update in discrete Kalman filtering are listed below:

$$\hat{x}^-(k) = A\hat{x}^+(k-1) + w(k) \qquad (3.33)$$

$$P^-(k) = A P^+(k-1) A^T + Q(k-1) \qquad (3.34)$$

We can see that in equations (3.33) and (3.34), the state and covariance are estimated by projecting forward from time index k-1 to time k. The equations for the measurement update in discrete Kalman filtering are equations (3.30)-(3.32) discussed before. The Kalman gain K is computed first in the measurement update procedure using (3.30); then the observation value z(k) is measured and used in equation (3.31) to generate the posterior state estimate; finally, equation (3.32) is used to calculate the posterior error covariance. The two groups of equations constitute a cycle, and the process is carried out recursively: the posterior estimates of the previous cycle are used to compute the new prior estimates in the current cycle. Unlike the Wiener filter, which operates directly on all of the available data for each estimate, the recursive form of the Kalman filter is more feasible to implement in practice.

As discussed above, a Kalman filtering (KF) operation consists of two consecutive stages: prediction and correction/updating. The sequential steps of the Kalman filtering algorithm are summarized below. Initial conditions:

$$\hat{x}^-(0) = E[x(0)], \qquad P^-(0) = E[(x(0) - \hat{x}^-(0))(x(0) - \hat{x}^-(0))^T]$$

Prediction:

Step 1: Predict the state vector: $\hat{x}^-(k) = A\hat{x}^+(k-1) + w(k)$

Step 2: Compute the prediction error covariance: $P^-(k) = A P^+(k-1) A^T + Q(k-1)$

Correction/Updating:

Step 3: Compute the Kalman gain: $K(k) = P^-(k) H^T(k) [H(k) P^-(k) H^T(k) + R(k)]^{-1}$

Step 4: Update the state estimate using the measurement z(k): $\hat{x}^+(k) = \hat{x}^-(k) + K(k)[z(k) - H(k)\hat{x}^-(k)]$

Step 5: Update the prediction error covariance: $P^+(k) = [I - K(k)H(k)]\, P^-(k)$

Step 6: k = k + 1; go to Step 1 for the next filtering cycle.

Figure 3.4 is the block diagram of the operation and steps of the filter, which combines the diagram of Figure 3.3 with the equations in (3.30)-(3.34).

Figure 3.4 Operation of the Kalman filter: starting from the initial estimates $\hat{x}^-(0)$ and $P^-(0)$, the prediction stage computes the state prediction and prediction-error covariance; the correction/updating stage computes the Kalman gain, updates the estimate with the measurement z(k), and updates the prediction-error covariance to give $\hat{x}^+(k)$; then k = k + 1 and the cycle repeats.

3.2 Block Based Motion Estimation using Adaptive Kalman Filtering

3.2.1 Background Overview

Among the various block-matching algorithms, full search (FS) yields the best motion estimation result. However, it requires exhaustive checking of the entire search area and obtains the optimal motion estimate at the cost of intensive computation. To reduce the computational burden of motion estimation, many fast search algorithms, such as three-step search (TSS) [30], four-step search (4SS) [33], adaptive rood pattern search (ARPS) [35], and diamond search (DS) [34], have been developed over the last two decades. However, due to the non-unimodal characteristics of the error surface, one critical problem of the fast search algorithms is that they often converge to a local optimum, resulting in worse motion estimation performance than the FS algorithm.

To improve the performance of BMAs, the Kalman filter was introduced to refine the MVs obtained from conventional fast BMAs [66-70]. Methods based on non-adaptive and adaptive Kalman filtering (KF) were proposed in [70] to enhance the motion estimates resulting from TSS. For non-adaptive Kalman filtering, the state parameter values are pre-assigned and fixed during the process, whereas for adaptive Kalman filtering, the state parameter values can be time-varying. This flexibility allows the adaptive KF to outperform the non-adaptive KF in most cases. An 8x8-block based motion estimation algorithm was introduced in [68], which uses the three-step search algorithm and the adaptive Kalman filtering technique developed in [70] for more accurate motion estimates. Non-adaptive Kalman filtering was used in [69] to enhance motion estimation for low bit rate video coding. Besides motion estimation, the KF can also be utilized in other video applications, such as video stabilization [71].

In our work, we develop a new adaptive scheme for applying the Kalman filter to motion estimation. In the proposed method, the state parameters of the Kalman filter are updated during the filtering process according to the statistics of the MV measurements. Compared with the Kalman filtering methods developed in [70], our method produces more accurate motion vectors (MVs) and better compensated images in terms of PSNR.

For simplicity, in the remaining parts of this chapter, the non-adaptive KF is referred to as NKF, the adaptive KF method developed in [70] is referred to as AKF, and the new adaptive KF scheme developed in this work is referred to as New-AKF.

3.2.2 Introduction

In the remaining parts of this chapter, we present a new adaptive Kalman filtering approach [72] to improve the performance of block-based motion estimation. In our work, measured motion vectors are obtained by a conventional block-matching algorithm (BMA). 1-D and 2-D autoregressive models are used to model the motion correlation between neighboring blocks, thus obtaining the motion prediction. To improve the performance of the Kalman filtering, a new method is proposed for adaptively adjusting the state parameters of the Kalman filter during each iteration according to the statistics of the motion vector (MV) measurements. Our test results indicate the effectiveness of the proposed approach in terms of the peak-signal-to-noise-ratio (PSNR) of the motion compensated images, together with smoother motion vector fields. Furthermore, since the Kalman filtering is independent of the block-matching method, it can be applied to any block-matching algorithm with little modification to improve motion estimation performance. Another benefit of the Kalman filtering is that, after the process, the motion vectors have fractional-pixel accuracy, which requires no additional bit rate.

3.2.3 Motion Modeling and State-Space Representation

1) 1-D causal AR model and state-space representation

In our work, we adopt the first-order autoregressive (AR) [73] model to characterize the MV correlation between blocks. This is based on the following

consideration. In block-matching, the matching process is performed block by block in a raster order, from the top row to the bottom row and from left to right within each row. In most cases, the motion of a block has the highest correlation with its adjacent blocks, and it is reasonable to assume that the motion of a block is influenced only by its adjacent blocks. Due to the scan order of the block-matching process, and for simplicity of implementation, we choose only the immediate-left neighboring block as the model support. The effectiveness of the first-order AR model is confirmed by our tests.

The MV of a macro-block MB(m,n) is defined as V(m,n) = [v_x(m,n), v_y(m,n)]^T, where (m,n) is the index of the current block in the macro-block array of the frame and the superscript "T" denotes matrix transposition. v_x(m,n) and v_y(m,n) are the components in the horizontal and vertical directions, respectively; we assume these two components are independent of each other. The measured MV of a macro-block is obtained by a BMA, for example DS or ARPS, running on a 16x16 macro-block basis. The MVs of the MBs are scanned in a raster order to generate an MV sequence for the Kalman filtering and MV prediction. In this way, the MV of a macro-block is predicted from the previous MV measurement in the MV sequence, i.e., from its left neighbor in the frame, by using the state equation of the Kalman filter:

V(k) = A(k) V(k-1) + W(k)    (3.35)

where W(k) = [w_x(k), w_y(k)]^T is the state noise vector, also called the prediction error, and its covariance matrix is denoted by Q. V(k) is the predicted motion vector, and k is the index of the MV in the MV sequence. The two components w_x(k) and w_y(k) of the state noise are assumed independent of each other, and their probability distributions are assumed to be Gaussian with zero mean and variances q_x(k) and q_y(k), respectively. The transition matrix is denoted by A(k). The measurement equation of the Kalman filtering is given by

Z(k) = H(k) V(k) + N(k)    (3.36)

where N(k) = [n_x(k), n_y(k)]^T is the measurement noise vector and its covariance matrix is denoted by R. The two components n_x(k) and n_y(k) of the measurement noise are assumed independent of each other, likewise with zero-mean Gaussian distributions and variances r_x(k) and r_y(k). Based on the new measurement Z(k) (obtained by a conventional block-matching algorithm) at time index k, the Kalman filter updates the prediction of V(k), which was obtained from the previous measurement Z(k-1). Moreover, in this work the measurements are obtained by a fast search algorithm in which the macro-blocks are processed independently; thus, we can assume that the measurement errors are independent of each other.

In the Kalman filtering, the motion vector of a macro-block is predicted from the previous MV measurement in the scan order, i.e., from its left neighbor. The 1-D AR model is defined as

v_x(m,n) = a_1 v_x(m,n-1) + w_x(m,n)    (3.37)

v_y(m,n) = b_1 v_y(m,n-1) + w_y(m,n)    (3.38)

where w_x(m,n) ~ N(0, q_x(m,n)) and w_y(m,n) ~ N(0, q_y(m,n)) are the components of the prediction error in the horizontal and vertical directions. They are assumed independent, with zero-mean Gaussian distributions and variances q_x(m,n) and q_y(m,n), respectively. Replacing the spatial locations (m,n) and (m,n-1) with the time indices k and k-1 in the raster scan order, and reorganizing equations (3.37) and (3.38) into matrix form, we obtain the state equation of the Kalman filter as

V(k) = A V(k-1) + W(k)    (3.39)

where V(k) = [v_x(k), v_y(k)]^T, W(k) = [w_x(k), w_y(k)]^T, and A = diag(a_1, b_1).

Here, V(k) = [v_x(m,n), v_y(m,n)]^T = [v_x(k), v_y(k)]^T is the predicted MV, and W(k) = [w_x(m,n), w_y(m,n)]^T = [w_x(k), w_y(k)]^T is the state noise vector, whose covariance matrix is denoted by Q. Accordingly, the variances of the components w_x(k) and w_y(k) are denoted by q_x(k) and q_y(k), respectively. The matrix A is a 2x2 diagonal state transition matrix with elements a_1 and b_1. The measurement equations of the Kalman filtering are given by

z_x(m,n) = v_x(m,n) + n_x(m,n)    (3.40)

z_y(m,n) = v_y(m,n) + n_y(m,n)    (3.41)

where n_x(m,n) ~ N(0, r_x(m,n)) and n_y(m,n) ~ N(0, r_y(m,n)) are the components of the measurement error in the horizontal and vertical directions. They are assumed independent, with zero-mean Gaussian distributions and variances r_x(m,n) and r_y(m,n), respectively. Replacing the spatial location (m,n) with the time index k in the raster scan order, and reorganizing equations (3.40) and (3.41) into matrix form, we obtain the measurement equation of the Kalman filter as

Z(k) = H V(k) + N(k)    (3.42)

where Z(k) = [z_x(k), z_y(k)]^T, N(k) = [n_x(k), n_y(k)]^T, and H is the 2x2 identity matrix.

Here, N(k) = [n_x(k), n_y(k)]^T is the measurement noise vector, whose covariance matrix is denoted by R, and H is the measurement matrix. Based on the new observation Z(k) at time index k, the Kalman filter updates the predicted V(k), which was obtained from the previous observation Z(k-1). For the sake of simplicity and without loss of generality, we assume that the x and y components of W(k) have the same variance, q_x(k) = q_y(k) = q(k), and that the components of N(k) have the same variance, r_x(k) = r_y(k) = r(k).

2) 2-D causal AR model and state-space representation

A 2-D causal model is used to exploit the motion information of the two-dimensional neighboring blocks that are processed before the current block. The left-neighbor and top-neighbor blocks of the current block, which usually have the highest motion correlation with the current one, are chosen for the 2-D AR model. Hence, the 2-D AR model is represented by

v_x(m,n) = a_1 v_x(m,n-1) + a_2 v_x(m-1,n) + w_x(m,n)    (3.43)

v_y(m,n) = b_1 v_y(m,n-1) + b_2 v_y(m-1,n) + w_y(m,n)    (3.44)

The Kalman filtering is defined in one-dimensional form using a time index. For multi-dimensional cases (2-D or higher), it is difficult to establish a suitable high-dimensional state-space representation [74], and the Kalman filtering cannot be employed directly. Since the block-matching is processed in a raster scan order, in [70] the 2-D problem is transformed into an equivalent 1-D problem by replacing the block location indices (m,n), (m,n-1), and (m-1,n) in equations (3.43) and (3.44) with the 1-D indices k, k-1, and k-2, respectively. Thus, equations (3.43) and (3.44) are transformed into the 1-D form as

v_x(k) = a_1 v_x(k-1) + a_2 v_x(k-2) + w_x(k)    (3.45)

v_y(k) = b_1 v_y(k-1) + b_2 v_y(k-2) + w_y(k)    (3.46)

Reorganizing equation (3.45) into state-equation form for the x component, we get

\begin{pmatrix} x_1(k) \\ x_2(k) \end{pmatrix} = \begin{pmatrix} a_1 & a_2 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} x_1(k-1) \\ x_2(k-1) \end{pmatrix} + \begin{pmatrix} 1 \\ 0 \end{pmatrix} w_x(k)    (3.47)

where x_1(k) = v_x(k) and x_2(k) = v_x(k-1). Similarly, the measurement equation for the x component can be written in matrix form as

z_x(k) = [1 \;\; 0] \begin{pmatrix} x_1(k) \\ x_2(k) \end{pmatrix} + n_x(k)    (3.48)

The state and measurement equations for the y component can be derived in the same way. As in the 1-D case, the components of the prediction error W(k) and the measurement error N(k) are assumed to have zero-mean Gaussian distributions with the same variances q(k) and r(k), respectively.
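As a quick illustration of this companion-form construction, the following MATLAB fragment builds the state-space matrices of equations (3.47) and (3.48) for the x component and propagates the state once. The AR coefficient values here are illustrative assumptions, not the tuned values used in our experiments.

a1 = 0.9;  a2 = 0.05;        % example AR(2) coefficients (illustrative values)
A  = [a1 a2; 1 0];           % companion-form transition matrix of (3.47)
G  = [1; 0];                 % noise input vector: only x_1 receives w_x(k)
H  = [1 0];                  % measurement matrix of (3.48): z_x(k) observes x_1(k)

x_prev = [0.5; 0.2];         % [v_x(k-1); v_x(k-2)] (example state)
x_new  = A * x_prev;         % noise-free state propagation for illustration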

3.2.4 A New Model for Adaptive Kalman Filtering (New-AKF)

For the KF to produce optimal results, the correct values of A, H, W, and N need to be assigned. However, because the motion of a video sequence is unknown before the Kalman filtering, this is usually impossible, and these values must first be estimated. For non-adaptive Kalman filtering, the state parameters are assigned pre-defined values that are fixed during the whole process; for adaptive Kalman filtering, the state parameters are updated during the process to reflect the state transitions. For simplicity and without loss of generality, we assume that A and H are constants. We also assume that the x and y components of W(k) have the same variance q(k) and that the components of N(k) have the same variance r(k).

An adaptive scheme was developed in [70] which estimates the values of q(k) and r(k) based on two distortion functions, D_1 and D_2, defined by

D_1 = \frac{1}{N^2} \sum_{j=0}^{N-1} \sum_{h=0}^{N-1} | B_i(m+j, n+h) - \tilde{B}_{i-1}(m+j+z(k)_x, n+h+z(k)_y) |    (3.49)

D_2 = \frac{1}{N^2} \sum_{j=0}^{N-1} \sum_{h=0}^{N-1} | B_i(m+j, n+h) - \tilde{B}_{i-1}(m+j+\hat{v}(k)^-_x, n+h+\hat{v}(k)^-_y) |    (3.50)

where i is the current frame index. D_1 is the mean absolute difference (MAD) between the current block B_i and the motion-compensated block \tilde{B}_{i-1} using the measured MV z(k) = [z(k)_x, z(k)_y]^T. D_2 is the MAD between the current block and the motion-compensated block using the predicted MV \hat{V}^-(k) = [\hat{v}(k)^-_x, \hat{v}(k)^-_y]^T. N is the side length of the block and (m,n) is the location index of the block in the current frame. From (3.49) and (3.50) we can see that D_1 and D_2 are the distortions due to the measurement error and the prediction error, respectively; they were used to calculate the values of q(k) and r(k) in [70].

The assumption behind the method developed in [70] is that the measurement error N(k) and the prediction error W(k) are the only sources of distortion in the motion compensated block. However, this is usually not true. In most cases, even if we know the real motion, i.e., the true value of the block MV, so that there are no measurement or prediction errors, the difference between the motion compensated block and the current block is usually not zero. This may be caused by other factors such as illumination changes or changes in covered/uncovered areas due to object motion in the frame. All of these can cause dissimilarity between the current block and the motion-compensated block even when the true MV is known or the optimal matching block is found. To take these factors into consideration and to have a model better suited to the real world, in this work we propose a new method for estimating the error variances q(k) and r(k).

First, we introduce a new distortion function D_0, which is the MAD between the current block B_i and the motion-compensated block \tilde{B}_{i-1} using the true MV V(k) = [v(k)_x, v(k)_y]^T. Hence

D_0 = \frac{1}{N^2} \sum_{j=0}^{N-1} \sum_{h=0}^{N-1} | B_i(m+j, n+h) - \tilde{B}_{i-1}(m+j+v(k)_x, n+h+v(k)_y) |    (3.51)

Here, V(k) is the true, or optimal, value of the MV, which yields the best possible motion compensated image and the least distortion value D_0. It is obvious that the following conditions are satisfied:

D_0 ≤ D_1 and D_0 ≤ D_2    (3.52)

D_0 is the distortion from the real MV. If D_0 is zero, the block can be reconstructed exactly from the reference block \tilde{B}_{i-1} with no distortion; however, in most cases D_0 is non-zero. The smaller the value of D_0, the better the motion compensated block obtained from the real MV. Since the real value of the MV is unknown, the value of D_0 can only be estimated. Based on (3.52), in our work D_0 is estimated by

D_0 = min(D_1, D_2) · R_a,  0 ≤ R_a ≤ 1    (3.53)

where R_a is a weight factor ranging from 0 to 1. From our experiments, we observed that usually D_1 > D_0.

Second, we define a new parameter, std_mv, which is the average of the standard deviations (STD) of the measured MV components across the frame:

std_mv = mean(σ_x, σ_y)    (3.55)

where

σ_x = \left( \frac{1}{M·N - 1} \sum_{i=1}^{M} \sum_{j=1}^{N} ( mv_x(i,j) - \overline{mv}_x )^2 \right)^{1/2}

σ_y = \left( \frac{1}{M·N - 1} \sum_{i=1}^{M} \sum_{j=1}^{N} ( mv_y(i,j) - \overline{mv}_y )^2 \right)^{1/2}

Here σ_x is the standard deviation of the x components of the measured MVs, σ_y is the standard deviation of the y components, and M and N are the numbers of columns and rows of macro-blocks in the frame.

We use std_mv to evaluate the motion complexity and the magnitude of the measurement and/or prediction errors. A small value of std_mv indicates that the motion is steady or smooth; in this case, it is more likely that the current block can be reconstructed exactly from the motion-compensated block using the true MV, so D_0 is close to zero, and so is R_a. In contrast, a large value of std_mv means that the motion is complex or fast; in this case it is less likely that the current block can be reconstructed exactly from the motion-compensated block using the true MV, so D_0 is more likely to be a large value close to D_1 or D_2, and R_a should be close to 1. Based on this analysis, the rule for deciding the value of R_a is described by the following pseudo code:

if std_mv ≤ T1, Ra = 0.2
elseif T1 < std_mv ≤ T2, Ra = 0.3
elseif T2 < std_mv ≤ T3, Ra = 0.45
elseif T3 < std_mv ≤ T4, Ra = 0.6
elseif T4 < std_mv ≤ T5, Ra = 0.7
elseif T5 < std_mv ≤ T6, Ra = 0.8
else Ra = 0.9
end

The values of the thresholds T1 through T6 are obtained experimentally; in our experiments on sequences with CIF format, they are given the values 0.1, 0.2, 0.5, 1, 1.7, and 2.5, respectively.

Based on the definitions of D_0, D_1, and D_2, we assume that the difference between D_1 and D_0 is due to the measurement error and the difference between D_2 and D_0 is due to the prediction error. Thus, in our work the variances q(k) and r(k) are estimated by

q(k) = \frac{(D_2 - D_0)^2}{(D_2 - D_0)^2 + (D_1 - D_0)^2}    (3.56)

r(k) = \frac{(D_1 - D_0)^2}{(D_2 - D_0)^2 + (D_1 - D_0)^2}    (3.57)
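A minimal MATLAB sketch of this adaptive variance estimation is given below. It assumes D1 and D2 have already been computed for the current block and that std_mv is available for the frame; the thresholds are the CIF values quoted above, and the function name estimate_qr is our own illustrative choice rather than code from a reference implementation.

function [q, r] = estimate_qr(D1, D2, std_mv)
% Estimate the state and measurement noise variances q(k) and r(k)
% from the distortions D1, D2 and the frame statistic std_mv,
% following equations (3.53), (3.56), and (3.57). (Illustrative sketch.)
T  = [0.1 0.2 0.5 1 1.7 2.5];          % thresholds T1..T6 (CIF sequences)
Rv = [0.2 0.3 0.45 0.6 0.7 0.8 0.9];   % corresponding Ra values
Ra = Rv(1 + sum(std_mv > T));          % count how many thresholds are exceeded

D0  = min(D1, D2) * Ra;                % estimated true-MV distortion (3.53)
den = (D2 - D0)^2 + (D1 - D0)^2;
if den == 0
    q = 0.5;                           % degenerate case: both distortions zero
else
    q = (D2 - D0)^2 / den;             % (3.56)
end
r = 1 - q;                             % equivalent to (3.57)
end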

Notice that, due to the introduction of D_0, in equations (3.56) and (3.57) the variances q(k) and r(k) are calculated using squared terms instead of the cubed terms used in [70]; according to our experiments, the new equations produce better results. The method developed here for estimating q(k) and r(k) can be applied in the same way to adaptive Kalman filtering based on either the 1-D or the 2-D model. During the Kalman filtering procedure, the distortion functions D_0, D_1, and D_2 are calculated for each block, and the values of q(k) and r(k) are updated accordingly. In summary, a new adaptive Kalman filtering algorithm is developed which adaptively adjusts the state parameter variances q(k) and r(k) according to the values of D_0, D_1, and D_2.

3.2.5 Summary of the Proposed Method

We summarize the proposed adaptive algorithm in the following steps:

Step 1) Perform block-matching for the MBs in the current frame and obtain the measured MV for each MB. Calculate the value of std_mv.

Step 2) Select the 1-D or 2-D model for the adaptive KF and set the initial values of the model parameters.

Step 3) Calculate the value of D_1 and decide the value of R_a using the method described in Section 3.2.4.

Step 4) Adaptive Kalman filtering:

a) Obtain the predicted MV:

\hat{V}^-(k) = A(k-1) \hat{V}^+(k-1)    (3.58)

b) Compute D_2 and D_0, and update the values of q(k) and r(k).

c) Compute the covariance of the prediction error:

P^-(k) = A(k-1) P^+(k-1) A^T(k-1) + Q(k-1)    (3.59)

d) Calculate the Kalman gain:

K(k) = P^-(k) H^T(k) [H(k) P^-(k) H^T(k) + R(k)]^{-1}    (3.60)

e) Update the MV estimate and output the final MV:

\hat{V}^+(k) = \hat{V}^-(k) + K(k) [Z(k) - H(k) \hat{V}^-(k)]    (3.61)

f) Update the prediction error covariance:

P^+(k) = [I - K(k) H(k)] P^-(k)    (3.62)

Go to Step 3 for the next block.
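The per-frame filtering loop implied by Steps 3-4 can be sketched in MATLAB as follows. This is a simplified 1-D-model version that, for readability, assumes A = H = I (our experiments use the estimated AR coefficients in A); Z holds the measured MVs in scan order, block_distortions is a hypothetical helper returning D1 and D2 for the current block, and estimate_qr is the sketch given earlier.

% Z: 2-by-K matrix of measured MVs in scan order; std_mv precomputed.
A = eye(2);  H = eye(2);            % simplifying assumption: unit AR coefficients
x = Z(:,1);  P = eye(2);            % initialize with the first measured MV
V = zeros(size(Z));  V(:,1) = x;

for k = 2:size(Z,2)
    x_pri = A * x;                                   % predict the MV (3.58)
    [D1, D2] = block_distortions(k, Z(:,k), x_pri);  % hypothetical helper
    [q, r]   = estimate_qr(D1, D2, std_mv);          % adapt state parameters
    Q = q * eye(2);   R = r * eye(2);
    P_pri = A * P * A' + Q;                          % (3.59)
    K = P_pri * H' / (H * P_pri * H' + R);           % (3.60)
    x = x_pri + K * (Z(:,k) - H * x_pri);            % (3.61): refined MV
    P = (eye(2) - K * H) * P_pri;                    % (3.62)
    V(:,k) = x;                                      % filtered MV, sub-pel accuracy
end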

From the steps described above, we can see that after the Kalman filtering the motion estimate V(k) is usually a real-valued vector, thus yielding a fractional-pixel accuracy estimate. In conventional BMAs, an MV with fractional-pixel accuracy is obtained by interpolation, which usually involves a large amount of computation; the proposed method requires much less computation. Moreover, since the same Kalman filter is employed in both the encoder and the decoder, no additional MV overhead bit rate is required. A brief discussion of the computational complexity is given in the next section.

3.2.6 Experimental Results

The algorithm [75] developed in this chapter for KF based on the 1-D and 2-D AR models is compared with the methods used in [69] and [70] on a set of standard video sequences with CIF resolution. For every sequence, only the Y component is processed to demonstrate the effectiveness of the proposed method, and MAD is selected as the cost function for block-matching. For simplicity of experimentation, video frames are coded in the sequence "IPPPP..." and each P frame uses its previous frame as the reference frame. Motion compensation is calculated with pixel accuracy. Since we focus on the effectiveness of the proposed algorithm, we assume that the frame compression is lossless; this means all frames can be exactly reconstructed from the motion-compensated and residual images.

3.2.6.1 Experimental Results Based on 1-D AR Model

Figure 3.5 PSNR performance of the first 30 frames of "Foreman" (CIF), comparing TSS, TSS with non-adaptive KF, TSS with adaptive KF, and TSS with New-AKF.

Figure 3.6 PSNR performance of the first 30 frames of "Bus" (CIF), comparing DS, DS with non-adaptive KF, DS with adaptive KF, and DS with New-AKF.

Figure 3.5 shows the PSNR performance on "Foreman" using TSS, TSS with non-adaptive KF (NKF), TSS with the adaptive KF (AKF) used in [69] and [70], and TSS with the new adaptive KF (New-AKF) proposed in this work. For the non-adaptive KF, q(k) = 0.8 and r(k) = 0.2. In Figure 3.6, diamond search (DS) is used as the block-matching method. Figures 3.5 and 3.6 reveal that the proposed algorithm has the best performance among the methods in comparison.

The average PSNR values obtained by the various methods are listed in Table 3.1 and Table 3.2. As can be seen from Table 3.1, when TSS is used as the BMA, New-AKF outperforms AKF for all the video sequences; the largest improvement occurs for the sequence "Bus", where the proposed method gains about 0.6 dB over AKF. When DS is used as the BMA, the proposed method outperforms AKF for all of the video sequences except "Mother&daughter", as shown in Table 3.2. Tables 3.1 and 3.2 also show that the proposed algorithm is more effective for videos with large and complex motion, e.g., "Bus", than for sequences with small movement, such as "Akiyo".

Table 3.1 Average PSNR (dB) of the first 30 P frames of the test video sequences with TSS as the BMA

Test Sequences     TSS      TSS+NKF  TSS+AKF  TSS+New-AKF
Carphone           32.0834  31.682   32.9316  33.0231
Mobile Calendar    23.3983  23.4439  23.852   23.9139
Foreman            30.9807  30.9839  32.1496  32.4668
Bus                22.855   21.4493  22.5839  23.1584
Mother&daughter    42.0835  40.0576  42.0988  42.1069
Akiyo              43.6324  43.8449  44.0273  44.0376
Flower Garden      24.8943  24.5559  25.7252  25.8371
News               38.3237  38.0791  38.5233  38.5255

Table 3.2 Average PSNR (dB) of the first 30 P frames of the test video sequences with DS as the BMA

Test Sequences     DS       DS+NKF   DS+AKF   DS+New-AKF
Carphone           33.0035  33.1694  33.6618  33.6649
Mobile Calendar    23.5718  23.9691  24.5274  24.5463
Foreman            32.0045  32.4462  33.4069  33.4658
Bus                20.7197  20.3134  21.2497  21.7195
Mother&daughter    42.2362  41.8784  42.3064  42.3021
Akiyo              43.7575  43.9244  44.0375  44.048
Flower Garden      26.1742  26.6338  27.0665  27.0808
News               38.4449  38.4773  38.6701  38.6727

TSS is selected here as an example to illustrate the additional computation incurred by the New-AKF. Subtraction is counted as addition, and division as multiplication. Consider a block of size NxN. The number of search locations required by TSS is 25, and each search corresponds to one MAD calculation, which consists of (2N^2 - 1) additions, N^2 absolute operations, and one multiplication. Thus, the total computation of TSS for one MB is 25(2N^2 - 1) additions, 25N^2 absolute operations, and 25 multiplications. The computational complexity analysis of the KF is similar to that in [69]. Simplification methods are employed to reduce the computational load of the New-AKF: for example, the sum of absolute differences (SAD) is used instead of MAD for computing the error functions D_1 and D_2, intermediate results are reused, and r(k) is computed from r(k) = 1 - q(k) instead of equation (3.57). The computation required by TSS, NKF, AKF, and the New-AKF is listed in Table 3.3, where N is the MB side length and m is the total number of MBs in a frame. From Table 3.3 it can be concluded that the additional computational load caused by the New-AKF is about 4% of the computation of TSS, which is very small and close to that of AKF.

Table 3.3 Computation required for each macro-block

Algorithm   Addition            Multiplication  Absolute
TSS         25·(2N^2 - 1)       25              25·N^2
NKF         5                   3               0
AKF         2N^2 + 6            8               N^2
New-AKF     2N^2 + 14 - 3/m     9 + 5/m         N^2

3.2.6.2 Experimental Results Based on 2-D AR Model

We also compare the performance of the proposed KF algorithm with the method used in [70] based on the 2-D AR model. Figure 3.7 and Figure 3.8 show the PSNR performance of the first 30 frames of "Carphone" using DS and ARPS as the BMAs, respectively. Notice that for the "Carphone" sequence, the proposed algorithm outperforms the methods in [70] and [69] in both the 1-D and 2-D cases.

Figure 3.7 PSNR performance of the first 30 frames of "Carphone" with DS as the BMA, comparing DS, DS with adaptive KF (1-D), DS with new adaptive KF (1-D), DS with adaptive KF (2-D), and DS with new adaptive KF (2-D).

Figure 3.8 PSNR performance of the first 30 frames of "Carphone" with ARPS as the BMA, comparing ARPS, ARPS with adaptive KF (1-D), ARPS with new adaptive KF (1-D), ARPS with adaptive KF (2-D), and ARPS with new adaptive KF (2-D).

Table 3.4 Average PSNR (dB) of the first 30 frames of the test video sequences with DS as the BMA

Video Sequence     DS       DS+AKF(1-D)  DS+New-AKF(1-D)  DS+AKF(2-D)  DS+New-AKF(2-D)
Carphone           33.0035  33.6618      33.6649          33.7346      33.8012
Mobile Calendar    23.5718  24.5274      24.5463          24.7998      24.8829
Foreman            32.0045  33.4069      33.4658          33.4186      33.5725
Bus                20.7197  21.2497      21.7195          20.7166      20.8807
Mother&daughter    42.2362  42.3064      42.3021          42.3685      42.3926
Akiyo              43.7575  44.0375      44.048           44.1649      44.1081
Flower Garden      26.1742  27.0665      27.0808          27.1814      27.1875

Table 3.5 Average PSNR (dB) of the first 30 frames of the test video sequences with ARPS as the BMA

Video Sequence     ARPS     ARPS+AKF(1-D)  ARPS+New-AKF(1-D)  ARPS+AKF(2-D)  ARPS+New-AKF(2-D)
Carphone           32.9713  33.5399        33.5173            33.7041        33.7546
Mobile Calendar    23.6103  24.5374        24.5525            24.8045        24.8806
Foreman            32.0906  33.4148        33.432             33.5672        33.6771
Bus                22.1827  22.3374        22.3867            22.3235        22.5149
Mother&daughter    42.2026  42.265         42.2615            42.318         42.3388
Akiyo              43.7228  43.9922        44.0043            44.1062        44.0483
Flower Garden      26.1845  27.0518        27.0684            27.1747        27.188

The mean PSNR values obtained by the various methods are listed in Table 3.4 and Table 3.5 using DS and ARPS as the BMAs, respectively. From Table 3.4 and Table 3.5, it can be concluded that the proposed method increases the PSNR significantly over the original block-matching algorithms. The New-AKF and AKF methods based on the 2-D model usually outperform those based on the 1-D model, and the New-AKF approach developed in this work outperforms the AKF method proposed in [70] for most of the video sequences in both the 1-D and 2-D cases, except for "Akiyo". Furthermore, in most cases the New-AKF based on the 2-D model yields the best results among the methods considered herein. Tables 3.4 and 3.5 also show that the proposed algorithm is more effective for videos with fast or complex motion than for sequences with slow movement.

We use diamond search (DS) as an example here to illustrate the additional computation caused by the New-AKF based on the 2-D AR model. For simplicity, subtraction is counted as addition, and division as multiplication. For a block of size NxN, the number of points to be checked by DS is video-content dependent, usually between 13 and 24; here we assume an average of 19 search points for DS. Each search calculates MAD once, which includes (2N^2 - 1) additions, N^2 absolute operations, and one multiplication. Thus, the total computation for DS per MB is 19(2N^2 - 1) additions, 19N^2 absolute operations, and 19 multiplications. To reduce computation, some simplification methods are used: for example, the error functions D_1 and D_2 are calculated using the sum of absolute differences (SAD) instead of MAD, and the variance r(k) is computed from r(k) = 1 - q(k) instead of using the values of D_0, D_1, and D_2 directly. Our analysis and experiments indicate that the increase in computation due to the New-AKF is about 5% of the computation of DS, which is very small and close to that of the AKF in [70].

CHAPTER 4: KALMAN FILTERING BASED MOTION ESTIMATION WITH ADAPTIVE BLOCK PARTITIONING

4.1 Introduction

For block-based motion estimation, each frame is usually divided into an array of macro-blocks of size 16x16. Smaller blocks for block-matching may result in better compensated images than larger blocks, because a more accurate description of the motion is possible; however, they also incur more computation, because more motion vectors need to be estimated for the same frame. In the previous chapter, we introduced a new adaptive algorithm for applying Kalman filtering in block-based motion estimation and compared it with the existing KF methods. In all of these methods, however, the Kalman filter was implemented on blocks of fixed size, resulting in limited performance improvement. Based on the observation that splitting macro-blocks into smaller blocks can sometimes yield better motion estimation, an 8x8-block based motion estimation algorithm was proposed in [68], which splits macro-blocks into 8x8 blocks and uses the KF technique to improve the motion estimates.

In this chapter, a more effective and flexible scheme [76] is proposed which further improves the performance of the Kalman filtering for motion estimation. By introducing two parameters that evaluate motion complexity and magnitude, 16x16 macro-blocks (MBs) are split into 8x8 blocks or 4x4 sub-blocks adaptively before applying the Kalman filter. To further improve the performance of the Kalman filter, a zigzag scanning of the blocks is proposed. Simulation results show that the method developed in this work yields more accurate motion vectors (MVs) and better motion compensated images in terms of PSNR than the methods proposed in [70] and [68]. A benefit of the KF is that motion vectors with fractional-pixel accuracy are obtained with no additional bit rate requirement. In addition, since the Kalman filtering method developed herein is independent of BMAs, it can be integrated with existing block-based search schemes with little or no modification.

4.2 Algorithm Overview

4.2.1 Adaptive Block Partition Algorithm

In [70], the Kalman filter was applied to the MVs of 16x16 macro-blocks. To improve the motion estimation performance, macro-blocks were split into 8x8 blocks in [68] for the Kalman filtering. In our work, we propose a novel method in which the macro-blocks are split into 8x8 blocks or 4x4 sub-blocks adaptively for the KF, as shown in Figure 4.1. The adaptive scheme is based on the following phenomena we have observed: for video sequences containing very smooth or very complex motion, splitting macro-blocks into smaller blocks for the KF usually improves the motion estimation results; in such cases, applying the KF to 4x4 sub-block MVs performs better than applying it to 8x8 block MVs. However, for video sequences with moderately complex or moderately smooth motion, the motion estimation performance of the 8x8 block based KF is better than that based on 16x16 macro-blocks or 4x4 sub-blocks.

Figure 4.1 Partition of a 16x16 MB into 8x8 blocks or 4x4 sub-blocks: (a) 16x16 macro-block; (b) split into 8x8 blocks; (c) split into 4x4 sub-blocks.
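Such a partition is a simple array reshaping operation. The following MATLAB fragment, given purely as an illustration with example data, splits a 16x16 macro-block into the 8x8 and 4x4 cells of Figure 4.1 using mat2cell.

mb     = rand(16);                                 % a 16x16 macro-block (example data)
blk8x8 = mat2cell(mb, [8 8], [8 8]);               % the four 8x8 blocks of Fig. 4.1(b)
blk4x4 = mat2cell(mb, 4*ones(1,4), 4*ones(1,4));   % the sixteen 4x4 sub-blocks of Fig. 4.1(c)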

From our study, the phenomena described above may be attributed to the following reasons. The KF utilizes the spatial correlation between the MVs of neighboring blocks to refine the motion estimates; thus, the smoother the motion, i.e., the higher the correlation between the MVs, the better the performance of the KF. On the other hand, for videos containing complex motion, partitioning MBs into smaller blocks is preferred, since this makes it possible to describe the motion accurately. However, for complex motion there is less smoothness, i.e., less correlation between the MVs, so the performance of the KF is affected adversely. The observed phenomena result from the interaction of these two attributes, illustrated in Figure 4.2.

In addition, the computational load should also be taken into consideration. Partitioning MBs into 8x8 blocks or 4x4 sub-blocks incurs more computation for the Kalman filtering. Thus, when the benefit of block partition and the KF is not obvious, splitting macro-blocks into smaller blocks is not preferred. This is especially true for video frames that have very smooth motion or a large percentage of static areas.

Figure 4.2 An illustration of the two factors that affect the performance of block partition: from very smooth to very complex motion, increasing smoothness improves KF performance but reduces the benefit of block partition, while increasing complexity increases the benefit of block partition but degrades KF performance.

In our work, we introduce the following two variables for evaluating the complexity and magnitude of the frame motion:

std_mv = mean(σ_x, σ_y)    (4.1)

ave_SAD = \frac{1}{M·N} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} SAD_{ij}    (4.2)

Here, σ_x and σ_y are the standard deviations of the MV components in the horizontal and vertical directions, respectively, and std_mv is the same variable as defined in (3.55), i.e., the average of σ_x and σ_y; it is used here as a measure of motion complexity. The variable ave_SAD is the average of the sum of absolute differences (SAD) over the macro-blocks of the frame before applying the Kalman filter; it is used to evaluate the magnitude of the motion in the current frame. From the description above, we assume that the larger the value of std_mv, the more complex the frame motion, and the larger the value of ave_SAD, the faster the motion in the frame.

Based on this analysis, we develop a flexible scheme for the Kalman filtering with adaptive block partitioning (ABP). The method is described by the following pseudo code:

if std_mv < T1 or ave_SAD < S1
    No block partition; apply the KF on 16x16 MBs.
elseif T1 ≤ std_mv < T2
    if ave_SAD ≤ S2
        Split MBs into 4x4 sub-blocks and apply the KF.
    else
        Split MBs into 8x8 blocks and apply the KF.
    end
else
    Split MBs into 4x4 sub-blocks and apply the KF.
end

where T1, T2, S1, and S2 are thresholds for std_mv and ave_SAD, respectively. The procedure is also illustrated in Table 4.1. In Table 4.1, Case 1 covers frames with very smooth and slow motion or a large percentage of still area; in this case no block partition is carried out. Case 2 covers frames whose motion is neither very smooth nor very complex: in Case 2a the frames have relatively slow motion and MBs are split into 4x4 sub-blocks for the KF, while in Case 2b the frames have relatively fast motion and MBs are split into 8x8 blocks for the KF. Case 3 covers frames with very complex motion, for which MBs are split into 4x4 sub-blocks for the KF. Depending on the block-matching algorithm, the threshold values may differ slightly; in our tests on video sequences with CIF format, we chose empirically T1 = 0.2, T2 = 2.8, S1 = 100, and S2 = 800. A MATLAB sketch of this decision rule is given after Table 4.1.

Table 4.1 Partition of macro-blocks based on the values of std_mv and ave_SAD

Case 1: std_mv < T1 or ave_SAD < S1    No block partition; apply the KF on 16x16 MBs.
Case 2: T1 ≤ std_mv < T2
    Case 2a: ave_SAD ≤ S2              Split MBs into 4x4 sub-blocks and apply the KF.
    Case 2b: ave_SAD > S2              Split MBs into 8x8 blocks and apply the KF.
Case 3: std_mv ≥ T2                    Split MBs into 4x4 sub-blocks and apply the KF.
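The following MATLAB fragment is a minimal sketch of this partition rule, together with the two frame statistics of equations (4.1) and (4.2); the function name choose_block_size and the returned side length are our own illustrative conventions, and the thresholds are the CIF values quoted above.

% mvx, mvy: M-by-N arrays of measured MV components; sad: M-by-N block SADs.
std_mv  = mean([std(mvx(:)), std(mvy(:))]);   % motion complexity, eq. (4.1)
ave_SAD = mean(sad(:));                       % motion magnitude,  eq. (4.2)
bsize   = choose_block_size(std_mv, ave_SAD); % KF block side for this frame

function bsize = choose_block_size(std_mv, ave_SAD)
% Decide the KF block size for the current frame, per Table 4.1.
T1 = 0.2;  T2 = 2.8;  S1 = 100;  S2 = 800;    % empirical CIF thresholds
if std_mv < T1 || ave_SAD < S1
    bsize = 16;                  % Case 1: keep 16x16 macro-blocks
elseif std_mv < T2
    if ave_SAD <= S2
        bsize = 4;               % Case 2a: split into 4x4 sub-blocks
    else
        bsize = 8;               % Case 2b: split into 8x8 blocks
    end
else
    bsize = 4;                   % Case 3: very complex motion
end
end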

4.2.2 Zigzag Scan for Blocks and Sub-blocks

In [68], the macro-blocks of a frame are scanned in a raster order from the top-left corner of the frame to the bottom-right corner; within each macro-block, the 8x8 blocks are scanned in a zigzag order, as illustrated in Figure 4.3(a), to produce the 8x8 block sequence for the Kalman filtering. To further utilize the spatial correlations between 8x8 blocks or 4x4 sub-blocks from the same or different macro-blocks, we adopt the zigzag scan orders illustrated in Figure 4.3(b) and (c), in which the zigzag scanning of 8x8 blocks or 4x4 sub-blocks is extended to the whole frame. In this way, during the Kalman filtering, the MV of block k is estimated from the preceding block with index k-1 in the scan order, which is always one of its nearest neighbors. As a result, a better motion estimate is expected from this scan order.

Figure 4.3 Different scan orders for 8x8 blocks and 4x4 sub-blocks: (a) zigzag scan of 8x8 blocks within each macro-block [68]; (b) frame-wide zigzag scan of 8x8 blocks; (c) frame-wide zigzag scan of 4x4 sub-blocks.
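A frame-wide zigzag (serpentine) scan of a block grid can be generated in a few lines of MATLAB, as sketched below for an R-by-C array of blocks; this is our own illustrative implementation of such a scan, in which consecutive indices are always spatial neighbors, not the exact code of our encoder.

function order = zigzag_order(R, C)
% Serpentine scan of an R-by-C block grid: odd rows left-to-right,
% even rows right-to-left, so block k-1 is always a neighbor of block k.
idx = reshape(1:R*C, C, R)';                 % row-major linear block indices
idx(2:2:end, :) = fliplr(idx(2:2:end, :));   % reverse every second row
order = reshape(idx', 1, []);                % concatenate rows into scan order
end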

4.2.3 Motion Modeling and State-Space Representation

The motion vector of a macro-block is defined as V = [v_x, v_y]^T, and the horizontal and vertical components v_x and v_y are assumed independent. The measured MV of an MB is obtained by a block-matching method running on a 16x16 MB basis. The 4 blocks or 16 sub-blocks split from the same MB share the same measured MV; this means that instead of having one MV measurement per MB for motion estimate adjustment, the Kalman filter has 4 identical MV measurements for the 8x8 blocks or 16 identical MV measurements for the 4x4 sub-blocks. The motion vector of a block/sub-block can be estimated from its preceding block/sub-block, according to the index k in the zigzag sequence, using the state equation of the Kalman filter:

V(k) = A(k-1) V(k-1) + W(k)    (4.3)

where the state noise vector is denoted by W(k) = [w_x, w_y]^T and its covariance matrix is Q. The two components w_x and w_y of W(k) are assumed independent of each other, with zero-mean Gaussian distributions and variances q_x and q_y, respectively. A(k-1) is the transition matrix. The measurement equation of the KF is given by

Z(k) = H(k) V(k) + N(k)    (4.4)

where the measurement noise vector is denoted by N(k) = [n_x, n_y]^T and its covariance matrix is R. H(k) is the measurement matrix. The two noise components n_x and n_y of N(k) are assumed independent of each other and, as is common, are assumed to have zero-mean Gaussian distributions with variances r_x and r_y, respectively. Based on the new measurement Z(k), obtained by a conventional BMA at time index k, the Kalman filter updates the prediction of V(k), which was obtained from the previous measurement Z(k-1).

4.2.4 Kalman Filtering with Adaptive State Parameter Adjustment

To further improve the performance of the KF, the effective adaptive KF (New-AKF) method developed in the previous chapter is adopted here. In addition to the measurement error function D_1 and the prediction error function D_2 defined in [70], a new distortion function D_0 is introduced for the New-AKF, which is the MAD between the current block and the block motion-compensated with the real motion vector v(k) = [v_x(k), v_y(k)]^T. Without loss of generality, the matrices A and H are assumed to be constants, the components of W(k) are assumed to have the same variance q(k), and the components of N(k) are assumed to have the same variance r(k). Based on the values of D_0, D_1, and D_2, the state parameters q(k) and r(k) are adjusted adaptively during each iteration of the KF. A more detailed description of the New-AKF algorithm is given in Chapter 3.

4.2.5 Summary of AKF2-ABP Method

For the sake of clarity, in the remaining parts of the chapter, the adaptive KF developed in [70] is denoted as AKF1 and the new adaptive KF method developed in Chapter 3 is denoted as AKF2. In this subsection, the adaptive Kalman filtering algorithm (AKF2) is combined with the adaptive block partitioning (ABP), and the proposed AKF2-ABP algorithm is summarized as follows:

Step 1) Measure the MVs and split macro-blocks adaptively. Measure the MV for each MB of the current frame using a BMA. According to the scheme described in Section 4.2.1, make the block partition decision and split all the MBs in the current frame into blocks or sub-blocks accordingly. After block partition, assign the measured MV to each block or sub-block and scan the blocks/sub-blocks in a zigzag order to obtain a sequence of MVs for the Kalman filtering.

Step 2) Kalman filter the measured MV sequence:

a) Predict the MV before each measurement using:

\hat{V}^-(k) = A(k-1) \hat{V}^+(k-1)    (4.5)

b) Compute D_0, D_1, and D_2, and update the state parameters of the KF.

c) Compute the covariance of the prediction error via:

P^-(k) = A(k-1) P^+(k-1) A^T(k-1) + Q(k)    (4.6)

d) Calculate the Kalman gain:

K(k) = P^-(k) H^T(k) [H(k) P^-(k) H^T(k) + R(k)]^{-1}    (4.7)

e) Update the MV estimate using the measurement Z(k) and output the final MV:

\hat{V}^+(k) = \hat{V}^-(k) + K(k) [Z(k) - H(k) \hat{V}^-(k)]    (4.8)

f) Update the prediction error covariance:

P^+(k) = [I - K(k) H(k)] P^-(k)    (4.9)

Go to step 2.a) for the next MB/block/sub-block.

After the Kalman filtering, the components of the MV estimate V(k) are usually real values, resulting in a fractional-pixel accuracy motion estimate. Since the Kalman filter is employed in the same way in the encoder and the decoder, no additional bits are required to transmit the MVs. Some of the Matlab simulation code can be found in Appendix C.

4.3 Experimental Results

We compare the performance of the proposed algorithm with the methods employed in [68] and [70] using the standard image sequences listed in Table 4.2 and Table 4.3. In the test, all the sequences have CIF format except "Tennis" and "Football", which have QCIF and SIF format, respectively. For every image sequence, only the Y component is processed to illustrate the effectiveness of the work described herein, and MAD is used as the cost function for block-matching. In our experiments, video frames are coded in the sequence "IPPPP..." and each P frame uses its immediately previous frame as the reference frame. Since the focus is on the effectiveness of the proposed algorithm, we assume that the frame compression is lossless and all frames can be exactly reconstructed from the motion-compensated and residual images. Because of the block partition and the zigzag scan order for the sub-blocks, and to reduce implementation complexity, the proposed method is tested only on the 1-D AR model. Instead of TSS as used in [68] and [70], DS and ARPS are used as the BMAs to generate the measured MVs, and motion estimation is calculated with pixel accuracy.

Figure 4.4 shows the MV fields of the 25th frame of the "Carphone" sequence generated using DS and AKF2 with different block partitions. Four different approaches are tested in our experiments for comparison: the conventional BMA; the BMA with AKF1 applied to 16x16 MBs as in [70]; the BMA with AKF1 based on 8x8 blocks and the scan order used in [68]; and the BMA with AKF2 and adaptive block partitioning (AKF2-ABP) proposed by us. Figure 4.5 shows the PSNR performance on the "Football" sequence with DS as the BMA. In Figure 4.6, ARPS is used as the fast BMA and the PSNR performance of the four methods is compared on the "Bus" sequence. From Figure 4.5 and Figure 4.6 we can see that the proposed algorithm improves the performance of the fast BMAs effectively.

Figure 4.4 MVs of the 25th frame of the "Carphone" sequence generated by DS and AKF2 with different block partitions: (a) MVs by DS on 16x16 macro-blocks; (b) MVs by DS with AKF2 on 16x16 macro-blocks; (c) MVs by DS with AKF2 on 8x8 blocks; (d) MVs by DS with AKF2 on 4x4 sub-blocks.

Figure 4.5 PSNR performance of the first 50 P frames of "Football" with DS as the BMA and search parameter p = 15, comparing DS on 16x16 macro-blocks, DS with AKF1 on 16x16 macro-blocks, DS with AKF1 on 8x8 blocks, and DS with AKF2 and adaptive block partitioning.

Figure 4.6 PSNR performance of the first 50 P frames of "Bus" with ARPS as the BMA and search parameter p = 15, comparing ARPS on 16x16 macro-blocks, ARPS with AKF1 on 16x16 macro-blocks, ARPS with AKF1 on 8x8 blocks, and ARPS with AKF2 and adaptive block partitioning.

Table 4.2 and Table 4.3 list the average PSNR values of the motion compensated images for the first 50 P frames obtained by the various methods; the numbers in parentheses denote the macro-block (or block) size that the KF is applied to. From Table 4.2 and Table 4.3, we conclude that the proposed method produces the best PSNR results among the algorithms in comparison. When DS is used as the BMA, the overall average PSNR improvements of the developed method over the other three schemes in Table 4.2 are 0.82 dB, 0.28 dB, and 0.46 dB, respectively. When ARPS is selected as the BMA, the corresponding improvements are 0.78 dB, 0.30 dB, and 0.46 dB.

Table 4.2 Average PSNR (dB) of the first 50 P frames of the test video sequences using DS and adaptive Kalman filtering

Videos                  DS       DS+AKF1 (16x16)  DS+AKF1 (8x8)  DS+AKF2-ABP
Carphone (CIF)          33.5548  34.2017          33.9835        34.3777
Mobile (CIF)            23.8275  24.7598          24.2167        24.9533
Foreman (CIF)           32.3274  33.4842          32.9288        33.5538
Bus (CIF)               21.2797  21.756           21.4553        22.1093
Mother&daughter (CIF)   41.6457  41.8846          41.8201        42.0733
Akiyo (CIF)             43.0379  43.3753          43.2526        43.4185
Flower Garden (CIF)     26.3233  27.2574          27.1261        27.4955
News (CIF)              38.4463  38.6414          38.6366        38.927
Tennis (QCIF)           29.7339  30.1426          30.0759        30.6756
Football (SIF)          21.7430  21.8285          21.9610        22.5125

Table 4.3 Average PSNR (dB) of the first 50 P frames of the test video sequences using ARPS and adaptive Kalman filtering

Videos                  ARPS     ARPS+AKF1 (16x16)  ARPS+AKF1 (8x8)  ARPS+AKF2-ABP
Carphone (CIF)          33.5031  34.0944            33.8696          34.3044
Mobile (CIF)            23.869   24.7748            24.2454          24.9804
Foreman (CIF)           32.3829  33.461             32.9484          33.6038
Bus (CIF)               22.3702  22.5566            22.5229          23.0141
Mother&daughter (CIF)   41.6072  41.8205            41.763           42.0146
Akiyo (CIF)             43.0124  43.3246            43.2014          43.3606
Flower Garden (CIF)     26.3321  27.2502            27.1371          27.5035
News (CIF)              38.3828  38.5815            38.5644          38.8519
Tennis (QCIF)           29.1686  29.5319            29.4563          30.1196
Football (SIF)          21.7652  21.8275            21.9351          22.4894

To illustrate the increase in computational complexity, DS is used as an example here. Consider the case where the block size is NxN and the block-matching criterion is MAD. In block-matching, the number of locations to be checked by DS for each block usually ranges from 13 to 24; here we assume an average of 19 search locations. Each search corresponds to one MAD calculation, which consists of (2N^2 - 1) additions, N^2 absolute operations, and one multiplication. For a normal DS, the total computation for one MB is thus 19(2N^2 - 1) additions, 19N^2 absolute operations, and 19 multiplications. The computational complexity analysis of the KF is similar to that in [69]. From our analysis in the previous chapter, the additional computational load caused by the AKF2 is close to that of the AKF1 in [70] and is about 5% of the computation of DS. The computation for adaptive block partitioning is carried out only once per frame; for each MB, the average computation incurred by ABP is about 5 additions and 2 multiplications, which is negligible compared with the block-matching computation for each macro-block. When an MB is split into 8x8 blocks or 4x4 sub-blocks, the computation incurred by the AKF2 increases by a factor of 4 or 16, respectively. Thus, for 16x16 MBs, 8x8 blocks, and 4x4 sub-blocks, the additional computation incurred by the AKF2-ABP is roughly 5%, 20%, and 80% of the computation of DS. Since in most cases MBs are split into 8x8 blocks, the additional computation of the method developed herein is roughly 20% of the computation of DS.

Test on high resolution videos

Since high definition videos are expected to become more and more popular in future video applications, we also test our methods on videos with high resolution. The test video sequences and their formats are listed in Table 4.4; the resolution of the 4CIF format is 704x576 and that of the 720p format is 1280x720. Note that for high resolution videos, because of the increased resolution of each frame, each 16x16 macro-block represents a much smaller area than an MB in standard-resolution videos. In addition, the computation for block-matching also increases greatly due to the larger number of pixels per frame. Based on the characteristics of high resolution videos, a different set of threshold values is selected in our method for making the block partition decision. Diamond search is used here for block-matching, and the test results are listed in Table 4.4.

Table 4.4 Average PSNR (dB) of the first 30 P frames of the test video sequences using DS and adaptive Kalman filtering

Videos           DS       DS+AKF (16x16)  DS+AKF (8x8)  DS+New-AKF+ABP
City (4CIF)      30.6161  31.2209         30.9687       31.396
Ice (4CIF)       33.567   33.828          33.96         34.503
Soccer (4CIF)    30.11    30.472          30.402        30.936
Park run (720p)  24.881   25.483          25.165        25.495
Shields (720p)   33.330   33.721          33.538        33.802

(4CIF: 704x576, 720p: 1280x720)

Figure 4.7 PSNR performance of the first 30 P frames of "Ice" with DS as the BMA and search parameter p = 15, comparing DS on 16x16 macro-blocks, DS with AKF1 on 16x16 macro-blocks, DS with AKF1 on 8x8 blocks, and DS with AKF2 and adaptive block partitioning.

Figure 4.8 PSNR performance of the first 30 P frames of "Shields" with DS as the BMA and search parameter p = 15, comparing DS on 16x16 macro-blocks, DS with AKF1 on 16x16 macro-blocks, DS with AKF1 on 8x8 blocks, and DS with AKF2 and adaptive block partitioning.

Table 4.4 lists the average PSNR values of the motion compensated images for the first 30 P frames obtained by the various methods. From Table 4.4, we conclude that the proposed method yields the best PSNR results among the algorithms in comparison. The PSNR performance on "Ice" and "Shields" is plotted in Figure 4.7 and Figure 4.8, respectively; both figures show that the proposed algorithm improves the PSNR performance effectively over the approaches in comparison.

From the tests above, we can see that the proposed approach effectively improves upon the performance of the existing methods for a wide variety of video formats. Such robust performance makes it an ideal temporal redundancy extraction engine for both standard and high-definition video transmission and new digital TV applications. In addition, since the proposed method can promptly adjust the block partition according to the smoothness/complexity of the motion in each frame, it is well suited to videos containing frequent scene changes or scene cuts.

CHAPTER 5: AN ADAPTIVE VIDEO CODING SYSTEM BASED ON BINARY TREE BLOCK PARTITION AND MOTION ESTIMATION

Besides the techniques discussed in the previous chapters, which use fixed-size blocks (16x16, 8x8, or 4x4) for motion estimation, another direction of research in video compression is the use of variable-size blocks. Although fixed-size block matching (FSBM) has the advantages of simplicity and regularity, it also incurs blocking artifacts. According to recent research [37-42], motion estimation and compensation based on variable-size blocks has better compression performance than that based on fixed-size blocks, and thus has potential for future applications.

Variable-size block matching (VSBM), introduced by Chan et al. [39], utilizes small blocks for complex motion areas and large blocks for smooth motion areas. The frame is partitioned into blocks using a binary tree, with each step dividing a block into two equal halves. A quad-tree was used in [37, 40] for block partition. Servais et al. introduced a VSBM approach [77] which also uses a binary tree for block partition; however, instead of splitting blocks into halves as in [39], blocks are split by a horizontal or vertical line whose location is decided by the motion compensation error surface. This allows a more accurate description of the motion, resulting in better rate-distortion (R-D) [78] performance. However, the computational and implementation complexity of these techniques is very high, which prevents their use in real-time applications. Thus, reducing the computational complexity, i.e., the encoding time, is an active topic in this research area.

In this chapter, we study a video codec which uses variable-size blocks of rectangular shape for motion estimation. The variable-size blocks are generated by a binary tree block partition. The prototype of this video codec was put forward in [77]. In our work, by incorporating the new algorithms and techniques proposed by us, a new adaptive encoder is developed. Our simulation results show that the encoding speed is improved significantly with almost no sacrifice of the final compressed video quality. In the following sections, we first give a brief introduction to the video codec; then the new algorithms developed by us are described in detail; experimental results are given to demonstrate the effectiveness of the proposed methods; and conclusions are drawn at the end.

5.1 Overview of the Binary Tree Video Coding System

5.1.1 Frame Type and Frame Encoding

A block diagram of the video encoder is illustrated in Figure 5.1. Three frame types, I, P, and B, are utilized in the encoder. I frames are encoded independently using still-image transform coding techniques. P frames are encoded from one reference frame. B frames are encoded from two reference frames in two directions, the immediately previous one and the immediately following one. Both I and P frames can be used as reference frames.

Figure 5.1 Block diagram of the video encoder (intra-frame encoding/decoding for I frames; motion estimation and compensation with a buffer of coded frames for P and B frames; entropy encoding/decoding of the residual by matching pursuit; output of the compressed video bitstream).

5.1.2 Binary Tree Block Partition and Motion Estimation

For a frame containing complex motion, the most efficient way to represent the motion is to split the frame into blocks such that each block has uniform motion. However, searching for the optimal partition among all possible partitions is computationally very expensive. In this video encoder, a simple binary partition tree (BPT) based block splitting method is adopted: a block is split into two child blocks by a straight horizontal or vertical line, and each child block can be further split in the same way. The direction of the line is decided by the side lengths of the block being split: if the block's width is greater than its height, a vertical line is used; otherwise, a horizontal line is chosen.

The location of the line is decided by minimizing the motion compensation error, which is measured by the sum of squared differences (SSD), defined as

SSD = \sum_{(i,j) \in \text{current block}} ( C_{ij} - R_{i+h, j+v} )^2    (5.1)

where Cij and Ri+h,j+v denote the pixels’ values of the current block and the reference block, respectively. The procedure of finding the optimal block partition can be summarized as follows: Based on a block’s width and height, divide it into strips. If the block’s width is larger than its height, divide it vertically. Otherwise, divide it in horizontal direction. Each vertical or horizontal strip consists of a column or a row of pixels. The lines between the strips are the candidate splitting lines for block partition. For each line, perform motion estimation for the two regions on each side of the line and find the motion estimation error (SSD) of each region. The line with the least value of the sum of the SSD of the two regions is the optimal location for block partition. In the video encoder, frames are split in an iterative way based on minimizing motion compensation error. To record the frame partition process, a binary tree [79] data structure is adopted here. It stores the index of the block that is being split and the position of the partition line during each step of the block partition. The first block that the binary tree block partition starts with is the whole frame. This is the root and the entry of the binary tree. The process of block partition is carried out repeatedly in the following steps: Step 1: Find the block in the binary tree with the maximum value of SSD.

For a P frame, the SSD is computed between the current block and the matched block in the reference frame. For a B frame, the SSD is the minimum of the two SSDs computed against the two reference frames.

Step 2: Split this block into two child blocks using the method described above. Add the two new blocks to the binary tree as children of the block they were split from, and store the corresponding MVs.

Step 3: Check whether the target number of blocks has been reached. If not, go to Step 1 for the next block partition.

Figure 5.2 illustrates a general run of the binary tree block partition.

[Figure 5.2 here: a block is split repeatedly into numbered sub-blocks, shown side by side with the corresponding binary tree of block indices.]

Figure 5.2 Partition of a block using binary tree.
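As an illustration of the split-line search described in Section 5.1.2, the following MATLAB sketch evaluates every candidate line of a block and runs a small full search for each of the two regions; the function names (find_best_split, region_ssd) and the interface are our own, not the encoder's actual code.

    % Sketch of the split-line search: for each candidate line, each of the
    % two regions gets its own motion search, and the line minimizing the
    % sum of the two regions' SSDs is chosen.
    function [best_pos, dir, best_cost] = find_best_split(cur, ref_frame, y0, x0, p)
        % cur       : current block (h-by-w)
        % ref_frame : full reference frame
        % (y0, x0)  : top-left corner of the block within the frame
        % p         : search parameter (search window side length is 2p+1)
        [h, w] = size(cur);
        if w > h
            dir = 'vertical';   n = w;
        else
            dir = 'horizontal'; n = h;
        end
        best_cost = inf;  best_pos = 1;
        for pos = 1:n-1                % candidate line between strips pos and pos+1
            if w > h
                r1 = cur(:, 1:pos);        o1 = [y0, x0];
                r2 = cur(:, pos+1:end);    o2 = [y0, x0 + pos];
            else
                r1 = cur(1:pos, :);        o1 = [y0, x0];
                r2 = cur(pos+1:end, :);    o2 = [y0 + pos, x0];
            end
            cost = region_ssd(r1, ref_frame, o1, p) + region_ssd(r2, ref_frame, o2, p);
            if cost < best_cost
                best_cost = cost;  best_pos = pos;
            end
        end
    end

    function s = region_ssd(reg, ref, org, p)
        % Full-search motion estimation for one region: minimum SSD over
        % displacements -p..p in each direction; out-of-frame candidates skipped.
        [h, w] = size(reg);  [H, W] = size(ref);  s = inf;
        for dv = -p:p
            for dh = -p:p
                y = org(1) + dv;  x = org(2) + dh;
                if y < 1 || x < 1 || y+h-1 > H || x+w-1 > W
                    continue;
                end
                d = double(reg) - double(ref(y:y+h-1, x:x+w-1));
                s = min(s, sum(d(:).^2));
            end
        end
    end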

5.1.3 Rate-Distortion Control

The Lagrangian cost function [78, 80] used in current video coding standards is adopted here to evaluate the trade-off between distortion and bit rate:

J = D + λ·R          (5.2)

where D is the distortion, evaluated by the mean squared error (MSE), and R is the number of bits for encoding the frame. The trade-off between bit rate and distortion is controlled by the value of λ. Since P and B frames are generated by motion compensation from the reference frames, their quality is influenced by the quality of those reference frames. Thus, different values of λ are assigned to I, P, and B frames; the λ for P frames is denoted λP and the λ for B frames is denoted λB. In the design of the coding system, the target PSNR of a P frame is 0.7 dB below that of its preceding I frame, and the target PSNR of a B frame is 0.3 dB below that of its closest P frame. I frames are encoded with H.264 intra-coding techniques, which use a user-defined quantization parameter (QP) [81] to control quality. In the H.264 standard, QP takes values between 0 and 51, and a lookup table maps each QP to a quantization step size. The smaller the QP, the finer the quantization and the better the compressed frame quality. Through the PSNR targets above, QP therefore also governs the quality of the P and B frames in a video sequence.

Initializing and updating of λ

λP and λB need to be initialized before encoding a video sequence. The initialization of λP starts with a predefined value, for example 100. The first P frame is encoded, and the quality of the resulting compressed frame is compared with the target PSNR. The value of λP is adjusted accordingly and the P frame is encoded again using the updated λP. The process is repeated until a maximum number of attempts, for example 10, is reached, or until a value of λP that satisfies the target PSNR is found. The initialization of λB is carried out in a similar way, with a larger initial value, for example 200, and a target PSNR 0.3 dB below that of the P frames. Since the video content and its characteristics change over time, the value of λ must be updated accordingly to maintain the relative quality relationship between frames. After the encoding of each P frame, the values of λP and λB are updated.
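As a sketch of this initialization, the loop below adjusts λP until the target PSNR is approximately met. Here encode_P_frame and i_psnr are hypothetical stand-ins for the encoder call and the previous I frame's PSNR, and the multiplicative update and tolerance are our own assumptions, since the text says only that λP is "adjusted accordingly".

    % Sketch of lambda_P initialization (the update rule and tolerance are
    % assumptions; encode_P_frame is a hypothetical encoder call).
    lambda_p    = 100;            % predefined starting value from the text
    target_psnr = i_psnr - 0.7;   % P-frame target: 0.7 dB below the I frame
    for iter = 1:10               % at most 10 attempts, as in the text
        psnr = encode_P_frame(lambda_p);
        if abs(psnr - target_psnr) < 0.05   % assumed tolerance
            break;
        elseif psnr > target_psnr
            lambda_p = lambda_p * 1.25;     % quality too high: weigh rate more
        else
            lambda_p = lambda_p / 1.25;     % quality too low: weigh distortion more
        end
    end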

5.1.4 Entropy Coding and Transform Coding

A range coder is adopted in the video codec for entropy coding; it is an integer-based implementation of arithmetic coding. The method proceeds almost exactly as arithmetic coding does, except that it uses a histogram instead of a probability model. The residual images are encoded using matching pursuit (MP) [82-84]. In the matching pursuit technique, atoms are placed at the positions where they provide the greatest reduction in error. This is especially suitable for residual images, which usually have a large percentage of pixel values close to zero and localized signals around the edges of moving objects. Compared with the DCT, MP requires relatively fewer bits to approximate the largest error signals, and thus performs especially well for low bit rate video.

5.1.4.1 Arithmetic Coding

For Huffman coding [85, 86], the code words are of integer length. When the probability of a symbol is not a negative power of two, Huffman coding is therefore not efficient and cannot achieve the maximum compression ratio.
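For example, for a binary source with symbol probabilities 0.9 and 0.1, Huffman coding must still spend an integer number of bits, at least one, on each symbol, for an average of 1 bit/symbol, whereas the source entropy is only

H = -0.9·log2(0.9) - 0.1·log2(0.1) ≈ 0.469 bits/symbol,

so nearly half of the coded bits are redundant in this case.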

Arithmetic coding [87, 88] overcomes this problem by using code words of non-integer length, and thus generally achieves a higher compression ratio. The idea behind arithmetic coding is a probability line covering the range [0, 1); based on its probability, each symbol is assigned a range on this line. Arithmetic coding is implemented sequentially, symbol by symbol, and after the encoding process a single code word is generated for the whole input symbol stream. A brief description of the arithmetic coding process is given in Appendix A. A variant of arithmetic coding is range coding [89, 90], which is mathematically very similar. Instead of using an interval within [0, 1) as in arithmetic coding, range coding uses a large integer range to represent a sequence of symbols. In addition, a re-normalization operation is adopted in range coding for speed, with marginal degradation of coding performance. Open-source implementations of range coding are readily available.

5.1.4.2 Matching Pursuit (MP)

Matching pursuit is an efficient method for image encoding. In recent years, it has been widely used in video processing for encoding residual information, and it has been applied in a variety of fields including pattern recognition, video/audio coding, and medical imaging. Matching pursuit was first put forward by Mallat and Zhang [91]. It is an iterative approach for approximating a signal over a dictionary which is usually chosen to be over-complete. A greedy optimization strategy is employed during each

iteration of the method: the combination of dictionary function and weight that provides the greatest error reduction is chosen for the signal representation. The matching pursuit (MP) [91, 92] algorithm gives a sub-optimal solution to the problem of adaptively approximating a signal on a redundant set (dictionary) of functions. In matching pursuit, a dictionary consists of basis functions at various scales; one commonly used dictionary is a set of Gabor functions [93].

For matching pursuit coding, a picture is approximated in the following steps. First, a matching is performed between the image and the atoms, which are derived from the MP dictionary by translating the basis functions and multiplying them by weighting factors; during the matching, an atom may be translated to any location within the image. Second, the atom that leaves the minimum residual energy when subtracted from the image is found, and the index of its basis function in the dictionary, its weighting factor, and its translation vector are encoded. Finally, the weighted and translated atom is subtracted from the image, and the process is repeated on the residual image. The process continues until the energy of the residual image falls below a threshold or the total number of coded atoms reaches a target number. The key advantage of matching pursuit is that, owing to the over-complete set of basis functions, the video signal can be represented with fewer coefficients than with an orthogonal transform. This is especially valuable for low bit rate video coding, since relatively few bits are needed to encode the error signals [94].
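The greedy loop just described can be sketched in MATLAB as follows for a one-dimensional residual; the small Gaussian dictionary stands in for the Gabor dictionary of [93], and all names and sizes here are illustrative assumptions.

    % Sketch of matching pursuit on a 1-D residual signal.
    % Dictionary: unit-norm Gaussian atoms at several scales (a stand-in
    % for the Gabor dictionary); atoms may be translated to any position.
    N = 64;                                 % signal length
    res = randn(1, N);                      % residual to be coded (demo input)
    scales = [1 2 4 8];
    atoms = {};
    for s = scales
        g = exp(-((-4*s:4*s).^2) / (2*s^2));
        atoms{end+1} = g / norm(g);         % unit-norm basis function
    end
    coded = [];                             % rows: [atom index, position, weight]
    for it = 1:10                           % code a fixed number of atoms
        best = struct('err', inf);
        for a = 1:numel(atoms)
            g = atoms{a};  L = numel(g);
            for pos = 1:N-L+1               % translate atom across the signal
                seg = res(pos:pos+L-1);
                w = seg * g';               % optimal weight = inner product
                err = norm(seg - w*g)^2 - norm(seg)^2;  % change in energy
                if err < best.err
                    best = struct('err', err, 'a', a, 'pos', pos, 'w', w);
                end
            end
        end
        g = atoms{best.a};  L = numel(g);
        res(best.pos:best.pos+L-1) = res(best.pos:best.pos+L-1) - best.w * g;
        coded(end+1, :) = [best.a, best.pos, best.w];  % index, shift, weight
    end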

5.2 Binary Tree Block Partition with Adaptive Search Area Adjustment

5.2.1 Method Overview

In the coding system proposed by Servais [77], the search parameter p is fixed for motion estimation, so the search window has a fixed side length of (2p+1). However, we observe that large blocks tend to have small motion. This may be attributed to the following: in a scene, the background usually stays still or moves much more slowly than the foreground objects, and a large block is more likely to contain a large proportion of background area and is thus more likely to yield a small MV from block-matching. In this case, a small search parameter for large blocks is preferred in order to reduce the computation of motion estimation. It follows that if the search parameter can be adjusted flexibly, the computation for motion estimation may be reduced with little sacrifice of performance. Since the area of the search window is proportional to the square of its side length, which is determined by the search parameter p, a straightforward approach is to adjust the search parameter according to the block area. Here, the pre-defined value of p is selected from the values commonly used for macro-block motion estimation. In our method, we adaptively decide the value of the search parameter for each block according to the following rules. First, we define the symbols used in the rules:

Sm: the area of a macro-block, which is 16×16.
P: the pre-defined search parameter value.
Bw: the width of the current block being split.

Bh: the height of the current block being split.
Sb: the area of the current block, which is Bw×Bh.
Pb: the value of the search parameter for the current block.

if Sb ≤ Sm
    Pb = P
else
    R1 = P * sqrt(Sm/Sb)
    Pb = min{round(R1), P}
end

The method is based on the following idea. If the block's area is no larger than that of a macro-block, the pre-defined search parameter is used. When the block's area is larger, say N times that of a macro-block, then, since area is proportional to the square of the side length, the search parameter Pb is set to P multiplied by the square root of 1/N. If the computed value is not an integer, it is rounded to the nearest integer. For example:

For a block of size 32×64 and P = 16:
    N = (32×64)/(16×16) = 8,  Pb = R1 = round(16×sqrt(1/8)) = 6

For a block of size 8×64 and P = 16:
    N = (8×64)/(16×16) = 2,  Pb = R1 = round(16×sqrt(1/2)) = 11
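The rule is compact enough to state directly in MATLAB; the function name adapt_p1 is ours, not from the original code. It reproduces the two worked examples above.

    % Sketch of the area-based search-parameter rule (method1).
    function Pb = adapt_p1(Bw, Bh, P)
        Sm = 16*16;        % macro-block area
        Sb = Bw*Bh;        % current block area
        if Sb <= Sm
            Pb = P;
        else
            Pb = min(round(P*sqrt(Sm/Sb)), P);
        end
    end

    % adapt_p1(32, 64, 16) returns 6; adapt_p1(8, 64, 16) returns 11.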

To further reduce the number of search points and save computation during motion estimation, we also propose a second method, which shrinks the search parameter even faster than the method above. In the remainder of this chapter, the method described above is denoted method1, and the method described below is denoted method2, given in pseudo code as

Method2:
if Sb ≤ Sm
    Pb = P
else
    R1 = P * (Sm/Sb)
    if min{Bw, Bh} ... 1)

[The source is garbled here: the remainder of the method2 pseudo code is lost, and the extracted text jumps from page 134 to the appendix MATLAB listing on pages 198-199, which resumes below.]

y = (i-1)*bsize+1;   % Row of image
x = (j-1)*bsize+1;   % Column of image
rmb_y = y + mvy(i,j);
rmb_x = x + mvx(i,j);
block_C = image_C(y:y+bsize-1, x:x+bsize-1);
D1 = img_diff(block_C, ...
    image_R(rmb_y:rmb_y+bsize-1, rmb_x:rmb_x+bsize-1), bsize);
rmb_x = x + KF(end).x(1,:);
rmb_y = y + KF(end).x(2,:);
if (floor(rmb_y) == rmb_y && floor(rmb_x) == rmb_x)   % for integer values
    if (rmb_x < 1 || rmb_x > dx || rmb_y < 1 || rmb_y > dy)   % out-of-frame check; comparison operators reconstructed
        temp_block = expend_block(image_R, rmb_y, rmb_x, bsize);
        D2 = img_diff(block_C, temp_block, bsize);
    else
        D2 = img_diff(block_C, ...
            image_R(rmb_y:rmb_y+bsize-1, rmb_x:rmb_x+bsize-1), bsize);
    end

else   % For non-integer values
    block_comp = motion_comp_one_block(image_R, i, j, KF(end).x(1,:), ...
        KF(end).x(2,:), bsize);
    block_comp = ceil(block_comp);
    D2 = img_diff(block_C, block_comp, bsize);
end
array_d1(i,j) = D1;
array_d2(i,j) = D2;
% In case D1 = D2 = 0
if (D1 == 0 && D2 == 0)
    kal_q = 0.5;
    kal_r = 0.5;
else
    if (ave_std T1 && ave_std T2 && ave_std T3 && ave_std T4 && ave_std T5 && ave_std

[The comparison operators against the thresholds T1-T5, the associated assignments to kal_q and kal_r, and the surrounding code are lost in the extraction; the fragment breaks off with the remnant of a pixel-range check:]

255 || min(min(frame_fil_comp))
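The truncated cascade above appears to select the Kalman process- and measurement-noise parameters (kal_q, kal_r) by comparing a local activity measure (ave_std) against the thresholds T1-T5, in the spirit of the adaptive state-parameter adjustment described in the earlier chapters. Since the comparison operators and assigned values are lost in the source, the following is only a hedged sketch of that idea; the thresholds and the q/r values are our placeholders, not the dissertation's.

    % Hedged sketch: map a local activity measure to Kalman noise parameters.
    % All numeric values below are illustrative placeholders.
    function [kal_q, kal_r] = select_qr(ave_std, D1, D2)
        T = [2 4 8 16 32];               % assumed thresholds T1..T5
        q = [0.1 0.2 0.4 0.6 0.8 0.9];   % assumed process-noise weights per band
        if D1 == 0 && D2 == 0
            kal_q = 0.5;  kal_r = 0.5;   % as in the source listing
            return;
        end
        k = find(ave_std <= T, 1);       % first threshold band containing ave_std
        if isempty(k), k = numel(q); end
        kal_q = q(k);
        kal_r = 1 - kal_q;               % assumed complementary measurement noise
    end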