
IEEE Transactions on Consumer Electronics, Vol. 62, No. 1, February 2016

Selective Gray-Coded Bit-Plane Based Low-Complexity Motion Estimation and its Hardware Architecture

Seda Yavuz, Anıl Çelebi, Member, IEEE, Muhammad Aslam, Oğuzhan Urhan, Member, IEEE

Abstract: Today, many consumer electronics devices have video capturing capability, which is one of the most time-, power- and memory-consuming applications. Motion estimation (ME) is the key part of the video coding process in terms of computational load. Thus, it is important to implement this process in a resource-efficient way without degrading encoding quality or real-time performance. Low bit-depth representation based ME methods have drawn a lot of attention in the consumer electronics area, mainly thanks to their highly efficient hardware and software implementations. However, these methods generally assume that the low bit-depth images are already available, and they simply neglect the binarization cost, which is not a proper approach when the whole encoding architecture is of concern. This paper presents a novel selective Gray-coding based ME method and its hardware architecture, with an embedded system integration that makes use of one of the most common interconnect architectures in consumer electronics devices. Experimental results show that the proposed approach significantly reduces the computational load of the binarization stage while improving ME accuracy compared to methods in the same category¹.

Index Terms: Motion estimation, Gray-coding, One-bit transform, Low-complexity ME.

¹ S. Yavuz, A. Çelebi and M. Aslam are with Kocaeli University Integrated Systems Laboratory (KUTSAL), Electronics and Telecom. Eng. Dept., Umuttepe Campus, 41380, İzmit/Kocaeli, Turkey (e-mail: [email protected]). O. Urhan is with Kocaeli University Laboratory of Embedded and Vision Systems (KULE), Electronics and Telecom. Eng. Dept., Umuttepe Campus, 41380, İzmit/Kocaeli, Turkey (e-mail: [email protected]).

Contributed Paper. Manuscript received 12/31/15. Current version published 03/30/16. Electronic version published 03/30/16.

0098 3063/16/$20.00 © 2016 IEEE

I. INTRODUCTION

The number of devices with video capturing capability is increasing every day. Smart phones and tablets in particular are extensively used to capture and share video data. Efficient compression methods are clearly needed to store these videos in limited memory. Transmission of captured raw video requires compression as well, in order to use the available network bandwidth efficiently. Since the introduction of the first video coding methods, the motion compensated hybrid coding approach has been extensively utilized. Today, H.264/AVC [1] and its successor HEVC [2] (High Efficiency Video Coding) standards are also developed based on the same concept, where intra-frame redundancies are exploited by intra prediction and transform coding, whereas block-based motion estimation techniques are employed to take advantage of temporal redundancies. Statistical redundancies are exploited by entropy coding techniques such as CAVLC (context-adaptive variable-length coding) and CABAC (context-adaptive binary arithmetic coding). It is important to note that the ME part is generally the most time-consuming stage in a video encoder [3].

In block-based ME, each frame is divided into non-overlapping blocks, and each block in the current frame is searched within a wider area around the same location in the reference frame(s), which is referred to as the search window. The sum of squared differences (SSD) or sum of absolute differences (SAD) criterion is utilized to measure the similarity between the current and candidate blocks. Since the current block is searched at all possible candidate locations within the search range, the computational complexity of this process is quite high. This method is referred to as full-search (FS) based ME because all candidate locations are checked.

There are several groups of approaches in the literature that reduce the computational load and hardware complexity of full-search based ME. The first group checks only a subset of the candidate locations in the search window. Three-step search [4], diamond search [5] and hexagonal search [6] based ME methods are members of this category, where only pre-defined search locations are checked. Adaptive search range determination based approaches, such as the method presented by Lee et al [7], can be put into this group as well, since only a limited number of candidates are checked based on a pre-decided search range for each block.

The second group reduces the number of pixels used for computing the matching criterion by means of a specific sub-sampling pattern such as quarter [8], quincunx [9], 8-Queen [10] and reconfigurable boundary [11].

The third group aims to skip computation of the matching criterion for specific or all remaining candidate locations. For example, successive elimination algorithm (SEA) based methods, such as the approach presented by Li et al [12], compute a lower bound of the matching criterion at lower complexity and thus skip impossible candidates before computing the full matching criterion for them. Early termination methods, such as the scheme proposed by Yang et al [13], pursue the same target by checking the partial matching result against the lowest matching error currently available. Thus, it becomes possible to eliminate impossible candidate locations without computing the full matching criterion for the current block.

The last group of approaches [14]-[23] utilizes lower complexity matching criteria than SSD or SAD. These methods are generally referred to as bit-plane matching (BPM) based techniques, where image frames are represented at a lower bit-depth and Boolean operations are used to compute the matching criteria. Boolean operations are known to be carried out effectively in hardware implementations. Since this group of approaches and their hardware implementations are the focus of this paper, they are discussed in detail in the following section. It is important to note, however, that BPM based ME methods would also make it possible to increase the efficiency of the single instruction multiple data (SIMD) infrastructure, which is available in almost all of the processing resources of current consumer electronics devices.

It is also possible to combine different groups of methods to further speed up the ME process [24]-[31]. The low bit-depth representation based approaches explained above have been combined with sparse search [24], [25], early termination [26]-[28] and adaptive search [29]-[32] based techniques. These combinations might prevent efficient data scheduling in the case of hardware implementation. Another group of techniques performs an additional local search around the best matching results of the BPM based method by making use of the SAD criterion [33]-[36].
Since the binary nature of the method is degraded, these kinds of approaches are not suitable for efficient hardware implementations.

A novel binarization technique and its hardware architecture are proposed in this paper. The presented ME method benefits from the easy binarization and efficient pixel representation properties of Gray coding by constructing a single bit-plane with a novel selection scheme. Thus, the proposed method provides superior ME performance compared to many existing multi-bit-depth BPM based ME methods. The hardware architecture developed for the proposed ME method can operate in real time with no on-chip memory requirement for the current and reference blocks.

II. LOW BIT-DEPTH REPRESENTATION BASED ME METHODS

In full-search block-based motion estimation approaches, image frames are divided into non-overlapping blocks and each block is searched within a search window in the reference frame. Let $I^c$ and $I^r$ denote the current and reference image frames, respectively. Then the motion vector of a block of size N×N pixels can be decided as follows:

$$\mathrm{SSE}(m,n) = \sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\left[I^c(i,j) - I^r(i+m,\,j+n)\right]^2, \qquad -s \le m,n \le s \tag{1}$$

where (m,n) denotes the candidate motion vector and s determines the search range. The candidate motion vector giving the lowest matching error (SSE) is assigned as the motion vector of the block. As briefly described in the previous section, checking all possible candidate locations in the search window using SSD or SAD based matching criteria as in (1) causes a significant computational burden. Low bit-depth representation based methods aim to utilize low-complexity matching criteria by reducing the number of bits (bit-planes) used to represent image frames. Thus, the overall complexity of motion estimation can be reduced.

Feng et al [14] presented bit-plane matching based motion estimation as a preprocessing step to speed up the overall motion estimation process, where the block mean ($T_{bm}$) is utilized as a threshold for constructing binary image frames. In the method presented by Natarajan et al [15], image frames are initially filtered with a multi-band-pass filter, and then the original image frame is compared against the filtered image frame to determine the binary representation of the input frame. After the binarization step, motion vectors are decided based on the number of non-matching points (NNMP) criterion as follows:

$$\mathrm{NNMP}(m,n) = \sum_{i=0}^{N-1}\sum_{j=0}^{N-1} B^c(i,j) \oplus B^r(i+m,\,j+n) \tag{2}$$

where $B^c$ and $B^r$ denote the binary forms of the current frame ($I^c$) and the reference frame ($I^r$), obtained by comparing the original image frame against the filtered frame, and $\oplus$ denotes the Boolean EX-OR operation. The candidate location giving the lowest NNMP value is decided as the motion vector of the block. Natarajan et al [15] also presented a hardware architecture to illustrate the effectiveness of this matching criterion. However, the cost of the binarization process is not assessed in that work; the binary frames are assumed to be already available for the block matching process. It is important to note that this method (i.e. one-bit transform, 1BT, based ME) requires a total of 25 integer additions, 1 real division and 1 comparison per pixel to obtain the corresponding binary image frame.

A diamond-shaped binarization kernel which avoids the real division operation is proposed by Ertürk [16]. This kernel includes 16 non-zero components, and thus the normalization is carried out with an integer shift operation. This method is referred to as multiplication-free one-bit transform (MF-1BT) based ME, and its binarization requires 16 integer additions, one 4-bit shift and 1 comparison per pixel. It is shown by Ertürk [16] that MF-1BT provides ME accuracy similar to that of 1BT based ME [15].
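The two matching criteria above can be compared side by side with a small sketch. The following pure-Python example is illustrative only and is not taken from the paper: the function names, the toy frames and the choice to skip candidates that reach outside the frame are assumptions. It evaluates (1) and (2) inside the same full-search loop, which makes it visible that the only change in BPM based ME is replacing the squared difference with an XOR on one-bit pixels.

```python
def sse(cur, ref, r0, c0, m, n, N):
    """Eq. (1): sum of squared errors for candidate displacement (m, n)."""
    return sum((cur[r0 + i][c0 + j] - ref[r0 + i + m][c0 + j + n]) ** 2
               for i in range(N) for j in range(N))


def nnmp(cur_bits, ref_bits, r0, c0, m, n, N):
    """Eq. (2): number of non-matching points; XOR replaces the subtraction."""
    return sum(cur_bits[r0 + i][c0 + j] ^ ref_bits[r0 + i + m][c0 + j + n]
               for i in range(N) for j in range(N))


def full_search(cur, ref, r0, c0, N, s, cost):
    """Check every (m, n) with -s <= m, n <= s; return (best_mv, best_cost)."""
    rows, cols = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for m in range(-s, s + 1):
        for n in range(-s, s + 1):
            # Assumption: candidates reaching outside the frame are skipped.
            if r0 + m < 0 or c0 + n < 0 or r0 + m + N > rows or c0 + n + N > cols:
                continue
            c = cost(cur, ref, r0, c0, m, n, N)
            if best_cost is None or c < best_cost:
                best_mv, best_cost = (m, n), c
    return best_mv, best_cost


# Toy data: every reference pixel is distinct, and the current frame is the
# reference displaced by (2, -1) (rows, columns) with wrap-around at the borders.
SIZE = 12
ref = [[r * 16 + c for c in range(SIZE)] for r in range(SIZE)]
cur = [[ref[(r + 2) % SIZE][(c - 1) % SIZE] for c in range(SIZE)]
       for r in range(SIZE)]

mv, err = full_search(cur, ref, 4, 4, 4, 3, sse)

# One-bit frames for the NNMP demo (bit 4 of each pixel; an arbitrary choice).
cur_b = [[(v >> 4) & 1 for v in row] for row in cur]
ref_b = [[(v >> 4) & 1 for v in row] for row in ref]
mv_b, err_b = full_search(cur_b, ref_b, 4, 4, 4, 3, nnmp)
```

On this toy frame the SSE search recovers the displacement exactly. The one-bit search also reaches zero mismatches, but an arbitrary single bit can be periodic and leave several candidates tied, which is one reason the choice of binarization matters for ME accuracy.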

Fig. 1. Gray-coded bit-planes of Foreman frame #8 (panels g7, g6, g5, g4, g3, g2, g1, g0).

Ertürk et al [17] proposed to utilize two bit-planes for the ME process. The first bit-plane is constructed similarly to the approach presented by Feng et al [14], whereas the second one is computed from the mean and standard deviation of a larger block around the current block. The binarization cost of this method (2BT) is significantly high, as illustrated in Section V. Another two-bit-plane representation is proposed by Urhan et al [18], where the first bit-plane is computed as in MF-1BT based ME and the second is constructed as a constraint mask that decides which pixels are reliable enough to be included in the matching criterion. The matching criterion (constrained NNMP) of this approach also requires three Boolean operations, similar to the 2BT based method.

Gray-coded bit-plane matching (GCBPM) for global motion estimation is proposed by Ko et al [21] and is then applied to motion estimation for video coding by Urhan et al [22]. The K-bit Gray code of a pixel value can be computed as

$$g_{K-1} = a_{K-1}, \qquad g_k = a_k \oplus a_{k+1}, \quad 0 \le k \le K-2 \tag{3}$$

where a denotes the natural binary code of the pixel value. The matching criterion (MC) for [21] is similar to (2) with a fixed Gray-coded bit-plane. On the other hand, the MC for the method presented by Çelebi et al [23] is computed as

$$\mathrm{MC}(m,n) = \sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\sum_{k=NTB}^{K-1} 2^{\,k-NTB}\left\{ g_k^c(i,j) \oplus g_k^r(i+m,\,j+n) \right\} \tag{4}$$

where NTB denotes the number of truncated bits. It is shown that the best ME results are obtained when NTB=5, which means that only the 3 most significant bit-planes are utilized in the matching process. The method presented by Çelebi et al [23] is called T-GCBPM based ME, and it outperforms 1BT, MF-1BT, 2BT and C-1BT based approaches, mainly because three bit-planes are utilized, similar to [20].

Kuo et al [37] proposed to utilize an interlaced Gray-coding pattern to obtain a single bit-plane for global motion estimation. This approach enables lower complexity binarization than the other low-complexity ME methods except T-GCBPM, since a selection operation is required for the interlacing process. Our experiments revealed that this method has ME accuracy similar to that of 1BT based ME when it is applied to video coding. In this paper, a novel selection and placement scheme for Gray-coded bit-planes is proposed to further improve ME accuracy compared to the method presented by Kuo et al [37]. After the proposed binarization process, (2) is employed to decide the motion vector.

III. PROPOSED BINARIZATION APPROACH

As described in the previous section, the first step of BPM based ME methods is to convert full bit-depth image frames into a lower bit-depth representation. Then, motion estimation is performed using a suitable matching criterion and search range. The main advantages of BPM based ME methods are their higher speed and their smaller area and power footprint in hardware implementations. Efficient hardware architectures for BPM based ME methods have been presented in many recent works [38]-[43]. However, the binarization and its hardware cost are neglected in the case of 1BT, MF-1BT, 2BT, C-1BT and WC-1BT based ME. The T-GCBPM based method has a significant advantage, since its binarization can be implemented using simple EX-OR operations or look-up tables (LUTs).
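To make the "simple EX-OR operations or look-up tables" remark concrete, here is a minimal sketch of the Gray coding in (3) and of the truncated criterion in (4). The helper names and the 8-bit pixel assumption are illustrative and not from the paper. A single shift and XOR computes all eight Gray bits at once, and the weighted sum over planes NTB to K-1 in (4) reduces to a right shift of the XOR-ed Gray codes.

```python
def gray_code(a):
    """Eq. (3) for all bits at once: g_k = a_k XOR a_(k+1), g_(K-1) = a_(K-1)."""
    return a ^ (a >> 1)


# LUT alternative for 8-bit pixels: one table read per pixel.
GRAY_LUT = [gray_code(v) for v in range(256)]


def gray_bit_plane(frame, k):
    """Extract Gray-coded bit-plane g_k of a frame given as a list of rows."""
    return [[(GRAY_LUT[v] >> k) & 1 for v in row] for row in frame]


def tgcbpm_cost(cur, ref, r0, c0, m, n, N, K=8, NTB=5):
    """Eq. (4): mismatches on planes NTB..K-1, weighted by 2**(k - NTB)."""
    total = 0
    for i in range(N):
        for j in range(N):
            diff = GRAY_LUT[cur[r0 + i][c0 + j]] ^ GRAY_LUT[ref[r0 + i + m][c0 + j + n]]
            for k in range(NTB, K):
                total += (1 << (k - NTB)) * ((diff >> k) & 1)
    return total


def tgcbpm_cost_shift(cur, ref, r0, c0, m, n, N, NTB=5):
    """Same value as tgcbpm_cost for 8-bit pixels, using one shift per pixel."""
    return sum((GRAY_LUT[cur[r0 + i][c0 + j]]
                ^ GRAY_LUT[ref[r0 + i + m][c0 + j + n]]) >> NTB
               for i in range(N) for j in range(N))
```

For example, pixel values 127 and 128 differ in all eight natural binary bits but in only one Gray-coded bit (g7), so `tgcbpm_cost([[127]], [[128]], 0, 0, 0, 0, 1)` evaluates to 4, the weight of plane 7 when NTB=5.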


As described in Section II, the Gray-coded bit-plane matching based methods [21] and [23] employ a pre-selected single bit-plane and several pre-selected bit-planes, respectively. Fig. 1 shows the eight Gray-coded bit-planes of an image frame from the Foreman sequence. As seen from this figure, the higher bit-planes contain most of the information available in the original frame. However, a single Gray-coded bit-plane evaluated on its own does not provide enough information about the original frame. Since the method in [21] utilizes only one fixed Gray-coded bit-plane, its ME performance may not be adequate for different image contents. On the other hand, because only a single bit-plane is utilized, the overall computational complexity at the matching stage is lower. The T-GCBPM based method employs the 3 most significant bit-planes (i.e. g7, g6, g5) to represent images and thus provides better performance at the expense of additional computational complexity in the matching stage.

In this paper, we propose a novel combination of the methods presented by Çelebi et al [23] and Kuo et al [37] to construct, for each candidate position, a single bit-plane which contains Gray-coded pixel values from the 3 most significant bit-planes. By the proposed selection and placement of the 3 most significant bits of the pixel Gray code into a single bit-plane for matching, it becomes possible to exploit the advantages of both methods. Note that the proposed method utilizes a different bit-plane selection and placement scheme for each candidate location, in contrast to the method presented by Kuo et al [37], where 4 bits are utilized in an interlaced fashion as shown in Fig. 2. The bit-plane selection approach proposed in this paper, which improves ME accuracy compared to [37], is shown in Fig. 3 for a 16×16 image block. Note that we construct binary image blocks for each candidate location separately.

The related works on GCBPM based ME [23], [43] show that the contribution of the five least significant bit-planes to ME accuracy is limited compared to that of the 3 most significant bit-planes.

Fig. 2. Bit-plane selection approach presented by Kuo et al [37].


Fig. 3. Proposed bit-plane selection approach for a 16×16 block.

Thus, we prefer not to include g4 in our selection scheme. Additionally, the distributed placement of bit-planes, compared to the method presented by Kuo et al [37], enables more accurate matching, since the distance between positions that select the same bit-plane is increased for neighboring pixels. Our experiments show that the proposed bit-plane selection and placement approach improves the ME accuracy of the method proposed by Kuo et al [37].

IV. HARDWARE ARCHITECTURE

Low-complexity ME methods are suitable for hardware implementation, as presented in the literature. Compared to the hardware architectures proposed for SAD based ME methods, they are expected to occupy a smaller chip area, by at least several orders of magnitude, since only a few bit-planes are utilized in BPM based ME methods. The power consumption and memory requirements of BPM based architectures are also expected to be lower than those of SAD based ME hardware architectures.

The hardware architecture proposed for the BPM based ME method developed in this work is shown in Fig. 4. A spiral search scheme, shown in Fig. 5, is utilized as the search method so that the architecture can later be extended with early termination or adaptive search range techniques. The main components of the architecture are the current block register array, the search window register array, the 2D processing element (PE) array, the parallel counter and the comparator. Note that the controller is not shown, since it is not an essential part of the proposed architecture. The most important building block of the proposed architecture is the MUX array placed between the register array and the 2D PE array, since the novel selection scheme is implemented by this block. The current block register array and the search window register array are composed of flip-flops with three- and four-direction shifting capabilities, similar to the architecture presented by Çelebi et al [43]. Since 3 bit-planes are needed for the selection process, 3 register arrays are utilized for both the current block and the search window.
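The visiting order of the spiral search in Fig. 5 can be sketched in software as follows. This is only an illustrative model with assumed names, not the register-level design: it starts at the center candidate and then moves up, left, down and right with run lengths 1, 1, 2, 2, 3, 3 and so on, which reproduces the candidate numbering 0 to 35 recoverable from Fig. 5. Each step changes the candidate by one pixel in one direction, which matches a register array that shifts by one position per candidate.

```python
def spiral_offsets(count):
    """Return the first `count` (dy, dx) candidate offsets of the spiral.

    `count` is expected to be at least 1. Offsets are relative to the
    zero-motion candidate; negative dy is upwards, negative dx is left.
    """
    dy, dx = 0, 0
    out = [(0, 0)]
    directions = [(-1, 0), (0, -1), (1, 0), (0, 1)]  # up, left, down, right
    run, d = 1, 0
    while len(out) < count:
        for _ in range(2):  # each run length is used for two directions
            sy, sx = directions[d % 4]
            for _ in range(run):
                dy, dx = dy + sy, dx + sx
                out.append((dy, dx))
                if len(out) == count:
                    return out
            d += 1
        run += 1
    return out


# A [-16, 16] search range contains 33 * 33 = 1089 candidates.
order = spiral_offsets(33 * 33)
```

Because small displacements are visited first, an early termination rule added later would tend to stop near the center, which is consistent with the stated motivation for choosing the spiral order.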

Fig. 4. Proposed hardware architecture.

    30  29  28  27  26  25
    31  12  11  10   9  24
    32  13   2   1   8  23
    33  14   3   0   7  22
    34  15   4   5   6  21
    35  16  17  18  19  20

Legend: shift register rotates upwards, left, downwards or right. Data flows in the reverse direction compared to the registers' routing direction.

Fig. 5. Spiral search diagram.

According to the proposed selection scheme, a 3-to-1 multiplexer is needed to construct the bit-plane that is utilized in the matching process. This functionality is implemented by the 3×1 MUX array shown in Fig. 4. The 16×16 center block of the search window register array is sent to the 2D PE array after the appropriate bit is selected for each pixel by the 16×16 MUX array. In the 2D PE array, the reference block and the current block are compared using the Boolean exclusive-OR (XOR) operation. A parallel counter is utilized to calculate the number of non-matching points (NNMP) metric for each candidate location. The parallel counter is composed of seven stages of sub parallel counters of size 3|2, 7|3, 15|4, 31|5, 63|6, 127|7 and 255|8, respectively. Each macroblock contains 256 pixels, but the parallel counter has 255 inputs. Experiments show that the absence of one pixel in the NNMP computation does not affect ME performance; therefore, to reduce complexity, one stage of addition for one pixel is ignored [43]. The last stage is the comparator, where the comparison operation is performed and the motion vector of the candidate block with the minimum NNMP is generated.

V. EXPERIMENTAL RESULTS

In general, an open-loop evaluation approach is used to assess the estimation performance of low bit-depth based ME methods: the current image frame is first estimated from the previous one, and then the Peak Signal-to-Noise Ratio (PSNR) between the original and estimated frames is computed. It would be possible to integrate these methods into a full encoder to see their effect on overall coding performance. However, in that case it may not be possible to evaluate the performance of the ME method alone, since other components of the encoder also affect coding performance. We are planning to investigate an encoder implementation of the proposed method as future work. In order to focus on the performance of the ME part, we have decided to utilize the open-loop scheme, similar to most of the low bit-depth based ME literature.

Table I shows PSNR results in dB for six sequences displaying different motion characteristics. The block size and search window are both set to 16 for the results given in this table. For a complete comparison among the methods falling into the same category, ME results of the 1BT [15], 2BT [17], MF-1BT [16], C-1BT [18], GCBPM [22] and T-GCBPM [23] based methods are also given. Additionally, ME results obtained with a single Gray-coded bit-plane are provided to show the advantage of the proposed selective Gray-coded bit-plane based method. As seen from Table I, when a single Gray-coded bit-plane is utilized, the ME performance depends significantly on the selected bit-plane and on the image frame characteristics. For example, in the single Gray-coded bit-plane case, the best ME performance is obtained from the 7th bit-plane for the Football sequence, whereas the 5th bit-plane provides the best results for the Coastguard sequence. Thus, it is not reasonable to utilize a single fixed bit-plane to represent different types of image frames efficiently.
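The open-loop evaluation described above can be sketched as two small helpers. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, frames are lists of rows of 8-bit luminance values, frame dimensions are multiples of the block size, and every motion vector points inside the reference frame.

```python
import math


def motion_compensate(ref, mvs, N):
    """Build the estimated frame from the reference frame.

    mvs[br][bc] is the (m, n) motion vector of the block in block-row br
    and block-column bc; each block is copied from its displaced position.
    """
    rows, cols = len(ref), len(ref[0])
    est = [[0] * cols for _ in range(rows)]
    for br in range(rows // N):
        for bc in range(cols // N):
            m, n = mvs[br][bc]
            for i in range(N):
                for j in range(N):
                    est[br * N + i][bc * N + j] = ref[br * N + i + m][bc * N + j + n]
    return est


def psnr(original, estimate, peak=255):
    """PSNR in dB between two equally sized frames."""
    count, squared = 0, 0
    for row_o, row_e in zip(original, estimate):
        for a, b in zip(row_o, row_e):
            squared += (a - b) ** 2
            count += 1
    if squared == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak * peak / (squared / count))
```

In this scheme the motion vectors found by any of the compared methods are plugged into `motion_compensate`, and the resulting PSNR against the original current frame is averaged over the sequence, so the binarization only influences the result through the motion vectors it produces.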
In order to assess the performance of the proposed selective Gray-coding based method, we investigate an additional selective Gray-coding based configuration together with the selection scheme presented by Kuo et al [37]. As described in the previous section, the method presented by Kuo et al [37] utilizes pixels from four different bit-planes (g7, g6, g5, g4) in a regular fashion to construct a single bit-plane, as shown in Fig. 2. In the second configuration (regular selection test pattern), pixels from three different bit-planes (g7, g6, g5) are utilized: the first column contains only pixels coming from the 7th bit-plane, while the 2nd and 3rd columns have pixels from the 6th and 5th bit-planes, respectively. In the proposed configuration, the g7, g6 and g5 bit-planes are utilized in a checkerboard style, which enables better ME accuracy than the regular selection test pattern, mainly because of the distributed utilization of the different bit-planes.
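The difference between a column-wise regular selection and a distributed selection can be illustrated with a short sketch. The regular selector below follows the description of the regular selection test pattern, assuming the three-column cycle simply repeats across the block. The distributed selector is purely hypothetical: it only shows one possible way of spreading g7, g6 and g5 over neighboring pixels and does not reproduce the exact pattern of Fig. 3, nor the per-candidate variation of the proposed scheme. All names are illustrative.

```python
def gray(v):
    """Gray code of an 8-bit pixel value."""
    return v ^ (v >> 1)


def regular_selector(r, c):
    """Regular selection test pattern: columns cycle through g7, g6, g5."""
    return 7 - (c % 3)


def distributed_selector(r, c):
    """HYPOTHETICAL distributed pattern (not the pattern of Fig. 3):
    the plane index also changes from row to row."""
    return 7 - ((r + c) % 3)


def selective_bit_plane(frame, selector):
    """Build one bit-plane by picking, per pixel, a single Gray-coded bit.

    This is the software counterpart of a 3-to-1 multiplexer per pixel.
    """
    return [[(gray(v) >> selector(r, c)) & 1 for c, v in enumerate(row)]
            for r, row in enumerate(frame)]


# Constant test frame: pixel value 64 has Gray code 0b01100000,
# i.e. g7 = 0, g6 = 1, g5 = 1.
frame = [[64, 64, 64], [64, 64, 64]]
regular = selective_bit_plane(frame, regular_selector)
distributed = selective_bit_plane(frame, distributed_selector)
```

With the regular selector every row repeats the same plane order, whereas the hypothetical distributed selector shifts the order by one position per row, so vertically adjacent pixels draw from different bit-planes. The resulting single bit-plane can then be matched with the NNMP criterion in (2).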

TABLE I. PSNR PERFORMANCE (IN DB) OF DIFFERENT LOW COMPLEXITY ME METHODS IN OPEN LOOP SCHEME

Video sequences (frame size, sequence length): Football (352×240, 125 frames); Foreman (352×288, …00 frames); Tennis (352×240, 150 frames); Flowergarden (352×240, 115 frames); Mobile (352×240, 300 frames); Coastguard (352×288, 300 frames).

    Method                             Football  Foreman  Tennis  Flowergarden  Mobile  Coastguard  Average
    SAD (8-bit depth)                  22.88     32.09    29.45   23.79         23.94   30.48       27.11
    1BT [15]                           21.83     30.32    28.11   23.31         23.61   29.83       26.17
    MF-1BT [16]                        21.81     30.38    28.18   23.26         23.63   29.88       26.19
    2BT [17]                           22.06     30.70    28.46   23.43         23.66   29.94       26.38
    C-1BT [18]                         22.10     30.86    28.71   23.38         23.69   29.98       26.45
    GCBPM [22]                         21.87     30.96    28.24   23.26         23.51   29.78       26.27
    T-GCBPM [23]                       22.59     31.32    28.78   23.67         23.81   30.16       26.72
    Gray Coding 7th Bit Plane          21.66     28.46    27.49   23.26         23.28   26.56       25.11
    Gray Coding 6th Bit Plane          20.79     27.92    27.34   22.56         22.42   27.84       24.81
    Gray Coding 5th Bit Plane          20.31     29.27    27.44   22.53         21.25   29.05       24.98
    Gray Coding 4th Bit Plane          19.54     28.70    26.52   20.35         20.48   28.23       23.97
    Interlaced Gray-coding [37]        21.94     30.92    28.47   23.17         23.18   29.79       26.25
    Regular Selection Test Pattern     22.09     30.62    28.46   23.29         23.33   29.38       26.22
    Selective Gray-coding (Proposed)   22.24     31.03    28.69   23.38         23.47   29.85       26.44

TABLE II. NUMBER OF OPERATIONS REQUIRED FOR THE LOW-COMPLEXITY ME APPROACHES

ME approach: 1BT [15], MF-1BT [16], 2BT [17], C-1BT [18], T-GCBPM [23], I-GCBPM [37], Proposed
Transform (pp): Addition, Multiplication, Shift, Subtraction, Comparison, Boolean Op.
Matching (pp): Boolean Op., Shift, Addition

25 1 1 1 16 1 1 1 2.8125 1.0625 0.03125 3 1 3 16 1 1 2 3 2 3 3 3 1 4 2.5 1 1 5.6 2 1 -

It is also important to note that the proposed selection approach provides 0.2 dB better PSNR on average than the method presented by Kuo et al [37]. This suggests that the contribution of the 4th Gray-coded bit-plane may not be positive, since it might contain some noisy binarization results. When the proposed selective Gray-coded bit-plane based method is compared against other single bit-plane based methods such as 1BT and MF-1BT, it outperforms them by around 0.3 dB on average. Compared against methods that use two bit-planes, such as 2BT and C-1BT, the proposed method provides similar or better ME performance in most of the sequences.

The computational complexity of the different methods is shown in Table II. As seen from this table, the proposed method has significantly lower complexity than the 1BT, MF-1BT, C-1BT and 2BT based approaches while providing similar or better performance. Since both the binarization and matching stages of the proposed method are computationally lightweight, it is suitable for efficient hardware and software implementations in mobile devices with limited computational and battery power.

The proposed hardware architecture is implemented on a 28 nm FPGA device. According to the synthesis results, the proposed architecture occupies 8747 LUTs and 7864 DFFs, which are 6.5% and 2.92% of the total available resources of the target FPGA device, respectively. The power and timing performance of the proposed hardware architecture is also analyzed to evaluate its efficiency compared to previously proposed architectures. The power analysis is performed at two different clock frequencies. Table III shows the power analysis results at clock periods of 20 ns and 10 ns. Three different motion characteristics are used to perform a fair comparison with the power consumption of the previously proposed architectures.

TABLE III. POWER ANALYSIS RESULTS

    Motion vector       Power consumption (mW), 20 ns / 10 ns
    hv_x     hv_y       Signals        Logic
    -3       -3         10/14          08/12
     3        2         12/17          10/14
     4        0         08/10          07/09
     8        7         11/16          10/14
     9       -7         08/12          07/10
     9        9         06/10          05/08
    -15      -11        12/19          11/16
     14      -14        09/12          08/11
    -12       12        06/10          05/08
    Average             9.11/13.3      7.89/11.3

There is no need for a dedicated memory for either the current block or the search window, thanks to the register arrays. Since dedicated memories occupy a smaller physical area, it might seem better to use them for storage. However, these blocks do not allow the four-way movement which is essential for implementing the spiral search scheme. Thus, instead of dedicated block RAM resources, DFFs are used in an array-like fashion. In the proposed hardware architecture, a Level-D data reuse scheme can be implemented thanks to the 4-way shift register array based memory implementation. The data reuse concept is illustrated in Fig. 6. A detailed investigation of the impact of data reuse capability on total memory bandwidth, and thus on the power consumption of ME hardware architectures, is performed in [45], where 4 levels of data reuse schemes are defined. According to [45], our architecture is capable of implementing the Level-D data reuse scheme, by which off-chip memory bandwidth can be reduced more than 20 times. Thus, low power consumption can be easily achieved in a whole video coding system that utilizes the proposed hardware architecture, since the power consumption of the core logic is much lower than that of an off-chip dynamic memory.

Fig. 6. The data reuse scheme that the proposed hardware architecture can implement.

According to Table IV, the proposed hardware architecture occupies the largest area in terms of number of LUTs, but it utilizes no on-chip memory. Since the proposed hardware architecture has the Level-D data reuse capability as stated in [44], it will result in the lowest off-chip memory bandwidth among the works presented in Table IV. It is important to note that none of the architectures given in this table, except the proposed one, includes a binarization data-path. Thus, they should not be considered turnkey solutions for the ME methods they implement.

In consumer electronics devices, video encoders are usually implemented as accelerators connected to the processing system via a bus interconnect, in order to offload the computational load of the encoding process from the processor. Following this approach, we have wrapped the proposed hardware architecture with a common bus interconnect to illustrate that it can be easily integrated into a state-of-the-art consumer electronics device. This concept is illustrated in Fig. 7, where a DMA block is utilized to provide dense data transfer between the sensor and the ME accelerator developed in this work. Once the data is received by the accelerator through a buffer-like memory interface, it performs the matching process and then informs the processor about the result through an interrupt-like interface. The hardware architecture takes 1089 clock cycles to compute a motion vector, excluding the memory transfer time, which is a technology-specific parameter.

Fig. 7. Integrated diagram of the proposed hardware architecture (blocks: Processor System, DMA IP, Motion Estimation IP; signals: Control Signals, Motion Vector, Video Stream, Data Stream).

TABLE IV. ME PERFORMANCE COMPARISON

                        Proposed              [39] Recompiled     [41]                  [44] Recompiled
    Bit depth           3                     1                   1                     2
    On chip memory      0                     24064               4096                  0
    Area                8125 LUTs/7353 DFFs   1121 LUTs/NA        3914 LUTs/2517 DFFs   5413 LUTs/NA
    Power               8.5 mW @ 50 MHz       35.3 mW @ 50 MHz    NA                    30.7 mW @ 50 MHz
    Maximum frequency   243 MHz               218 MHz             192 MHz               275 MHz
    Technology          FPGA 28 nm            FPGA 45 nm          FPGA 65 nm            FPGA 45 nm
    Search range        [-16, 16]             [-16, 16]           [-16, 16]             [-1, 1] to [-16, 16]
    Search method       Spiral search         Full search         Full search           Spiral search
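The reported 1089 clock cycles per motion vector equals 33 × 33, the number of candidates in a [-16, 16] search range, which is consistent with evaluating one candidate per clock cycle. The sketch below turns this into a rough throughput estimate. It is a back-of-the-envelope calculation under that one-candidate-per-cycle assumption; it ignores memory transfer time, and the function names and the choice of a 352×288 frame with 16×16 blocks are illustrative.

```python
def candidates(search_range):
    """Number of candidate positions in a [-s, s] x [-s, s] search range."""
    side = 2 * search_range + 1
    return side * side


def blocks_per_frame(width, height, block=16):
    """Number of whole block x block macroblocks in a frame."""
    return (width // block) * (height // block)


def frame_time_ms(width, height, clock_hz, search_range=16, block=16):
    """Matching time per frame in milliseconds.

    Assumption: one candidate is evaluated per clock cycle and memory
    transfer time is not included.
    """
    cycles = blocks_per_frame(width, height, block) * candidates(search_range)
    return 1000.0 * cycles / clock_hz


cycles_per_block = candidates(16)                 # 1089, as reported
cif_blocks = blocks_per_frame(352, 288)           # 22 * 18 = 396
cif_ms_at_50mhz = frame_time_ms(352, 288, 50e6)   # about 8.62 ms
```

Under these assumptions a 352×288 frame needs 396 × 1089 = 431,244 matching cycles, or roughly 8.6 ms at the 50 MHz clock used for the power figures in Table IV, which leaves headroom for real-time frame rates before memory transfers are taken into account.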


VI. CONCLUSIONS

In this paper, a selective Gray-coded bit-plane based binarization approach for low-complexity motion estimation is presented together with its hardware architecture. The proposed BPM based ME method outperforms the single bit-plane based methods in the literature while providing similar or better performance than methods that use two bit-planes. It is important to note that the selective Gray-coded bit-plane based method has the lowest binarization cost among the compared methods, except for the conventional Gray-coded BPM methods. The proposed binarization approach is efficiently implemented in hardware. It is shown that the proposed architecture is suitable for seamless integration into state-of-the-art consumer electronics devices by making use of a common bus interconnect. Experimental results reveal that the proposed architecture is capable of data reuse, which dramatically reduces both off-chip data access time and power consumption.

REFERENCES

[1] Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, “Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264/ISO/IEC 14496-10 AVC),” JVT-G050, Mar. 2003.
[2] ISO/IEC 23008-2:2013, High efficiency coding and media delivery in heterogeneous environments -- Part 2: High efficiency video coding, International Organization for Standardization, 2013-11-25.
[3] T.C. Chen, Y.H. Chen, S.F. Tsai, S.Y. Chien, L.G. Chen, “Fast algorithm and architecture design of low-power integer motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 5, pp. 568-577, May 2007.
[4] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, T. Ishiguro, “Motion compensated interframe coding for video conferencing,” in Proc. Nat. Telecommun. Conf., pp. C9.6.1-C9.6.5, 1981.
[5] S. Zhu, K.K. Ma, “A new diamond search algorithm for fast block-matching motion estimation,” IEEE Trans. Image Process., vol. 9, no. 2, pp. 287-290, Feb. 2000.
[6] C. Zhu, X. Lin, L.P. Chau, “Hexagon-based search pattern for fast block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 5, pp. 349-355, May 2002.
[7] J. Lee, M. Choi, Y. Cho, J. Kim, W.K. Cho, “Fast H.264/AVC motion estimation algorithm using adaptive search range,” in Proc. 12th International Symposium on Integrated Circuits (ISIC '09), Singapore, pp. 336-339, Dec. 2009.
[8] M. Bierling, “Displacement estimation by hierarchical block matching,” in Proc. SPIE Conference on Visual Communications and Image Processing, San Jose, CA, USA, pp. 942-951, Oct. 1998.
[9] K. Lengwehasatit, A. Ortega, “Probabilistic partial-distance fast matching algorithms for motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 2, pp. 139-152, Feb. 2001.
[10] C.N. Wang, S.W. Yang, C.M. Liu, T. Chiang, “A hierarchical n-queen decimation lattice and hardware architecture for motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 4, pp. 429-440, Apr. 2004.
[11] A. Saha, J. Mukherjee, S. Sural, “New pixel-decimation patterns for block matching in motion estimation,” Signal Process.-Image Commun., vol. 23, no. 10, pp. 725-738, Oct. 2008.
[12] W. Li, E. Salari, “Successive elimination algorithm for motion estimation,” IEEE Trans. Image Process., vol. 4, no. 1, pp. 105-107, Jan. 1995.
[13] L. Yang, K. Yu, J. Li, S. Li, “An effective variable block-size early termination algorithm for H.264 video coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 6, pp. 784-788, June 2005.
[14] J. Feng, K.T. Lo, H. Mehrpour, A.E. Karbowiak, “Adaptive block matching motion estimation algorithm using bit plane matching,” in Proc. IEEE Int. Conf. on Image Processing (ICIP), Washington DC, USA, pp. 496-499, Oct. 1995.


[15] B. Natarajan, V. Bhaskaran, and K. Konstantinides, “Low-complexity block-based motion estimation via one-bit transforms,” IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 4, pp. 702-706, Aug. 1997.
[16] S. Ertürk, “Multiplication-free one-bit transform for low-complexity block-based motion estimation,” IEEE Signal Process. Lett., vol. 14, no. 2, pp. 109-112, Feb. 2007.
[17] A. Ertürk and S. Ertürk, “Two-bit transform for binary block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 7, pp. 938-946, July 2005.
[18] O. Urhan and S. Ertürk, “Constrained one-bit transform for low-complexity block motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 4, pp. 478-482, Apr. 2007.
[19] C. Choi, J. Jeong, “Enhanced two-bit transform based motion estimation via extension of matching criterion,” IEEE Trans. Consum. Electron., vol. 56, no. 3, pp. 1883-1889, Aug. 2010.
[20] M.K. Güllü, “Weighted constrained one-bit transform based fast block motion estimation,” IEEE Trans. Consum. Electron., vol. 57, no. 2, pp. 751-755, May 2011.
[21] S.J. Ko, S.H. Lee and K.H. Lee, “Fast digital image stabilizer based on Gray-coded bit-plane matching,” IEEE Trans. Consum. Electron., vol. 45, no. 3, pp. 598-603, Aug. 1999.
[22] O. Urhan, S. Ertürk, “Gray coded bit-plane matching for block based motion estimation,” in Proc. 10th Signal Processing and Communication Applications Conf. (SIU), Pamukkale, Denizli, Turkey, pp. 518-523, June 2002.
[23] A. Çelebi, O. Akbulut, O. Urhan, S. Ertürk, “Truncated gray-coded bit-plane matching based motion estimation and its hardware architecture,” IEEE Trans. Consum. Electron., vol. 55, no. 3, pp. 1530-1536, Aug. 2009.
[24] O. Urhan, “Constrained one-bit transform based motion estimation using predictive hexagonal pattern,” J. Electron. Imaging, vol. 16, no. 3, Article ID: 033019, July-Sep. 2007.
[25] E.S. Lee, O. Urhan, T.G. Chang, “Multiplication-free one-bit transform and diamond search combination for fast binary block motion estimation,” in Proc. IEEE 15th Signal Processing and Communications Applications Conf., Eskisehir, Turkey, pp. 430-433, June 2007.
[26] H. Lee, J. Jeong, “Early termination scheme for binary block motion estimation,” IEEE Trans. Consum. Electron., vol. 53, no. 4, pp. 1682-1686, Nov. 2007.
[27] H. Lee, S. Jin, J. Jeong, “Early termination algorithm for 2BT block motion estimation,” Electronics Lett., vol. 45, no. 8, pp. 403-405, Apr. 2009.
[28] O. Urhan, S. Ertürk, “Constrained one-bit transform based motion estimation with early skip mode,” in Proc. 19th IEEE Signal Processing and Communication Applications Conf., Antalya, Turkey, pp. 774-776, Apr. 2011.
[29] O. Urhan, “Constrained one-bit transform based fast block motion estimation using adaptive search range,” IEEE Trans. Consum. Electron., vol. 56, no. 3, pp. 1868-1871, Aug. 2010.
[30] I. Kim, J. Kim, J. Jeong, G. Jeon, “Low-complexity block-based motion estimation algorithm using adaptive search range adjustment,” Opt. Eng., vol. 51, no. 6, Article ID: 067010, June 2012.
[31] O. Urhan, “Truncated gray-coding based fast block motion estimation,” J. Electron. Imaging, vol. 22, no. 2, Article ID: 023018, June 2013.
[32] I. Kim, J. Jeong, “Binary block motion estimation using an adaptive search range adjustment technique,” J. Automation and Control Eng., vol. 4, no. 4, pp. 376-380, Dec. 2014.
[33] P. H. W. Wong and O. C. Au, “Modified one-bit transform for motion estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 7, pp. 1020-1024, Oct. 1999.
[34] B. Demir and S. Ertürk, “Block motion estimation using modified two bit transform,” Lect. Notes in Computer Science, vol. 4263, pp. 522-531, 2006.
[35] B. Demir and S. Ertürk, “Block motion estimation using adaptive modified two-bit transform,” IET Image Process., vol. 1, no. 2, pp. 215-222, June 2007.
[36] H.-Y. Oh, D.-H. Kim, O. Urhan, T.-G. Chang, “Modified constrained one-bit transform based fast block motion estimation,” IEEE Trans. Consum. Electron., vol. 53, no. 3, pp. 1093-1097, Aug. 2007.


[37] T.Y. Kuo, C.H. Wang, “Fast local motion estimation and robust global motion decision for digital image stabilization,” in Proc. Int. Conf. on Intelligent Information Hiding and Multimedia Signal Processing, Harbin, China, pp. 442-445, Aug. 2008.
[38] A. Çelebi, O. Akbulut, O. Urhan, I. Hamzaoğlu, S. Ertürk, “An all binary sub-pixel motion estimation approach and its hardware architecture,” IEEE Trans. Consum. Electron., vol. 54, no. 4, Nov. 2008.
[39] A. Çelebi, O. Urhan, I. Hamzaoğlu, S. Ertürk, “Efficient hardware implementations of low bit depth motion estimation algorithms,” IEEE Signal Process. Lett., vol. 16, no. 6, pp. 513-516, June 2009.
[40] A. Akın, Y. Doğan, I. Hamzaoğlu, “High performance hardware architectures for one bit transform based motion estimation,” IEEE Trans. Consum. Electron., vol. 55, no. 2, pp. 941-949, May 2009.
[41] A. Akın, G. Sayılar, I. Hamzaoğlu, “High performance hardware architectures for one bit transform based single and multiple reference frame motion estimation,” IEEE Trans. Consum. Electron., vol. 56, no. 2, pp. 1144-1152, May 2010.
[42] S. K. Chatterjee, “Implementation of weighted constrained one-bit transformation based fast motion estimation,” IEEE Trans. Consum. Electron., vol. 58, pp. 646-653, May 2012.
[43] A. Çelebi, H. J. Lee, S. Ertürk, “Bit plane matching based variable block size motion estimation method and its hardware architecture,” IEEE Trans. Consum. Electron., vol. 56, pp. 1625-1633, Aug. 2010.
[44] A. Çelebi, O. Urhan, “High performance hardware architecture for constrained one-bit transform based motion estimation,” in Proc. 19th European Signal Processing Conf., 2011.
[45] J. C. Tuan, T. S. Chang, and C. W. Jen, “On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 1, pp. 61-72, Jan. 2002.
BIOGRAPHIES

Seda Yavuz has been an undergraduate student in the Department of Electronics and Telecommunications Engineering, University of Kocaeli, Turkey, since 2011. Her current research interests include motion estimation algorithms and their implementation on FPGAs.

Anıl Çelebi (S’00, AM’09) was born in Ordu, Turkey. He received the B.Sc., M.Sc. and Ph.D. degrees in electronics and communication engineering from Kocaeli University, Kocaeli, Turkey, in 2002, 2005, and 2008, respectively. Since 2002, he has been with the Department of Electronics and Telecommunications Engineering, University of Kocaeli, Turkey, where he is currently an Assistant Professor. He worked as a BK21 postdoctoral research fellow at the School of Electrical Engineering and Computer Science, Seoul National University, Korea, between April and July 2009. His research interests include very large scale integration (VLSI) design and implementation for analog/mixed signal systems, image processing and video coding systems.

Muhammad Aslam was born in Bahawalpur, Pakistan. He received the B.Sc. degree in electronics engineering from International Islamic University, Islamabad, Pakistan, in 2014. Since 2015, he has been with the Department of Electronics and Telecommunications Engineering, University of Kocaeli, Turkey, where he is a master's student. His current research interests include video coding and motion estimation algorithms and their implementation on FPGAs.

Oğuzhan Urhan (S’02-M’06) received his B.Sc., M.Sc., and Ph.D. degrees in electronics and telecommunication engineering from the University of Kocaeli, Kocaeli, Turkey, in 2001, 2003, and 2006, respectively. Since 2001, he has been with the Department of Electronics and Telecommunications Engineering, University of Kocaeli, Turkey, where he is currently a full professor. He was a visiting professor at Chung-Ang University, South Korea, from 2006 to 2007. He is the director of the Kocaeli University Laboratory of Embedded and Vision Systems (KULE). His research interests include digital signal processing, image/video processing and embedded systems.
