Memory Access Reduction for MIME Algorithm - Semantic Scholar


Memory Accesses Reduction for MIME Algorithm

Sumeer Goel, Mohsen Shaaban, Tarek Darwish, Hanan Mahmoud and Magdy Bayoumi
Center for Advanced Computer Studies (CACS), University of Louisiana at Lafayette
[sxg4999, mmm5554, tkd5171, ham8251, mab]@cacs.louisiana.edu

ABSTRACT

Power consumption of digital systems has become a critical design parameter. An important class of digital systems includes applications such as video image processing and speech recognition, which are extremely memory dominant. In such systems, a significant amount of power is consumed during memory accesses. Reducing the number of memory accesses can also considerably impact the power dissipation in the rest of the design. Therefore, optimizing an application for reduced memory access can greatly affect the overall power consumption of the entire system. This paper presents an architectural enhancement for the Multi-Stage Interval-based Motion Estimation (MIME) algorithm that not only saves power by reducing the number of memory accesses but also significantly increases the speedup.

1. INTRODUCTION

Motion estimation aims at compressing the amount of data necessary to transmit a video sequence across a bandwidth-limited channel. One popular technique for motion estimation is the block-matching algorithm (BMA) [1]. In this technique, the current image frame is first partitioned into fixed-size rectangular blocks, and the motion vector for each block is estimated by finding the closest block of pixels in the previous frame according to a matching criterion. The full-search block-matching (FSBM) algorithm [2] provides optimum performance by searching all the blocks in the search window but proves to be computationally expensive. The power consumption and computational cost of these search algorithms can be reduced at different levels of abstraction. Several cost-effective techniques at the algorithmic level have been proposed in the literature [3][6][7].

These modifications address the problem of power consumption but compromise on the complexity of the approach and do not address the vital issues related to memory access. Most of them concentrate on data flow within the processing element (PE). However, this approach can cause data flow problems outside the PE.

0-7803-7965-9/03/$17.00 ©2003 IEEE

If the memory has both the current and previous frame in full, the total number of memory accesses (Tmem) required to perform the FSBMA is the sum of the accesses of previous frame and accesses of the current frame, which is expressed as:

$$T_{mem} = \underbrace{f \times W \times H \times RAC_{curr}}_{\text{current frame}} + \underbrace{f \times W \times H \times RAC_{prev}}_{\text{previous frame}} \qquad (1)$$

where f is the frame rate, W and H are the width and height of the frame, and RACcurr and RACprev are the redundancy access counts (RAC) for the current and previous frames, respectively. The RAC expresses how many times the same pixel is accessed from the memory. Depending upon the architectural implementation, the RAC can vary. If the memory traffic is restricted to one reference block of $N^2$ pixels and one search area of $(2w+N)^2$ pixels, where the search range is ±w pixels, then the architecture has to read the reference block and the candidate blocks each time it computes a motion vector. In this case, the number of memory accesses required is given by:

$$T_{mem} = \left( W \times H + (2w+N)^2 \times \frac{W \times H}{N^2} \right) \times f \qquad (2)$$
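As a quick numerical check of equation (2), the following Python sketch evaluates it for one configuration. The function name is hypothetical, the frame dimensions are assumed to be multiples of N, and the sample values (a 352 x 288 frame, N = 8, w = 7, 30 frames per second) are chosen here only for illustration.

```python
def tmem_block_buffered(width, height, block, search_range, frame_rate):
    """Evaluate equation (2): accesses when one reference block and one
    search area are read for every motion vector."""
    # Assumption: width and height are multiples of the block size N.
    blocks_per_frame = (width * height) // (block * block)
    current_frame = width * height                      # W x H
    previous_frame = (2 * search_range + block) ** 2 * blocks_per_frame
    return (current_frame + previous_frame) * frame_rate


accesses = tmem_block_buffered(width=352, height=288, block=8,
                               search_range=7, frame_rate=30)
print(accesses)  # 26040960
```

In this configuration the previous-frame term is roughly 7.6 times larger than the current-frame term, which is why the search area dominates the memory traffic.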

According to the authors of [8], in an SRAM (CY7C136-133 from Cypress) of size 2M bits that has an access time of 4 ns, a supply voltage of 3.3 V and a current of 375 mA, the energy consumed per memory access is 4.95 nJ. With this in consideration, an algorithm that frequently accesses the memory is liable to consume more power than an algorithm that is less memory intensive. In this paper, we present an architectural enhancement to an existing BMA architecture that reduces the memory access requirements, making the architecture faster and more power-efficient. The architecture also exploits the use of pipeline stages without compromising the total throughput of the system. In the next few sections, we discuss the different block-matching algorithms and their disadvantages. In Section 5 we present the proposed architectural enhancements and compare them with the architecture presented in [9].
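The per-access figure quoted from [8] agrees with the product of supply voltage, current and access time. The short Python sketch below reproduces it; modelling the access energy as V x I x t, and the function name, are assumptions made here for illustration.

```python
def energy_per_access(voltage_v, current_a, access_time_s):
    """Energy of one memory access, modelled as V * I * t (an assumption)."""
    return voltage_v * current_a * access_time_s


energy_j = energy_per_access(voltage_v=3.3, current_a=0.375, access_time_s=4e-9)
print(f"{energy_j * 1e9:.2f} nJ per access")  # 4.95 nJ per access
```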


ICME 2003

2. FULL-SEARCH BLOCK-MATCHING ALGORITHMS

FSBMA finds the best match for each reference block of size N x N in the current frame within a search area S in the previous frame. The criterion for best match is the candidate block with the minimum amount of distortion when compared with the reference block. The most common distortion measure is the Sum of Absolute Differences (SAD) of intensity values between the two blocks being compared. The SAD for the candidate block of size N x N at position (u,v) can be defined as:

$$SAD(u,v) = \sum_{i=1}^{N} \sum_{j=1}^{N} \left| s(i+u, j+v) - r(i,j) \right| \qquad (3)$$

where r(i,j) and s(i+u, j+v) are the intensity values at position (i,j) of the reference block and at position (i+u, j+v) of the candidate block in search area S. The block matching process generates a motion vector (u,v)min and the corresponding distortion value, SADmin. FSBM is widely used because of its simplicity and regularity, but it needs massive computation and expensive hardware.

3. FSBMA BASED ON CONSERVATIVE APPROXIMATION

FSBMA based on conservative approximation uses a conservative approximation [9] of SAD(u,v) for the estimation of motion vectors. The calculation of the new function D(u,v) is less expensive than the conventional SAD(u,v) in terms of VLSI power consumption. The conservative estimate D(u,v) is defined as:

$$D(u,v) = \sum_{i=1}^{N} \left| \sum_{j=1}^{N} s(i+u, j+v) - \sum_{j=1}^{N} r(i,j) \right| \qquad (4)$$

The new function D(u,v) is a lower bound of the function SAD(u,v). The conservative estimate D(u,v) is not directly proportional to the exact distortion, which limits the capability of the algorithm. This can be seen in Figure 2, where SAD(a,b) > SAD(a,c) whereas D(a,b) < D(a,c). The number of blocks eliminated by the algorithm depends heavily on the choice of the starting point. This algorithm also has the overhead of computing the conservative approximation. Depending upon the number of eliminated candidate blocks, the number of memory accesses can exceed that given by equation (2). Tmem now becomes a function of the probability of elimination before the actual SAD is calculated. In the worst case, when the conservative approximation does not eliminate any candidate block, the number of memory accesses doubles.

4. MIME ALGORITHM

This algorithm is a block-based motion estimation algorithm that utilizes successive elimination techniques [9]. It utilizes two approximate functions, SAD1(m)(u,v) and SAD2(m)(u,v), as the upper and lower boundaries, respectively, of the interval that includes SAD(u,v). It also uses low-resolution blocks for the calculation of SAD1(m)(u,v) and SAD2(m)(u,v) for both the current and reference frames. The parameter m is equal to $2^{b1}$, where b1 is the number of bits of the pixel intensity that are used, starting from the MSB and going toward the LSB. This scheme can be applied in multiple stages. The intensity values of the pixels are categorized, and the absolute difference of two pixels lies in one of these categories. The approximate functions are calculated accordingly. The generic equations for the calculations are:

$$SAD_1^{(m)}(X,Y) = (N - n_1) + d \times \left( n_3 + 2 n_4 + \cdots + (m-2)\, n_m \right) \qquad (5)$$

$$SAD_2^{(m)}(X,Y) \le SAD(X,Y) \le SAD_1^{(m)}(X,Y) \qquad (6)$$

where N is the block size, m is the number of categories, and $n_i$ is the number of occurrences of the i-th of the categories mentioned above. A possible motion vector (PMV) set is generated containing all the non-eliminated candidate blocks. The next step repeats the procedure with an increased resolution of the pixel intensity values but a smaller number of candidate blocks. The detailed algorithm and an example are shown in Figure 1 and Figure 2. In the two-stage case, the first step search (FSS) uses low-resolution blocks formed from the 2 most significant bits of each pixel, and the second step search (SSS) uses low-resolution blocks formed from the 4 MSBs.
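The low-resolution, interval-based elimination used by MIME can be illustrated with a small Python sketch. It keeps only the most significant bits of each pixel, derives an interval that must contain the exact SAD, and discards every candidate whose lower bound exceeds the smallest upper bound. The interval used here is a generic quantization bound chosen for illustration, not the paper's approximate functions SAD1(m) and SAD2(m); the function names and the elimination rule are likewise assumptions.

```python
def truncate(block, bits, depth=8):
    """Keep only the `bits` most significant bits of each `depth`-bit pixel."""
    shift = depth - bits
    return [[pixel >> shift for pixel in row] for row in block]


def sad_interval(ref, cand, bits, depth=8):
    """Return (lower, upper) bounds on the exact SAD from truncated pixels.

    Assumption: a generic quantization bound, not the paper's SAD1/SAD2.
    """
    step = 1 << (depth - bits)          # width of one intensity category
    lower = upper = 0
    for ref_row, cand_row in zip(truncate(ref, bits, depth),
                                 truncate(cand, bits, depth)):
        for r, c in zip(ref_row, cand_row):
            k = abs(r - c)              # distance between the two categories
            lower += max(0, (k - 1) * step + 1)
            upper += (k + 1) * step - 1
    return lower, upper


def surviving_candidates(intervals):
    """Keep candidates whose lower bound does not exceed the best upper bound."""
    best_upper = min(upper for _, upper in intervals.values())
    return {mv for mv, (lower, _) in intervals.items() if lower <= best_upper}


ref = [[0, 255], [0, 255]]
candidates = {
    (0, 0): [[0, 255], [0, 255]],   # identical to the reference block
    (0, 1): [[255, 0], [255, 0]],   # every pixel differs by 255
}
intervals = {mv: sad_interval(ref, blk, bits=2) for mv, blk in candidates.items()}
print(intervals)
print(surviving_candidates(intervals))
```

With 2 bits per pixel (m = 4), the second candidate is eliminated without its full-resolution pixels ever being read, which is the source of the memory and computation savings.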

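Figure 1: MIME algorithm for two stages. (Flowchart. First Step Search (FSS): calculate SAD1 and SAD2 with m=4 on low-resolution blocks, eliminate non-candidate blocks and determine the elements of the PMV set; the PMV set is the new search window. Second Step Search (SSS): calculate SAD1 and SAD2 with m=16 and determine the elements of the PRS; the PRS is the new search window. Full search: use full-resolution blocks to obtain the optimal motion vectors.)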

In this algorithm, the main advantage lies in the fact that it eliminates a large number of candidate blocks from the search area. As a result, very few SAD computations have to be done. The major power saving comes from the fact that the computation of the approximate functions requires negligible hardware compared to the actual SAD computation. This algorithm works in stages, and if a candidate block is not eliminated, the memory needs to be accessed again. The Tmem required by the system again becomes a function of the probability of elimination of a candidate block.
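For comparison with the interval-based approach, the exact SAD criterion of equation (3) and the conservative estimate of equation (4) can be sketched in Python as follows. The sketch assumes that the search area is stored as a 2-D list, that (u,v) is the non-negative offset of a candidate block's top-left corner inside it, and that equation (4) compares row sums; the function names and the skip rule (a candidate is skipped when D is not smaller than the best SAD found so far) are illustrative assumptions.

```python
def sad(ref, area, u, v):
    """Exact sum of absolute differences for the candidate at offset (u, v)."""
    n = len(ref)
    return sum(abs(area[i + u][j + v] - ref[i][j])
               for i in range(n) for j in range(n))


def conservative_estimate(ref, area, u, v):
    """Row-sum estimate D(u, v); by the triangle inequality D <= SAD."""
    n = len(ref)
    return sum(abs(sum(area[i + u][v:v + n]) - sum(ref[i])) for i in range(n))


def full_search(ref, area):
    """Exhaustive search that skips candidates ruled out by D(u, v)."""
    n = len(ref)
    best_mv, best_sad, exact_evaluations = None, float("inf"), 0
    for u in range(len(area) - n + 1):
        for v in range(len(area[0]) - n + 1):
            if conservative_estimate(ref, area, u, v) >= best_sad:
                continue                 # cannot beat the current best match
            exact_evaluations += 1
            distortion = sad(ref, area, u, v)
            if distortion < best_sad:
                best_mv, best_sad = (u, v), distortion
    return best_mv, best_sad, exact_evaluations


ref = [[9, 8], [7, 6]]
area = [[9, 8, 0], [7, 6, 0], [0, 0, 0]]
print(full_search(ref, area))  # ((0, 0), 0, 1)
```

Here the perfect match is found first, so the remaining three candidates are rejected by the estimate alone. When the best match comes late in the scan order, few candidates are skipped, which reflects the dependence on the starting point noted in Section 3.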

Figure 2: Practical example for MIME Algorithm. (Blocks a, b and c; black pixels have an intensity of 255 and white pixels have an intensity of 0.)
SAD(a,b) = 4080, SAD(a,c) = 1275
D(a,b) = 0, D(a,c) = 1275
SAD1(4)(a,b) = 4080, SAD2(4)(a,b) = 2032
SAD1(4)(a,c) = 1328, SAD2(4)(a,c) = 635

5. PROPOSED ARCHITECTURE AND DISCUSSION

In this section, we discuss the architectural aspects of the MIME algorithm. After incorporating enhancements on the previous architecture [9], a new architecture for the MIME algorithm is proposed here. The new architecture uses fewer memory accesses to perform the same operation and exploits the use of pipeline stages. The block diagram of the architecture in [9] for the MIME algorithm is shown in Figure 3. This figure shows three pipeline stages, which were not present in the original architecture. They have been added here for comparison with the proposed architecture. The block diagram of the proposed architecture is shown in Figure 4.

In our comparison, we have used 2 steps of MIME, and the final step is the full-search matching. In [9], the first step search (FSS) is done and a PMV set is generated. Using that set, the second step search (SSS) is done, and the resulting possible resolution set (PRS) is used to perform the full-resolution step search (FRSS). Each step is dependent upon the completion of its previous step. If some kind of pipelining is not provided, valuable clock cycles are wasted that could have been avoided. Secondly, to perform the SAD1(m) and SAD2(m) computations, each step uses a different resolution and therefore has to access the memory again. As mentioned earlier, power consumption increases with more memory accesses. Also, Tmem now becomes a function of the probability of the number of candidate blocks eliminated after each step.

The simulation results for the MIME algorithm for standard benchmark video sequences are shown in Figure 5(a). These simulations use two CIF (352 pixel x 288 pixel, 30 fps, 8 bps) and two CCIR601 (720 pixel x 480 pixel, 30 fps, 8 bps) sequences. A block size of 8x8 is used and the search window is -7 to +7. In these simulations, we observe that, on average, 30% of the candidate blocks are eliminated after performing the FSS. To perform the SSS, memory accesses have to be made for the remaining 70% of the candidate blocks. Also, the average percentage of reaching the optimal motion vector after the first step is 7%. Owing to these factors, the power consumption due to memory accesses becomes a large proportion of the total system cost.

One solution to this could have been pipelining of the three steps (see Figure 3). The disadvantage of doing this is that, to provide the data to the FSS for a new pair of frames while the FRSS is working on the previous set, either the memory size has to be increased to accommodate the new frames (along with access ports) or another memory module has to be added.

Figure 3: Previous architecture for MIME algorithm. (Block diagram. The 8-bit search window and reference block are reduced to 2 bits for the First Step Search Unit (FSSU), which fills the Possible Motion Vector Matrix (PMV) in Pipeline Stage 1. The Second Step Search Unit (SSSU) works on 4-bit data and fills the Possible Refined Set Matrix (PRS) in Pipeline Stage 2. The Full Resolution Search Unit (FRSU) works on 8-bit data and outputs the final motion vector in Pipeline Stage 3.)

Figure 4: Block diagram of the proposed architecture showing each pipeline stage. (Block diagram. The First Step Search Unit (FSSU), working on 2-bit data, and the Second Step Search Unit (SSSU), working on 4-bit data, share Pipeline Stage 1 and fill the Possible Motion Vector Matrix (PMV) and the Possible Refined Set Matrix (PRS). The Full Resolution Search Unit (FRSU) works on 8-bit data in Pipeline Stage 2 and outputs the final motion vector.)

We propose a solution to this problem by performing the SSS simultaneously with the FSS. This is depicted in Figure 4 as the dotted box and named the combined FSS. The PMV is generated and SAD1(16) and SAD2(16) are calculated by the end of the FSS. Using the PMV and these SADs, the PRS is then determined. The motivation behind this choice is twofold. Firstly, as mentioned previously, the FSS eliminates only 30% of the candidate blocks. Secondly, and more importantly, the overhead in doing so is very small. The only difference between the FSS and the SSS is the SAD(m) module, so performing both steps together incurs only a small overhead. This overhead is estimated to be much smaller than that incurred if the SSS is done separately, i.e. when 70% of the candidate blocks are searched again. The percentage is given by:

$$\left[ 1 - \frac{P_{SAD(4)} + P_{SAD(16)} + P_{mem}}{P_{SAD(4)} + P_{SAD(16)} + P_{mem} + \left( P_{mem} \times P_{FSS} \right)} \right] \times 100\% \qquad (7)$$

Another advantage of this approach is that two steps are now performed at the same time, which saves a large amount of time. If we consider that one absolute difference computation takes one clock cycle, we save $(1 - P_{FSS}) \cdot (2w+1)^2 \cdot (W \cdot H)/N^2$ clock cycles, where PFSS is the probability of eliminating a candidate block in the FSS. This makes the proposed architecture not only power-efficient but also time-efficient. Figure 5(b) shows the speedup results for the MIME algorithm with the old architecture. On average, there is an 11 times speedup compared to FSBMA. We estimate that our architecture almost doubles this speedup, making it suitable for high frame rate video.
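To give a sense of scale, the clock-cycle expression above can be evaluated with the simulation parameters reported in this section (30% elimination in the FSS, a search window of -7 to +7, 8x8 blocks). The Python sketch below does so for a 352 x 288 frame, the standard CIF size; the function name is hypothetical and the frame dimensions are assumed to be multiples of N.

```python
def cycles_saved(p_fss, search_range, width, height, block):
    """Evaluate (1 - P_FSS) * (2w + 1)^2 * (W * H) / N^2."""
    candidates_per_block = (2 * search_range + 1) ** 2
    # Assumption: width and height are multiples of the block size N.
    blocks_per_frame = (width * height) // (block * block)
    return (1 - p_fss) * candidates_per_block * blocks_per_frame


saved = cycles_saved(p_fss=0.3, search_range=7, width=352, height=288, block=8)
print(round(saved))  # 249480
```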

Figure 5: (a) Probability of finding the optimal MV. (b) The speedup of MIME algorithm. (Panel (a): average percentage probability, on a scale of 0 to 100, after FSS and after SSS for the Tennis Table, Football, Claire and Miss America sequences. Panel (b): speedup, on a scale of 0 to 14, of MIME and exhaustive full search for the same sequences.)

Figure 6: Proposed architecture. (Block diagram labels: inputs X and Y; FSS Unit with a SAD(4) module and a SAD(16) module; signals SAD(4)1, SAD(4)2, SAD(16)1 and SAD(16)2; two temporary buffers, two SMIN blocks and two comparators; PMV matrix and PRS matrix; MVG; mvX and mvY; Pipeline Stage 1 and Pipeline Stage 2.)

Figure 6 shows the architecture for the proposed scheme. The pipelining scheme has been incorporated in the figure to reduce power consumption.

6. CONCLUSION

We have presented an architectural implementation for the MIME algorithm. There is a significant reduction in the number of memory accesses, which reduces the overall power consumption. As a consequence, the architecture exhibits almost double the speedup of the previous architecture. Since the speedup of the system increases, further power saving can be achieved by scaling the supply voltage in the last step. The decrease in the throughput of the FRSS is compensated by the time saved earlier. Future work will study the issues involved in the scheme suggested above.

7. ACKNOWLEDGEMENTS

The authors acknowledge the support of the U.S. Department of Energy (DoE), EETAPP program, DE97ER12220 and the Governor's Information Technology Initiative.

8. REFERENCES

[1] C. Cafforio and F. Rocca, "Methods for measuring small displacements of television images," IEEE Trans. Inform. Theory, vol. IT-22, no. 5, pp. 573-579, Sept. 1976.

[2] M. Tekalp, Digital video processing, Prentice-Hall, Englewood Cliffs, NJ, 1995.

[3] J. Jain and A. Jain, "Displacement measurement and its applications in interframe coding," IEEE Trans. on Communications, vol. 29, no. 12, pp. 1799-1808, Dec. 1981.

[4] S. Kim, Y. Kim, K. Kim, H. Chung, K. Choi, Y. Kim and G. Jung, "A fast motion estimator for real time system," IEEE Trans. on Consumer Electronics, vol. 43, no. 1, pp. 24-33, 1997.

[5] W. Badawy and M. A. Bayoumi, "Algorithm-based low-power VLSI architecture for 2-D mesh video-object motion tracking," IEEE Trans. on Circuits and Systems for Video Technology, vol. 12, no. 4, April 2002.

[6] L. M. Po and W. C. Ma, "A novel four step search algorithm for fast block motion estimation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 6, pp. 313-317, June 1996.

[7] G. Yeh, Y. Lu, and J. Burr, "A low-power video motion estimation array processor," in 1996 Symposium on VLSI Circuits Digest of Technical Papers, June 1996, pp. 162-163.

[8] W. T. Shiue and C. Chakrabarti, "Memory exploration for low-power, embedded systems," IEEE Conference on Design Automation Conference, New Orleans, 1999, pp. 140-145.

[9] H. Mahmoud, S. Goel, M. Shaaban, T. Darwish and M. Bayoumi, "A low-power VLSI architecture for multi-stage interval-based motion estimation (MIME) algorithm," Proc. of the Intl. Workshop on Digital and Computational Video, 2002.

[10] Viet L. Do and Kenneth Y. Yun, "A low-power architecture for full-search block-matching motion estimation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, no. 4, pp. 393-398, August 1998.
