Memory Accesses Reduction for MIME Algorithm

Sumeer Goel, Mohsen Shaaban, Tarek Darwish, Hanan Mahmoud and Magdy Bayoumi
Center for Advanced Computer Studies (CACS)
University of Louisiana at Lafayette
[sxg4999, mmm5554, tkd5171, ham8251, mab]@cacs.louisiana.edu

ABSTRACT

Power consumption of digital systems has become a critical design parameter. An important class of digital systems includes applications such as video image processing and speech recognition, which are extremely memory dominant. In such systems, a significant amount of power is consumed during memory accesses, and reducing the number of memory accesses can also considerably impact the power dissipation in the rest of the design. Optimizing an application for reduced memory access can therefore greatly affect the overall power consumption of the entire system. This paper presents an architectural enhancement to the Multi-Stage Interval-based Motion Estimation (MIME) algorithm that not only saves power by reducing the number of memory accesses but also significantly increases the speedup.

1. INTRODUCTION

Motion estimation aims at compressing the amount of data necessary to transmit a video sequence across a bandwidth-limited channel. One popular technique for motion estimation is the block-matching algorithm (BMA) [1]. In this technique, the current image frame is first partitioned into fixed-size rectangular blocks, and the motion vector for each block is estimated by finding the closest block of pixels in the previous frame according to a matching criterion. The full-search block-matching (FSBM) algorithm [2] provides optimum performance by searching all the blocks in the search window, but it is computationally expensive. The power consumption and computational cost of these search algorithms can be reduced at different levels of abstraction. Several cost-effective techniques at the algorithmic level have been proposed in the literature [3][6][7].

These modifications address the problem of power consumption but compromise on the complexity of the approach and do not address the vital issues related to memory access. Most of them concentrate on data flow within the processing element (PE). However, this approach can cause data flow problems outside the PE.
0-7803-7965-9/03/$17.00 ©2003 IEEE
If the memory holds both the current and previous frames in full, the total number of memory accesses ($T_{mem}$) required to perform the FSBMA is the sum of the accesses to the current frame and the accesses to the previous frame, which is expressed as:

$$T_{mem} = \underbrace{f \times W \times H \times RAC_{curr}}_{\text{current frame}} + \underbrace{f \times W \times H \times RAC_{prev}}_{\text{previous frame}} \qquad (1)$$
where $f$ is the frame rate, $W$ and $H$ are the width and height of the frame, and $RAC_{curr}$ and $RAC_{prev}$ are the redundancy access counts (RAC) for the current and previous frames, respectively. The RAC is the number of times the same pixel is accessed from the memory, and it varies with the architectural implementation. If the memory traffic is restricted to one reference block of $N^2$ pixels and one search area of $(2w+N)^2$ pixels, then the architecture must read the reference block and the search-area pixels each time it computes a motion vector. In this case, the number of memory accesses required is given by:

$$T_{mem} = \left( W \times H + (2w+N)^2 \times \frac{W \times H}{N^2} \right) \times f \qquad (2)$$
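As a rough numerical check of equations (1) and (2), the sketch below evaluates both for a CIF-sized sequence. The function names are ours, and the parameter values (352 x 288 pixels, 30 fps, N = 8, w = 7) are illustrative assumptions. The symbol w is read here as the maximum displacement in each direction, which the text does not define explicitly.

```python
def t_mem_general(f, width, height, rac_curr, rac_prev):
    """Equation (1): accesses per second for given redundancy access counts."""
    return f * width * height * rac_curr + f * width * height * rac_prev


def t_mem_block_fetch(f, width, height, n, w):
    """Equation (2): one N x N reference block and one (2w+N)^2 search area per block."""
    pixels = width * height
    blocks = pixels / n**2  # number of reference blocks per frame
    return (pixels + (2 * w + n) ** 2 * blocks) * f


if __name__ == "__main__":
    f, width, height, n, w = 30, 352, 288, 8, 7  # assumed example values
    direct = t_mem_block_fetch(f, width, height, n, w)
    # Equation (2) corresponds to RAC_curr = 1 and RAC_prev = (2w+N)^2 / N^2 in equation (1).
    via_rac = t_mem_general(f, width, height, 1, (2 * w + n) ** 2 / n**2)
    print(f"accesses per second, equation (2): {direct:,.0f}")
    print(f"accesses per second, equation (1): {via_rac:,.0f}")
```

Under these assumed values each previous-frame pixel is fetched about 7.6 times on average, which is why the previous-frame term dominates the total.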
According to the authors of [8], for an SRAM (CY7C136-133 from Cypress) of size 2 Mbit with an access time of 4 ns, a supply voltage of 3.3 V and a current of 375 mA, the energy consumed per memory access is 4.95 nJ. An algorithm that accesses the memory frequently is therefore liable to consume more power than one that is less memory intensive.

In this paper, we present an architectural enhancement to an existing BMA architecture that reduces its memory access requirements, making the architecture faster and more power-efficient. The architecture also exploits pipeline stages without compromising the total throughput of the system. In the next few sections, we discuss the different block-matching algorithms and their disadvantages. In Section 5 we present the proposed architectural enhancements and compare them with the architecture presented in [9].
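The 4.95 nJ figure quoted from [8] earlier in this section can be reproduced with a one-line check, assuming (our reading, not stated in the source) that the energy per access is simply the supply voltage times the current times the access time:

```latex
\[
E_{access} = V \times I \times t_{access}
           = 3.3\,\mathrm{V} \times 375\,\mathrm{mA} \times 4\,\mathrm{ns}
           = 4.95\,\mathrm{nJ}
\]
```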
ICME 2003
2. FULL-SEARCH BLOCK-MATCHING ALGORITHMS

FSBMA finds the best match for each reference block of size $N \times N$ in the current frame within a search area $S$ in the previous frame. The best match is the candidate block with the minimum distortion when compared with the reference block. The most common distortion measure is the Sum of Absolute Differences (SAD) of intensity values between the two blocks being compared. The SAD for the candidate block of size $N \times N$ at position $(u,v)$ is defined as:

$$SAD(u,v) = \sum_{i=1}^{N} \sum_{j=1}^{N} \left| s(i+u, j+v) - r(i,j) \right| \qquad (3)$$

where $r(i,j)$ and $s(i+u,j+v)$ are the intensity values at position $(i,j)$ of the reference block and at position $(i+u,j+v)$ of the candidate block in search area $S$. The block matching process generates a motion vector $(u,v)_{min}$ and the corresponding distortion value, $SAD_{min}$. FSBM is widely used because of its simplicity and regularity, but it requires massive computation and expensive hardware.

3. FSBMA BASED ON CONSERVATIVE APPROXIMATION

FSBMA based on conservative approximation uses a conservative approximation [9] of $SAD(u,v)$ to estimate the motion vectors. Computing the new function $D(u,v)$ is less expensive than computing the conventional $SAD(u,v)$ in terms of VLSI power consumption. The conservative estimate $D(u,v)$ is defined as:

$$D(u,v) = \sum_{i=1}^{N} \left| \sum_{j=1}^{N} s(i+u, j+v) - \sum_{j=1}^{N} r(i,j) \right| \qquad (4)$$

The function $D(u,v)$ is a lower bound of $SAD(u,v)$. However, $D(u,v)$ is not directly proportional to the exact distortion, which limits the capability of the algorithm. This can be seen in Figure 2, where $SAD(a,b) > SAD(a,c)$ whereas $D(a,b) < D(a,c)$. The number of blocks eliminated by the algorithm depends heavily on the choice of the starting point. The algorithm also carries the overhead of computing the conservative approximation. Depending on the number of eliminated candidate blocks, the number of memory accesses can exceed that of equation (2), because $T_{mem}$ becomes a function of the probability that a candidate is eliminated before the actual SAD is calculated. In the worst case, when the conservative approximation does not eliminate any candidate block, the number of memory accesses doubles.

4. MIME ALGORITHM

This algorithm is a block-based motion estimation algorithm that uses successive elimination techniques [9]. It uses two approximate functions, $SAD_1^{(m)}(u,v)$ and $SAD_2^{(m)}(u,v)$, as the upper and lower boundaries, respectively, of the interval that contains $SAD(u,v)$. Both functions are calculated from low-resolution blocks of the current and reference frames. The parameter $m$ is equal to $2^{b_1}$, where $b_1$ is the number of bits of the pixel intensity that are used, counted from the MSB towards the LSB. This scheme can be applied in multiple stages. The pixel intensity values are grouped into categories, and the absolute difference of two pixels falls into one of these categories; the approximate functions are calculated accordingly. The generic equations for the calculations are:

$$SAD_1^{(m)}(X,Y) = (N - n_1) + d \times \left( n_3 + 2n_4 + \cdots + (m-2)\,n_m \right) \qquad (5)$$

$$SAD_2^{(m)}(X,Y) \le SAD(X,Y) \le SAD_1^{(m)}(X,Y) \qquad (6)$$

where $N$ is the block size, $m$ is the number of categories, and $n_i$ is the number of occurrences of category $i$. A possible motion vector (PMV) set is generated containing all the non-eliminated candidate blocks. The next step repeats the procedure with an increased resolution of the pixel intensity values but a reduced number of candidate blocks. The detailed algorithm and an example are shown in Figure 1 and Figure 2.
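To make equations (3) and (4) concrete, the following sketch computes both measures for a toy 2 x 2 block. It reads equation (4) as a sum over rows of the absolute difference between candidate and reference row sums, which is our interpretation of the garbled original. The function names, the 0-based indexing with non-negative offsets, and the pixel data are assumptions for illustration only.

```python
def sad(ref, search, u, v):
    """Equation (3): sum of absolute pixel differences at offset (u, v)."""
    n = len(ref)
    return sum(
        abs(search[i + u][j + v] - ref[i][j])
        for i in range(n)
        for j in range(n)
    )


def conservative_estimate(ref, search, u, v):
    """Equation (4) as read here: per-row |candidate row sum - reference row sum|."""
    n = len(ref)
    return sum(
        abs(sum(search[i + u][j + v] for j in range(n)) - sum(ref[i]))
        for i in range(n)
    )


if __name__ == "__main__":
    ref = [[255, 0], [0, 255]]                          # hypothetical reference block
    search = [[0, 255, 0], [255, 0, 255], [0, 255, 0]]  # hypothetical search area
    for u, v in [(0, 0), (0, 1)]:
        print((u, v), sad(ref, search, u, v), conservative_estimate(ref, search, u, v))
```

In this toy case the candidate at offset (0, 0) is the inverse checkerboard of the reference, so its SAD is large while its row sums match and the estimate is zero. The estimate remains a valid lower bound, but it cannot eliminate that candidate, which illustrates why it is not proportional to the true distortion.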
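Figure 1: MIME algorithm for two stages.
[Flowchart. First Step Search (FSS): using low-resolution blocks built from the 2 most significant bits of each pixel, calculate SAD1 and SAD2 with m=4, eliminate non-candidate blocks and determine the elements of the PMV set; the PMV set is the new search window. Second Step Search (SSS): using low-resolution blocks built from the 4 MSBs of each pixel, calculate SAD1 and SAD2 with m=16 and determine the elements of the PRS; the PRS is the new search window. Full Search: use full-resolution blocks to obtain the optimal motion vectors.]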
The main advantage of this algorithm is that it eliminates a large number of candidate blocks from the search area, so very few SAD computations have to be done. The major power saving comes from the fact that computing the approximate functions requires negligible hardware compared with the actual SAD computation. However, the algorithm works in stages, and if a candidate block is not eliminated, the memory needs to be accessed again. The $T_{mem}$ required by the system therefore again becomes a function of the probability of elimination of a candidate block.
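The interval-based elimination idea can be sketched as follows. This is not the paper's SAD1/SAD2 formulation from equation (5); it swaps in generic bounds derived from truncating each pixel to its most significant bits, purely to show how an upper and a lower bound prune candidates before any exact SAD is computed. All names, the 8-bit pixel depth and the sample data are assumptions.

```python
def sad_interval(ref, cand, bits, depth=8):
    """Return (lower, upper) bounds on the SAD using only the `bits` MSBs of each pixel."""
    shift = depth - bits
    d = 1 << shift  # width of one quantisation bin
    lower = upper = 0
    for ref_row, cand_row in zip(ref, cand):
        for r, c in zip(ref_row, cand_row):
            k = abs((r >> shift) - (c >> shift))  # distance between bins
            lower += 0 if k == 0 else (k - 1) * d + 1
            upper += (k + 1) * d - 1
    return lower, upper


def surviving_candidates(ref, candidates, bits):
    """Keep candidates whose lower bound does not exceed the smallest upper bound."""
    bounds = {mv: sad_interval(ref, blk, bits) for mv, blk in candidates.items()}
    best_upper = min(upper for _, upper in bounds.values())
    return [mv for mv, (lower, _) in bounds.items() if lower <= best_upper]


if __name__ == "__main__":
    ref = [[255, 0], [0, 255]]  # hypothetical reference block
    candidates = {              # hypothetical candidate blocks keyed by motion vector
        (0, 0): [[255, 0], [0, 255]],
        (0, 1): [[0, 255], [255, 0]],
        (1, 0): [[200, 10], [20, 250]],
    }
    stage1 = surviving_candidates(ref, candidates, bits=2)
    stage2 = surviving_candidates(ref, {mv: candidates[mv] for mv in stage1}, bits=4)
    print("after 2-bit stage:", stage1)
    print("after 4-bit stage:", stage2)
```

A candidate whose lower bound exceeds the best upper bound cannot be the minimum, so it is dropped without reading its full-resolution pixels. Survivors are re-examined at a finer resolution, which is where the repeated memory accesses discussed above arise.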
Figure 2: Practical example for MIME Algorithm.
[Three example blocks: Block a, Block b and Block c. Black pixels have an intensity of 255; white pixels have an intensity of 0. SAD(a,b)=4080, SAD(a,c)=1275, D(a,b)=0, D(a,c)=1275, SAD1(4)(a,b)=4080, SAD2(4)(a,b)=2032, SAD1(4)(a,c)=1328, SAD2(4)(a,c)=635.]

5. PROPOSED ARCHITECTURE AND DISCUSSION

In this section, we discuss the architectural aspects of the MIME algorithm. A new architecture for the MIME algorithm is proposed here by enhancing the previous architecture [9]. The new architecture uses fewer memory accesses to perform the same operation and exploits pipeline stages. The block diagram of the architecture in [9] for the MIME algorithm is shown in Figure 3. This figure shows three pipeline stages which were not present in the original architecture; they have been added here for comparison with the proposed architecture. The block diagram of the proposed architecture is shown in Figure 4.

In our comparison, we have used two steps of MIME, and the final step is the full-search matching. In [9], the first step search (FSS) is done and a PMV set is generated. Using that set, the second step search (SSS) is done, and the resulting possible resolution set (PRS) is used to perform the full-resolution step search (FRSS). Each step depends on the completion of the previous step. If some kind of pipelining is not provided, valuable clock cycles are wasted. Secondly, each step uses a different resolution to perform the SAD1(m) and SAD2(m) computations, and therefore has to access the memory again. As mentioned earlier, power consumption increases with more memory accesses. Also, Tmem now becomes a function of the probability that candidate blocks are eliminated after each step.

The simulation results for the MIME algorithm for standard benchmark video sequences are shown in Figure 5(a). These simulations use two CIF (352 pixel x 288 pixel, 30 fps, 8 bps) and two CCIR601 (720 pixel x 480 pixel, 30 fps, 8 bps) sequences. A block size of 8x8 is used and the search window is -7 to +7. We observe that, on average, 30% of the candidate blocks are eliminated after performing the FSS. To perform the SSS, memory accesses have to be made for the remaining 70% of the candidate blocks. Also, the optimal motion vector is reached after the first step in only 7% of cases on average. Owing to these factors, the power consumption due to memory accesses becomes a large proportion of the total system cost.

One solution could have been to pipeline the three steps (see Figure 3). The disadvantage of doing this is that, to provide the data to the FSS for a new pair of frames while the FRSS is working on the previous set, either the memory size has to be increased to accommodate the new frames (along with access ports) or another memory module has to be added.

Figure 3: Previous architecture for MIME algorithm.
[Block diagram. The Search Window and Reference Block (8 bit) feed the First Step Search Unit (FSSU) at 2-bit resolution, which produces the Possible Motion Vector Matrix (PMV) in Pipeline Stage 1. The Second Step Search Unit (SSSU) works at 4-bit resolution and produces the Possible Refined Set Matrix (PRS) in Pipeline Stage 2. The Full Resolution Search Unit (FRSU) works at 8-bit resolution and produces the Final Motion Vector in Pipeline Stage 3.]
Figure 4: Block diagram of the proposed architecture showing each pipeline stage.
[Block diagram. The Search Window and Reference Block (8 bit) feed the First Step Search Unit (FSSU) at 2-bit resolution and the Second Step Search Unit (SSSU) at 4-bit resolution, which together produce the Possible Motion Vector Matrix (PMV) and the Possible Refined Set Matrix (PRS) in Pipeline Stage 1. The Full Resolution Search Unit (FRSU) works at 8-bit resolution and produces the Final Motion Vector in Pipeline Stage 2.]
We propose a solution to this problem by performing the SSS simultaneously with the FSS. This is depicted in Figure 4 as the dotted box, named the combined FSS. The PMV is generated and SAD1(16) and SAD2(16) are calculated by the end of the FSS. The PRS is then determined using the PMV and these SAD values. The motivation behind this choice is twofold. Firstly, as mentioned previously, the FSS eliminates only 30% of the candidate blocks. Secondly, and more importantly, the overhead of doing so is very small: the only difference between the FSS and the SSS is the SAD(m) module, so performing both steps together incurs only a small overhead. This overhead is estimated to be much smaller than that incurred if the SSS is done separately, i.e. when 70% of the candidate blocks are searched again. The percentage saving is given by:
$$\left( 1 - \frac{P_{SAD(4)} + P_{SAD(16)} + P_{mem}}{P_{SAD(4)} + P_{SAD(16)} + P_{mem} + (P_{mem} \times P_{FSS})} \right) \times 100\% \qquad (7)$$
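Equation (7) can be evaluated directly once the individual terms are known. The sketch below does this with made-up numbers: the source does not give values for the SAD(4), SAD(16) and memory terms, so the inputs here are hypothetical relative costs chosen only to show the shape of the calculation, and the function name is ours.

```python
def combined_fss_saving(p_sad4, p_sad16, p_mem, p_fss):
    """Equation (7): percentage saving, with terms named as in the equation."""
    numerator = p_sad4 + p_sad16 + p_mem
    denominator = numerator + p_mem * p_fss
    return (1 - numerator / denominator) * 100


if __name__ == "__main__":
    # Hypothetical relative costs: memory access assumed 8x the cost of a SAD(m) module.
    saving = combined_fss_saving(p_sad4=1.0, p_sad16=1.0, p_mem=8.0, p_fss=0.3)
    print(f"estimated saving: {saving:.1f}%")
```

With these assumed inputs the saving grows as the memory term dominates the SAD(m) terms, which matches the argument that the extra SAD(16) module is cheap compared with a second pass over memory.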
Another advantage is that two steps are now performed at the same time, which saves a large amount of time. If we consider that one absolute difference computation takes one clock cycle, we save $(1-P_{FSS}) \cdot (2w+1)^2 \cdot (W \cdot H)/N^2$ clock cycles, where $P_{FSS}$ is the probability of eliminating a candidate block in the FSS. This makes the proposed architecture not only power-efficient but also time-efficient. Figure 5(b) shows the speedup results for the MIME algorithm with the old architecture. On average, there is an 11 times speedup compared with FSBMA. We estimate that our architecture almost doubles this speedup, making it suitable for high frame rate video.
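As a worked instance of the clock-cycle expression above, the sketch below plugs in the simulation settings quoted earlier (8x8 blocks, a search window of -7 to +7, 30% elimination in the FSS) for a CIF frame. Treating the result as a per-frame figure is our reading of the expression, and the function name is hypothetical.

```python
def cycles_saved(p_fss, w, width, height, n):
    """(1 - P_FSS) * (2w+1)^2 * (W*H)/N^2, following the expression in the text."""
    candidates_per_block = (2 * w + 1) ** 2
    blocks_per_frame = (width * height) / n**2
    return (1 - p_fss) * candidates_per_block * blocks_per_frame


if __name__ == "__main__":
    saved = cycles_saved(p_fss=0.3, w=7, width=352, height=288, n=8)
    print(f"clock cycles saved: {saved:,.0f}")
```

With 225 candidate positions per block and 1584 blocks per CIF frame, the 70% of candidates that survive the FSS account for roughly a quarter of a million cycles under these assumptions.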
Figure 5: (a) Probability of finding the optimal MV. (b) The speedup of MIME algorithm.
[Plots. (a) Average percentage probability (0 to 100) after FSS and after SSS for the Table Tennis, Football, Claire and Miss America sequences. (b) Speedup (0 to 14) of MIME compared with exhaustive full search for the same four sequences.]

Figure 6 shows the architecture for the proposed scheme. The pipelining scheme has been incorporated in the figure to reduce power consumption.

Figure 6: Proposed architecture.
[Block diagram. Labels: inputs X and Y; SAD(4) Module with outputs SAD(4)1 and SAD(4)2; SAD(16) Module with outputs SAD(16)1 and SAD(16)2; FSS Unit; two Temporary Buffers; two SMIN blocks; two COMPARATOR blocks; PMV Matrix; PRS Matrix; MVG; outputs mvX and mvY; Pipeline Stage 1 and Pipeline Stage 2.]

6. CONCLUSION

We have presented an architectural implementation for the MIME algorithm. It significantly reduces the number of memory accesses, which reduces the overall power consumption. As a consequence, the architecture exhibits almost double the speedup of the previous architecture. Since the speedup of the system increases, further power savings can be achieved by scaling the supply voltage in the last step. The resulting decrease in the throughput of the FRSS is compensated by the time saved earlier. Future work will study the issues involved in this scheme.

7. ACKNOWLEDGEMENTS

The authors acknowledge the support of the U.S. Department of Energy (DoE), EETAPP program, DE97ER12220 and the Governor's Information Technology Initiative.

8. REFERENCES

[1] C. Cafforio and F. Rocca, "Methods for measuring small displacements of television images," IEEE Trans. Inform. Theory, vol. IT-22, no. 5, pp. 573-579, Sept. 1976.

[2] M. Tekalp, Digital video processing, Prentice-Hall, Englewood Cliffs, NJ, 1995.

[3] J. Jain and A. Jain, "Displacement measurement and its applications in interframe coding," IEEE Trans. on Communications, vol. 29, no. 12, pp. 1799-1808, Dec. 1981.

[4] S. Kim, Y. Kim, K. Kim, H. Chung, K. Choi, Y. Kim and G. Jung, "A fast motion estimator for real time system," IEEE Trans. on Consumer Electronics, vol. 43, no. 1, pp. 24-33, 1997.

[5] W. Badawy and M. A. Bayoumi, "Algorithm-based low-power VLSI architecture for 2-D mesh video-object motion tracking," IEEE Trans. on Circuits and Systems for Video Technology, vol. 12, no. 4, April 2002.

[6] L. M. Po and W. C. Ma, "A novel four step search algorithm for fast block motion estimation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 6, pp. 313-317, June 1996.

[7] G. Yeh, Y. Lu, and J. Burr, "A low-power video motion estimation array processor," in 1996 Symposium on VLSI Circuits Digest of Technical Papers, June 1996, pp. 162-3.

[8] W. T. Shiue and C. Chakrabarti, "Memory exploration for low-power, embedded systems," IEEE Conference on Design Automation Conference, New Orleans, 1999, pp. 140-145.

[9] H. Mahmoud, S. Goel, M. Shaaban, T. Darwish and M. Bayoumi, "A low-power VLSI architecture for multi-stage interval-based motion estimation (MIME) algorithm," Proc. of the Intl. Workshop on Digital and Computational Video, 2002.

[10] Viet L. Do and Kenneth Y. Yun, "A low-power architecture for full-search block-matching motion estimation," IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, no. 4, pp. 393-398, August 1998.