IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 12, NO. 4, APRIL 2002
Algorithm-Based Low-Power VLSI Architecture for 2-D Mesh Video-Object Motion Tracking

Wael Badawy and Magdy Bayoumi
Abstract—This paper presents an algorithm-based low-power VLSI architecture for video object (VO) motion tracking. The new architecture uses a novel hierarchical adaptive structured mesh topology. The structured mesh offers a significant reduction in the number of bits that describe the mesh topology. The motion of the mesh nodes represents the deformation of the VO. The motion compensation is performed using a multiplication-free algorithm for affine transformation, which significantly reduces the complexity of the decoder architecture. Moreover, pipelining the affine unit contributes considerable power savings. The VO motion-tracking architecture is based on a new algorithm for tracking a VO. It consists of two main parts: a video object motion-estimation unit (VOME) and a video object motion-compensation unit (VOMC). The VOME processes two consecutive frames to generate a hierarchical adaptive structured mesh and the motion vectors of the mesh nodes. It implements parallel block-matching motion-estimation units to optimize the latency. The VOMC processes a reference frame, mesh nodes, and motion vectors to predict a video frame. It implements parallel threads in which each thread implements a pipelined chain of scalable affine units. This motion-compensation algorithm allows the use of one simple warping unit to map a hierarchical structure. The affine unit warps the texture of a patch at any level of the hierarchical mesh independently. The processor uses a memory serialization unit, which interfaces the memory to the parallel units. The architecture has been prototyped using a top-down low-power design methodology. The performance analysis shows that this processor can be used in online object-based video applications such as MPEG-4 and VRML.

Index Terms—Affine transformation, low bit-rate, low power, mesh-based motion tracking, motion compensation, motion estimation, motion tracking, video architecture, video object.
I. INTRODUCTION
Recent advances in the field of video-based communication applications, such as videophone and video-conferencing systems, concentrate mainly on minimizing the size and the cost of the encoding and decoding equipment. Real-time video systems process huge amounts of data and need a large communication bandwidth. Real-time video applications include strategies to compress the data into a feasible size. Among different data-compression techniques, video object (VO) representation—as addressed by MPEG-4 [9]—allows for content-based authoring, coding, distribution, search, and manipulation of video data. In MPEG-4, the VO refers to spatio–temporal data pertinent to a semantically meaningful part of the video. A 2-D snapshot of a VO is referred to as a VO plane (VOP) [10]. A 2-D triangular mesh is designed on the first appearance of the VOP as an extension of 3-D modeling. The vertices of the triangular patches are referred to as the mesh nodes. The nodes of the initial mesh are then tracked from VOP to VOP. Therefore, the motion vectors of the node points in the mesh represent the 2-D motion of each VO. The motion compensation is achieved by triangle warping from VOP to VOP using a spatial transformation [9]. Recently, hierarchical mesh representation has attracted attention because it provides rendering at various levels of detail. It also allows scalable/progressive transmission of the mesh geometry and motion vectors [10], [1]. The mesh can be coded at different resolutions to satisfy the bandwidth constraints and/or the QoS requirements. Hierarchical 2-D-mesh-based modeling of video sources has previously been addressed for the case of uniform topology only. The mesh is defined for a coarse-to-fine hierarchy, which was trivially achieved by subdividing each triangle/quadrangle into three or four subtriangles or quadrangles [12]. The Hierarchical Adaptive Structured Mesh (HASM) is a proposed technique for dividing the mesh based on its dynamics [7]. This approach generates a content-based mesh and optimizes the computational overhead required to send the mesh's geometry. The motion-estimation algorithm for the nodes is based on block matching [8], [11], which assumes that all pels in the block have the same motion parameters. Block matching, however, still requires tremendous computation.

Manuscript received December 10, 1999; revised August 11, 2001. This work was supported by the U.S. Department of Energy (DoE) under the EETAP Program (1997–2000), and by the LEQSF (1996–1999) RD-B-13, State of Louisiana. This paper was recommended by Associate Editor A. Luthra. W. Badawy is with the Department of Electrical and Computer Engineering, University of Calgary, Calgary, AB T2N 1N4, Canada (e-mail: [email protected]). M. Bayoumi is with the Center for Advanced Computer Studies, University of Louisiana, Lafayette, LA 70504-4330 USA (e-mail: [email protected]). Publisher Item Identifier S 1051-8215(02)04286-6.
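The block-matching step above, in which all pels of a block are assumed to share one motion vector, can be sketched as an exhaustive search over a [-7, 7] window. This is an illustrative sketch only; the frame layout, block size, and function names are assumptions, not the paper's implementation.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-size 2-D blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def block_match(ref, cur, bx, by, bsize=8, search=7):
    """Full-search block matching: every pel in the block is assumed to share
    one motion vector; returns the (dx, dy) minimizing the SAD, plus the SAD."""
    h, w = len(ref), len(ref[0])
    block = [row[bx:bx + bsize] for row in cur[by:by + bsize]]
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + bsize > w or y + bsize > h:
                continue  # candidate block falls outside the reference frame
            cand = [row[x:x + bsize] for row in ref[y:y + bsize]]
            cost = sad(block, cand)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best
```

The exhaustive scan is what makes block matching computationally heavy: a [-7, 7] window means 225 candidate positions per block, which motivates faster searches such as the three-step method cited later.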
The motion-compensation algorithm uses a reference frame, a mesh plane, and node motion vectors to predict a frame. For each triangular mesh patch, the algorithm computes the affine mapping coefficients. It uses the coefficients to warp the texture from a reference frame to a predicted frame [5]. The computationally intensive nature of video applications and the demand for real-time processing necessitate the use of VLSI implementation. With the rapid advance of VLSI technology, the attributes of parallelism, pipelining, concurrency, and regularity have become a new set of criteria in designing hardware for digital processing. With highly sophisticated processing schemes at hand and further promising advances in video technology, efficient VLSI implementation is of great importance. Broad acceptance of new video applications critically depends on the availability of compact and inexpensive
hardware delivering the required high performance. Although conventional digital signal processors are highly optimized for processing speech and audio signals, they lack the required high performance for video signal processing. Programmable high-end general-purpose processors, as designed for the PC and workstation markets, may reach higher performance levels for desktop computing. However, they are typically weak at signal processing, and they are too expensive and consume considerable power for standalone video processing applications. Therefore, special architectural approaches are required for efficient hardware solutions delivering sufficient video processing performance at low cost [2], [3]. A survey of the literature reveals that only a few VLSI architectures exist for codecs using VO motion tracking. With the advances of mobile communication and personal communication services (PCSs), there is a trend to run video applications on mobile systems. High-speed data and computation-intensive processing will lead to much higher power consumption than in traditional portable applications. Due to the limited power-supply capability of current battery technology, current research efforts face two conflicting design challenges. One is to explore high-performance design and implementation techniques that can meet the stringent speed constraints of real-time systems. The second is to consider low-power design approaches to prolong the operating time of portable devices. Power consumption has been introduced as a major design parameter due to the fast-growing technology in mobile communications, which requires portable power supplies.
There are three main sources of power dissipation in a CMOS circuit: 1) switching power, which is consumed by the circuit capacitance during transistor switching; 2) short-circuit power, which is due to current flowing from power supply to ground during transistor switching; and 3) static power, which is due to leakage and static currents. The switching and short-circuit power are known as dynamic power, which constitutes about 99% of the total power dissipation in CMOS VLSI circuits. Dynamic power, P_dyn, is given by the following equation:

P_dyn = sum_i ( f * p_i * C_i * V_i * V_dd + V_dd * I_i )

where f is the system clock frequency; V_dd the power supply voltage; C_i the capacitance at node i; V_i the voltage swing at node i; p_i the switching probability at node i; and I_i the short-circuit current at node i. As system speed increases, the increase in power consumption becomes a major problem. The increase in the power dissipation will generate extra heat and reduce the reliability of a circuit. The development of techniques allowing low-power circuits is crucial to the practicality of future products. Due to its quadratic dependence on V_dd, lowering V_dd contributes a significant power saving. However, the delay in processing time will increase as V_dd decreases. To meet the throughput requirements, the delay must be compensated for at the system and algorithmic/architecture design level. Simply, a new version of the design that allows more latency is required.
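To illustrate the dynamic-power expression and its quadratic dependence on the supply voltage, the per-node contributions can be summed numerically; the node values used below are hypothetical.

```python
def dynamic_power(f_clk, vdd, nodes):
    """Dynamic power as in the text: each node i contributes a switching term
    f * p_i * C_i * V_i * Vdd plus a short-circuit term Vdd * I_i.
    `nodes` is a list of (C_i, V_i, p_i, I_i) tuples (hypothetical values)."""
    switching = sum(f_clk * p * c * v * vdd for c, v, p, i_sc in nodes)
    short_circuit = sum(vdd * i_sc for c, v, p, i_sc in nodes)
    return switching + short_circuit

# For a full-swing node (V_i tracking Vdd), halving Vdd quarters the
# switching term, which is the quadratic saving discussed in the text.
p_full = dynamic_power(100e6, 3.3, [(1e-12, 3.3, 0.5, 0.0)])
p_half = dynamic_power(100e6, 1.65, [(1e-12, 1.65, 0.5, 0.0)])
```

The factor-of-four ratio between the two results is the quadratic saving that motivates supply-voltage scaling, at the cost of the delay penalty discussed next.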
Low-power system-level approaches involve two main techniques. One is partitioning the system into modules and shutting down modules that are not working, such as in multichip modules (MCMs) [13]. The other is to use a monitor scheduler to distribute the workload over several functional units; the scheduler can be a hardware or software monitor. Besides the system-level low-power approaches, the algorithmic/architectural approach is the most promising one [14]. Algorithmic/architectural low-power design is achieved by reformulating the algorithms and mapping them onto efficient low-power VLSI architectures to compensate for the speed penalty caused by a low supply voltage. The power savings at the algorithmic/architectural level are in the range of 70%–90% [3]. The current goal is to reduce the total power dissipation of electronic systems to two orders of magnitude less than what it would have been with conventional technology [15]. Approximations and parallel and pipelined design approaches are used to compensate for the delay penalty [14], [16]. Low-power and high-performance architectures can be achieved using approximation approaches, which trade performance against circuit cost; pipelining, which trades the overhead of extra buffers for latency; and parallel architectures, which trade area for latency. In the approximation approaches, the constraints are relaxed in order to reduce the cost and yield acceptable performance. Approximation approaches generally lead to much simplified architectures, which then add to the savings in total implementation cost. For example, three-step motion estimation [17] is usually employed to reduce the computational complexity while maintaining a performance comparable with full-search motion-estimation schemes [18]. Pipelining is a commonly used technique to achieve high-speed data processing.
The main idea behind pipelining a design is shortening the delay through the critical path by inserting latches between consecutive pipeline stages. As a result, the speed of the system is faster than that of the original system at the penalty of increasing the latency by the number of pipelining stages. On the other hand, pipelining can be used to compensate for the delay incurred in a low-power design when the supply voltage drops significantly. Some derivations of the pipelining techniques are asynchronous circuit pipelining [19], two-level (word/bit-level) pipelining [20], and bit-level pipelining approaches [21]. Parallel data processing decomposes the desired function into independent and parallel small tasks. These tasks are executed concurrently, and the individual results are combined to obtain the desired computation result. It is a "divide-and-conquer" strategy for solving problems. The fast Fourier transform (FFT) [22] is a famous example, where the algorithm first decomposes an N-point FFT into two N/2-point FFTs and expands the computation recursively to form a parallel computing network. This paper presents an algorithm-based low-power VLSI architecture for video-object motion tracking. The architecture is divided into two main parts: the mesh-based motion-estimation block and the mesh-based motion-compensation block. The mesh-based motion-estimation block uses parallel node motion-estimation units. The mesh-based motion-compensation block uses parallel pipelined threads, which implement
Fig. 1. Video frame representation. (a) Original frame. (b) Frame background. (c) Object texture. (d) Binary layer. (e) Hierarchical mesh layer.
scalable affine units. The affine units can be used with any level of the hierarchical mesh, and they warp the mesh patches independently. This architecture can be used as a building block for object-based video applications, such as MPEG-4 and VRML. The paper is organized as follows. Section II describes the proposed low-power mesh-based motion-estimation and compensation techniques, Section III presents the proposed architecture and its main building blocks, Section IV presents the prototype and the performance measures for the proposed architecture, and Section V concludes this paper.

II. VIDEO-OBJECT MOTION TRACKING MODEL

A 2-D mesh is a planar graph that partitions a 2-D image region into polygonal patches consisting of triangles or quadrangles. The vertices of the polygonal patches are referred to as node points. Mesh-based motion modeling differs from block-based motion modeling in that the polygonal patches in the current frame are deformed by the movement of the node points into polygonal patches in the reference frame. The texture inside each patch in the reference frame is warped into the current frame using parametric mapping as a function of the node-point motion vectors. Affine mapping is a common choice and is recommended by MPEG-4 [9]. An advantage of the mesh model over the block-based model is its ability to represent more general types of motion, such as rotation, scaling, reflection, translation, and shear. However, a drawback of mesh-based models is that they do not allow discontinuities in the motion field [5], [6]. The video-object tracking model tracks the changes of the video frames by tracking independent VOs. The video frame is segmented into independent VOs, and each object is represented using a hierarchical mesh and texture [7]. The motion vectors of the mesh nodes carry the transformation of the VO across the video sequence.
The deformation of a triangular patch represents the motion of the image points on a common rigid plane [9]. The predicted images are synthesized by mapping the patches of the previous frame into the corresponding patches of the current frame. This mapping is easily realized, since the transformation between two triangles is described by a 2-D affine transform. A hierarchical mesh-based motion estimation and compensation technique is proposed in [7]. It uses a hierarchical mesh representation with a multiplication-free affine transformation. The basic idea is to divide the image frame [Fig. 1(a)] into a background [Fig. 1(b)], a VO texture [Fig. 1(c)], a binary layer [Fig. 1(d)], and a hierarchical triangular patch layer [Fig. 1(e)], and then estimate the motion vectors of the nodes of the triangular patches.
Fig. 2. Proposed architecture.
Fig. 3. Proposed sequence of the motion vectors for level L.
The motion-estimation algorithm processes two frames and generates a 2-D mesh and the motion vector for each mesh node. The motion-compensation algorithm processes a reference frame, a mesh, and motion vectors to predict a frame. The predicted frames are synthesized by mapping the patches of the previous frame onto the corresponding patches of the current frame. The image construction uses an affine transformation to map the patch texture. It has been proven that the affine transformation has a small computational cost regardless of the shape of the patch [6].

III. PROPOSED ARCHITECTURE

The proposed architecture is divided into a video object motion-estimation unit (VOME) and a video object motion-compensation unit (VOMC). The VOME increases the level of parallelism by using more than one block-matching motion-estimation unit. The VOMC uses parallel scalable affine units to reduce the latency of the warping operation. Note that the use of
Fig. 4. Growth of the mesh construction. (a) Level 0. (b) Level 1. (c) Level 3.
the multiplication-free affine transformation reduces the complexity of the warping unit as well.

A. The VOME

The block diagram for the VOME is shown in Fig. 2. It consists of: 1) two memory blocks, the first one (C frame) containing the reference frame and the second (P frame) containing the current frame; 2) a stack unit, which stores the motion vector information needed to perform the recursive triangulation operations; 3) a register array, which stores the motion vectors and node locations; and 4) a control unit, which synchronizes the operation between the different blocks. The architecture processes the initial mesh nodes in sequence, as shown in Fig. 3, to maximize the reuse of the data and minimize the internal storage. The architecture, as shown in Fig. 2, implements three motion-estimation units to increase the level of parallelism. In the presented prototype, each triangular patch at a given level is split into four triangles, which requires the addition of three nodes. As shown in Fig. 4, the mesh at level 0 is comprised of regular triangular patches. At level 1, new nodes are added. The addition of the new nodes depends on the difference between the motion vectors of each patch. Note that each time three nodes are added, at most three motion vectors are estimated (each node is estimated only once). The operation of the VOME is summarized in Fig. 5. The mesh-based motion-estimation architecture uses ten states. To describe the operation, we assume a square frame and a mesh that starts at level 0 with eight triangular patches, as shown in Fig. 6. In State 1, the C and P frames are loaded into the C and P buffers. The size of each buffer is the size of the frame; to process the QCIF format, the size of each frame buffer is 176 × 144 bytes. In State 2, the stack is initialized. The mesh nodes at level 0 are added as shown in Fig. 7, where the stack has nine entries, each of which represents a node, and the white row is the top of the stack. The stack entry is shown in Fig. 8.
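The four-way split described above (three nodes added per split) can be sketched as midpoint subdivision of a triangular patch; the helper name and node ordering here are assumptions for illustration, not the paper's exact construction.

```python
def split_patch(tri):
    """Split a triangular patch into four subtriangles by inserting a new
    node at the midpoint of each edge (three new nodes per split)."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    m12 = ((x1 + x2) // 2, (y1 + y2) // 2)  # new node on edge 1-2
    m23 = ((x2 + x3) // 2, (y2 + y3) // 2)  # new node on edge 2-3
    m31 = ((x3 + x1) // 2, (y3 + y1) // 2)  # new node on edge 3-1
    return [
        [(x1, y1), m12, m31],  # corner subtriangle at node 1
        [(x2, y2), m23, m12],  # corner subtriangle at node 2
        [(x3, y3), m31, m23],  # corner subtriangle at node 3
        [m12, m23, m31],       # center subtriangle
    ]
```

The three midpoints are exactly the three added nodes, so at most three new motion vectors are estimated per split, matching the text.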
The size of each entry proposed for the prototype is as follows: three nodes for each patch, using six bits each for the node x, y locations to represent the QCIF format, and four bits for each motion-vector value, since the search window size is [−7, 7]. The motion values are normalized to [0, 14], and the value 15 is used as the initial value to indicate that the motion is not estimated yet. Two bits are used for representing the level of the triangle, and two bits are used to define the patch type. The patch type is
Fig. 5. Mesh coding and motion-estimation FSM.
Fig. 6. Addition of the first entry in level L1 to the current mesh.
as defined in Fig. 6; initially, we have only two types, which define the orientation of the patch. After a split, we will have four different types of patches. The patch type is used for coding purposes. The total size of the entry in our case is 64 bits; the stack size for a mesh is the number of pending entries times the number of bits per entry, and it grows with the number of nodes and hierarchy levels. State 2 loads the three nodes from the stack and places them in the register array, as shown in Fig. 9(a). State 3 checks whether a motion vector is equal to 15 (not yet estimated); if so, State 4 estimates the motion vectors and places them in the register array, as shown in Fig. 9(b).
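A sketch of packing one 64-bit stack entry from the field widths given above: three nodes at 6 bits per coordinate, three motion vectors at 4 bits per component (normalized to [0, 14], with 15 meaning "not yet estimated"), a 2-bit level, and a 2-bit patch type. The field ordering inside the word is an assumption.

```python
def pack_entry(nodes, mvs, level, ptype):
    """Pack one stack entry into a 64-bit word:
    3 nodes x (6-bit x + 6-bit y) = 36 bits,
    3 motion vectors x (4-bit x + 4-bit y) = 24 bits,
    2-bit level, 2-bit patch type (field order is an assumption)."""
    word = 0
    for x, y in nodes:            # three patch nodes
        word = (word << 6) | (x & 0x3F)
        word = (word << 6) | (y & 0x3F)
    for mx, my in mvs:            # normalized to [0, 14]; 15 = not estimated
        word = (word << 4) | (mx & 0xF)
        word = (word << 4) | (my & 0xF)
    word = (word << 2) | (level & 0x3)
    word = (word << 2) | (ptype & 0x3)
    return word
```

The widths add up exactly: 3 × 12 + 3 × 8 + 2 + 2 = 64 bits, matching the entry size stated in the text.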
Fig. 7. Stack initialization.
Fig. 8. Stack word entry in bits.
Fig. 9. Register array. (a) After loading the first patch. (b) After estimating the motion vectors. (c) After splitting occurs.
Fig. 10. The stack after the first split.
Note that State 4 will be executed only for patches in level 0. State 5 checks the splitting condition. If a split occurs, the x, y locations of the three new nodes are placed in the register array in State 7. State 8 estimates the motion vectors for the three new nodes. In State 9, the nodes are pushed onto the stack in the proper node order. Fig. 10 shows the stack after splitting the first patch. State 6 checks if the stack is empty; if so, it will load two new frames and the system will perform another cycle. The splitting unit, as shown in Fig. 11, tests the condition for splitting. It compares the motion-vector components for the three nodes of the triangular patch. In this implementation, a 4-bit motion-vector representation is considered with a threshold value of four, so the difference is detected by matching the two most significant bits.

B. The VOMC Architecture
Fig. 11. Mesh splitting logic.

The motion-compensation architecture is shown in Fig. 12, and its operation is summarized in Fig. 14. It uses four scalable affine units, each having independent access to the frame buffer. The frame buffer used to map an image must be padded beyond the frame boundary, since the motion vector value is limited by (−7,
Fig. 12. Block diagram for the mesh-based motion-compensation architecture.
Fig. 13. Frame allocation for Lena (512 × 512).
7). In this prototype, the frame buffer is divided into four quadrants, as shown in Fig. 13. The buffers are not disjoint; there is an overlap region between buffers to accommodate the parallel operation. The mesh coding buffer unit loads the hierarchical mesh. The memory serialization is used to synchronize writing to the current frame. The control unit synchronizes the affine unit operations.

C. The Scalable Affine Unit

The scalable affine unit warps the triangular patch. It is scalable in the sense that it uses more cascaded units as the number of mesh hierarchy levels increases. The block diagram for the affine unit is shown in Fig. 15. The unit consists of a queue to store the mesh at a given level. The triangular patch is split into four patches, which are subsequently stored in the queue. If the patch
Fig. 14. Motion-compensation FSM.
Fig. 15. Block diagram for scalable affine unit.
Fig. 16. Patch warping unit.
needs to be triangulated, the patch will be forwarded to another affine unit. The structure of the warping unit is shown in Fig. 16, and the steps of evaluating the constants are shown in Fig. 17. The operation of the triangular texture warping is summarized in Figs. 18 and 19. The main unit warps the triangular patches (by mapping the texture). The unit uses an affine transformation defined as follows:

x' = a1 x + a2 y + a3
y' = a4 x + a5 y + a6
The affine transformation can be calculated after the constants a1–a6 are evaluated. Instead of using four multiplications and four additions for calculating the (x', y') pair in the two equations above, a simple scan-line algorithm can be used, which needs only two additions per pixel to generate the (x', y') pair. The algorithm is based on the fact that the pixel coordinates are integers, so each mapped position can be evaluated from that of the neighboring pixel as in the following equations:

x'(x + 1, y) = x'(x, y) + a1
y'(x + 1, y) = y'(x, y) + a4
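A sketch of this scan-line evaluation, assuming the standard affine form x' = a1·x + a2·y + a3, y' = a4·x + a5·y + a6 (the coefficient naming is an assumption): after the first pixel of a row, each step along x needs only the two additions x' += a1 and y' += a4.

```python
def affine_scanline(a1, a2, a3, a4, a5, a6, width, y):
    """Map one scan line with two additions per pixel: start from the mapped
    position of (0, y), then add a1 to x' and a4 to y' for each step in x."""
    xp = a2 * y + a3  # x'(0, y)
    yp = a5 * y + a6  # y'(0, y)
    out = []
    for _ in range(width):
        out.append((xp, yp))
        xp += a1      # x'(x + 1, y) = x'(x, y) + a1
        yp += a4      # y'(x + 1, y) = y'(x, y) + a4
    return out
```

Only the first pixel of each line pays the full multiply-and-add cost; every subsequent pixel is produced by the two incremental additions, which is what makes a multiplier-free warping unit possible.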
This unit calculates the affine mapping parameters. These constants can be evaluated at the three nodes
Fig. 17. The affine unit FSM.
of the triangular patch. The general equations for the parameters are obtained by imposing the affine mapping at the three vertices of the patch, whose motion vectors and locations are known. To simplify the evaluation of these parameters, the triangular patch is translated to the origin (0, 0), where the parameters are evaluated and the texture is mapped. This is followed by an inverse translation to its correct position.
Fig. 21. Adaptive structured triangulation.
Fig. 22. Adaptive structured mesh code.
Fig. 23. Memory serialization unit.
Fig. 24. Timing diagram for the memory serialization unit.
Fig. 18. Triangular texture warping FSM.
Fig. 19. Texture warping FSM.
Fig. 20. Initial triangular mesh.
Since the divisor in these parameter evaluations is selected to be a power of 2, the division can be replaced by a shift operation.

D. The Mesh Code Buffer

The mesh code buffer accepts the adaptive structured mesh. The mesh is initially a regular triangular mesh, and each triangular patch is split hierarchically. This triangulation is represented by a sequence of "1" and "0", where each "1" represents a split (Fig. 22). Each split is associated with three motion vectors that represent the motion of the three nodes. The structured hierarchical mesh is a technique borrowed from 3-D modeling for mesh construction. Fine-to-coarse and coarse-to-fine are two approaches used to define the hierarchical mesh representation. The former generates a full-resolution mesh and then removes selected nodes to produce the hierarchy. In the latter, the number of nodes and triangles increases gradually based on a specified threshold function, and nodes are added by successive triangulation to minimize the mesh coding. In this paper, the coarse-to-fine mesh method is used.
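The "1"/"0" split coding described above can be sketched as a depth-first decoder, where a "1" splits the current patch into four recursively coded subpatches and a "0" marks a leaf patch; the depth-first traversal order is an assumption.

```python
def count_patches(code):
    """Decode a split bitstream: '1' splits the current patch into four
    subpatches (each coded recursively), '0' marks a leaf patch.
    Returns the number of leaf patches described by the code."""
    bits = iter(code)

    def walk():
        if next(bits) == "0":
            return 1                           # leaf patch
        return sum(walk() for _ in range(4))   # four recursively coded children

    return walk()
```

For example, the code "10000" describes one split into four leaf patches; in the actual mesh code, each "1" also carries three node motion vectors.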
The details of the mesh construction are as follows. The initial coarse regular triangular mesh is shown in Fig. 20. For each triangular patch, a successive triangulation splits the patch into four triangles at a single instance; the successive triangulation is shown in Fig. 21. In this approach, an adaptive triangulation generates four triangular
Fig. 25. Prototyping steps.
patches. Fig. 22 shows the coding, where "1" means that the patch is split into four subtriangles and "0" means that there is no further splitting [4].

E. Memory Serialization Unit

The memory serialization unit is used to serialize the memory writing; write requests are issued by the parallel affine-unit architectures. Fig. 23 shows the structure of the memory serialization unit for two parallel affine units. This structure can be extended to any number of parallel units. Each affine unit generates the memory address and data for the predicted frame, together with a data-valid signal, which indicates data that needs to be written to the memory; a control line serializes the parallel data to be transferred to the main memory. The prototype implements 8-bit data with a 16-bit address bus. Fig. 24 shows the timing diagram for the memory serialization unit. It considers four parallel affine units. The writing process is scheduled among the four units using the serialization control signals. Notice that the memory write control signal MW is raised every other clock cycle, which allows the use of a slow on-chip cache. Using the QCIF format at 15 fps, 188 × 172 × 2 × 15 memory write cycles/s are needed. The minimum frequency will be 1.031 MHz.

IV. ARCHITECTURE PROTOTYPE

The architecture has been prototyped using 0.6-μm CMOS technology with three layers of metal. The design methodology used in the prototyping is shown in Fig. 25. Synopsys tools are
used for the "front-end" steps, which capture and simulate the desired behavior of the architecture, while Cadence tools are used for the "back-end" steps. The behavior of the architecture is modeled using Verilog and is synthesized, simulated, and analyzed using Synopsys. Then, Cadence is used for schematic capture and modification of the gate-level Verilog code to generate the final layout. The technology and symbol libraries used in both Synopsys and Cadence for synthesis and simulation are generated based on the standard-cell library. This guarantees that the synthesized design has simulation results close to those of the final layout. Transferring the design from Synopsys to Cadence is accomplished by generating a Verilog netlist, which creates symbol and schematic views for all levels of the design hierarchy. Power and ground pads are then added to the top-level schematic. Two groups of simulations are used: one tests the correctness of the design at the different design steps, and the other measures different performance parameters of the proposed architecture. Several verification steps are involved in the design flow. The first uses Verilog simulation tools to verify the correctness of the architecture's behavior. The second is applied to confirm the functionality of the extracted gate-level circuit. The third ensures that the gate-level schematic, which has been imported and edited in Cadence, is equivalent to the proposed architecture. These are followed by design rule checking (DRC) and layout-versus-schematic (LVS) verification, which are done with Cadence Diva. Once LVS passes, the layout is assumed to be functionally correct. The layout of the prototype has a total area of 4807.20 × 4183.96 μm².
TABLE I
POWER AND DELAY OF DIFFERENT BUILDING BLOCKS

TABLE II
COMPARISON OF SOME CORES

The layout is tested at 3.3 V. The delay for the VOME and VOMC is 29.7 and 58.9 ns, respectively, while the power consumption is 589.58 mW for the VOME and 562.45 mW for the VOMC. The delay and power consumption, as estimated by PowerMill and HSPICE simulations for the different building blocks, are shown in Table I. The video-object motion-tracking architecture is based on a new algorithm for tracking a VO. It consists of two main parts: a VOME and a VOMC. The proposed architecture captures the dynamics of the video frames and introduces more nodes wherever there are different motion components. It generates a hierarchical adaptive mesh representation using a structured technique. The proposed architecture reduces the power consumption at both the algorithm level and the arithmetic level. At the algorithm level, it uses the following.
1) A simple mesh-construction algorithm, which reduces the complexity of the mesh generation unit compared to the Delaunay mesh approaches [1], [6].
2) A coarse-to-fine mesh structure that can be represented using a count of patch-splitting operations, which significantly reduces the bit count of the mesh code. The reduction in the bit count reduces the processing time, latency, and power consumption.
3) A real-time algorithm to estimate the motion of the mesh nodes. It uses the real-time three-step motion estimation, which processes fewer blocks and consumes less power. Moreover, it allows parallel processing, which reduces the latency of the algorithm.
4) A new multiplication-free algorithm for texture warping. The texture-warping algorithm uses finite-state analysis and shift operations to reduce the complexity to only addition operations instead of multiplication operations. The use of only adders leads to less power consumption than the use of multipliers. Moreover, the algorithm implements parallel threads in which each thread implements a pipelined chain of scalable affine units. It reduces the power consumption by using a simple pipelined core that increases the throughput, while the use of parallel units reduces the latency.
A brief comparison with some motion-estimation chips is given in Table II. The clock rate shown is the working frequency for real-time application. Table II shows that the design proposed here has superior performance in power, flexibility, and functionality over the previously proposed designs.

V. CONCLUSIONS
This paper presents a state-of-the-art algorithm-based low-power VLSI architecture for video mesh-based object motion tracking. The new architecture uses a novel hierarchical mesh-coding technique. It optimizes the mesh coding using an adaptive structured mesh to capture the contents of the object. The structured mesh offers a significant reduction in the mesh topology coding. The motion compensation is performed using a multiplication-free affine transformation, which significantly reduces complexity of the decoder architecture. Moreover, pipelining the affine unit contributes a considerable power savings. The architecture is prototyped using 0.6- m CMOS technology with three layers of metal. The performance results of the prototype show that the architecture can be used for online applications and the power consumption shows that it is suitable for mobile applications. REFERENCES [1] A. M. Tekalp, P. J. L. van Beek, C. Toklu, and B. Gunsel, “2D mesh-based visual object representation for interactive synthetic/natural video,” Proc. IEEE, vol. 86, pp. 1029–1051, June 1998. [2] L. Chiariglione, “MPEG and multimedia communications,” IEEE Trans. Circuits Syst. Video Technol., vol. 7, pp. 5–18, Feb. 1997. [3] K. J. Ray Liu et al., “Algorithm-based low power and high-performance multimedia signal processing,” Proc. IEEE, vol. 86, pp. 1155–1202, June 1998. [4] W. M. Badawy, G. Zhang, and M. A. Bayoumi, “VLSI architecture for hierarchical mesh-based motion estimation,” in Proc. 1999 IEEE Workshop on Signal Processing Systems (SiPS), Taipei, Taiwan, R.O.C., Oct. 20–22, 1999. [5] G. Wolberg, Digital Image Warping. Los Alamatos, CA: Computer Soc. Press, 1990. [6] P. E. Eren, C. Toklu, and A. M. Tekalp, “Object-based manipulation and composition using 2D meshes in VRML,” in Proc. IEEE Signal Processing 1st Workshop Multimedia Processing, Princeton, NJ, June 1997, pp. 257–261. [7] W. Badawy and M. 
Bayoumi, "On minimizing hierarchical mesh coding overhead: (HASM) hierarchical adaptive structured mesh approach," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Istanbul, Turkey, June 5–9, 2000, pp. 1923–1926.
[8] F. Rocca and S. Zanoletti, "Bandwidth reduction via movement compensation on a model of the random video process," IEEE Trans. Commun., vol. COM-20, pp. 960–965, Oct. 1972.
[9] MPEG-4 Overview, ISO/IEC JTC1/SC29/WG11 N1730, July 1997. [Online]. Available: http://drogo.cselt.it/mpeg
[10] P. J. L. van Beek, A. M. Tekalp, and A. Puri, "2D mesh geometry and motion compression for efficient object-based video representation," in Proc. Int. Conf. Image Processing '97, Santa Barbara, CA, Oct. 1997.
[11] S. Brofferio and F. Rocca, "Interframe redundancy reduction of video signals generated by translating objects," IEEE Trans. Commun., vol. COM-25, pp. 448–455, Apr. 1977.
[12] C. L. Huang and C.-Y. Hsu, "A new motion compensation method for image sequence coding using hierarchical grid interpolation," IEEE Trans. Circuits Syst. Video Technol., vol. 4, pp. 42–51, Feb. 1994.
[13] S. A. Khan and V. K. Madisetti, "System partitioning of MCM for low power," IEEE Design Test Comput. Mag., vol. 12, pp. 41–52, Spring 1995.
[14] A. P. Chandrakasan and R. W. Brodersen, "Minimizing power consumption in digital CMOS circuits," Proc. IEEE, vol. 83, pp. 498–523, Apr. 1995.
[15] H. K. Thapar and J. Cioffi, "A block processing method for designing high-speed Viterbi decoders," in Proc. IEEE Communications Conf., June 1989, pp. 1096–1100.
[16] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," IEEE J. Solid-State Circuits, vol. 27, pp. 473–484, Apr. 1992.
[17] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, "Motion-compensated interframe coding for video conferencing," in Proc. National Telecomm. Conf., New Orleans, LA, 1981, pp. G.5.3.1–G.5.3.5.
[18] J. R. Jain and A. K. Jain, "Displacement measurement and its application in interframe image coding," IEEE Trans. Commun., vol. COM-29, pp. 1799–1808, Dec. 1981.
[19] T. H. Meng, R. W. Brodersen, and D. Messerschmitt, "Asynchronous design for programmable digital signal processors," IEEE Trans. Signal Processing, vol. 39, pp. 939–952, Apr. 1991.
[20] K. J. R. Liu, S.-F. Hsieh, and K. Yao, "Systolic block householder transformation for RLS algorithm with two-level pipelined implementation," IEEE Trans. Acoust., Speech, Signal Processing, vol. 40, pp. 946–958, Apr. 1992.
[21] C.-L. Wang, "Bit-serial VLSI implementation of delayed LMS adaptive FIR filter," IEEE Trans. Signal Processing, vol. 42, pp. 2169–2175, Aug. 1994.
[22] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[23] G. Fujita, T. Onoye, and I. Shirakawa, "A new motion estimation core dedicated to H.263 video coding," Proc. IEEE Circuits and Systems, vol. 2, pp. 1161–1164, 1997.
[24] S. K. Rao, M. Hatamian, M. T. Uyttendaele, S. Narayan, J. H. O'Neill, and G. A. Uvieghara, "A versatile and powerful chip for real time motion estimation," in Proc.
IEEE ISSCC, 1993, pp. 32–33.
[25] K. Ishihara, S. Masuda, S. Hattori, H. Nishikawa, Y. Ajioka, T. Yamada, H. Amishiro, S. Uramoto, M. Yoshimoto, and T. Sumi, "A half-pel precision MPEG2 motion-estimation processor with concurrent three-vector search," IEEE J. Solid-State Circuits, vol. 30, pp. 1502–1509, Dec. 1995.
[26] A. Ohtani, Y. Matsumoto, M. Gion, H. Yoshida, T. Araki, A. Ubukata, M. Serizawa, K. Aoki, A. Sota, A. Nagata, and K. Aono, "A motion estimation processor for MPEG2 video real-time encoding at wide search range," in Proc. IEEE CICC, 1995, pp. 405–408.
[27] J. F. Shen and L. G. Chen, "Low power full-search block-matching motion estimation chip for H.263+," Proc. IEEE Circuits and Systems, vol. 4, pp. 299–302, 1999.
Wael Badawy received the B.Sc. and M.Sc. degrees from the Department of Computer Science and Automatic Control Engineering, University of Alexandria, Alexandria, Egypt, and the M.S. and Ph.D. degrees from the Center for Advanced Computer Studies, University of Louisiana at Lafayette. He is currently an Assistant Professor in the Department of Electrical and Computer Engineering, University of Calgary, Calgary, AB, Canada. His research interests include VLSI architectures for low bit-rate video applications, digital video processing, video libraries, watermarking, spatial information, low-power design methodologies, microelectronics, and VLSI prototyping. His research involves designing new models, techniques, algorithms, architectures, and low-power prototypes for MPEG consumer products. He has authored and co-authored more than 60 refereed journal/conference papers and about 20 technical reports. He is the Guest Editor for the Special Issue on System on Chip for Real-Time Applications in the Canadian Journal on Electrical and Computer Engineering. Dr. Badawy is the Technical Chair for the 2002 International Workshop on SoC for Real-Time Applications, and a Technical Reviewer for several IEEE journals and conferences. He is currently a member of the IEEE Circuits and Systems Society's Technical Committee on Communication. He was the recipient of the 2002 Petro Canada Young Innovator Award, the 2001 Micralyne Microsystems Design Award, the 1998 Upsilon Pi Epsilon Honor Society Award, and the 1998 IEEE Computer Society Award for Academic Excellence in Computer Disciplines.
Magdy A. Bayoumi received the B.Sc. and M.Sc. degrees in electrical engineering from Cairo University, Cairo, Egypt, the M.Sc. degree in computer engineering from Washington University, St. Louis, and the Ph.D. degree in electrical engineering from the University of Windsor, Canada. He has been a faculty member at the University of Louisiana at Lafayette (UL) since 1985. Currently, he is the Director of the Center for Advanced Computer Studies (CACS), the Department Head of the Computer Science Department, and the Edmiston Professor of Computer Engineering and Lamson Professor of Computer Science at the Center for Advanced Computer Studies, all at UL. His research interests include VLSI design methods and architectures, low-power circuits and systems, digital signal processing architectures, parallel algorithm design, computer arithmetic, image and video signal processing, neural networks, and wideband network architectures. He has edited and co-edited three books in the area of VLSI Signal Processing. Dr. Bayoumi was an Associate Editor of the IEEE TRANSACTIONS ON VERY LARGE-SCALE INTEGRATION SYSTEMS, the IEEE TRANSACTIONS ON NEURAL NETWORKS, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–II, and for Circuits and Devices Magazine. He is an Associate Editor of Integration, the VLSI Journal, and the Journal of VLSI Signal Processing Systems. He is a Regional Editor for the VLSI Design Journal and is on the Advisory Board of the Journal on Microelectronics Systems Integration. He was the Vice President for the Technical Activities of the IEEE Circuits and Systems (CAS) Society, and is the Chairman of the Technical Committees on CAS for Communication and for Signal Processing Design and Implementation. 
He was a founding member and Chairman of the VLSI Systems and Applications Technical Committee, is a member of the Neural Network and Multimedia Technology Technical Committees, has been on the Technical Program Committee for ISCAS, and was the Publication Chair for ISCAS'99. He was the General Chairman of the 1994 MWSCAS and is a member of the Steering Committee of this symposium, and was the Co-Chairman of the Workshop on Computer Architecture for Machine Perception (1993) and is a member of the Steering Committee of this workshop. He was the General Chairman for the 8th Great Lakes Symposium on VLSI (1998) and for the Workshop on Signal Processing Design and Implementation (2000). He served on the Distinguished Visitors Program for the IEEE Computer Society (1991–1994) and is on the Distinguished Lecture Program of the CAS Society. He is the Faculty Advisor for the IEEE Computer Student Chapter at UL, and was the recipient of the university's Distinguished Professor Award in 1993 and the Researcher of the Year Award in 1998.