This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
Improved SIMD Architecture for High Performance Video Processors
Wing-Yee Lo, Daniel Pak-Kong Lun, Member, IEEE, Wan-Chi Siu, Senior Member, IEEE, Wendong Wang, and Jiqiang Song, Senior Member, IEEE
Abstract—SIMD execution is without doubt an efficient way to exploit the data level parallelism in image and video applications. However, SIMD execution bottlenecks must be tackled in order to achieve high execution efficiency. In this paper, we first analyze the implementation of two major kernel functions of H.264/AVC, namely SATD and subpel interpolation, on conventional SIMD architectures to identify the bottlenecks of traditional approaches. Based on the analysis results, we propose a new SIMD architecture with two novel features: (1) a parallel memory structure with variable block size and word length support; and (2) a configurable SIMD structure. The proposed parallel memory structure gives programmers great flexibility to perform data access with different block sizes and different word lengths. The configurable SIMD structure allows almost “random” register file access and slightly different operations in the ALUs inside the SIMD unit. The new features greatly benefit the realization of H.264/AVC kernel functions. For instance, the fractional motion estimation, particularly the half to quarter pixel interpolation, can now be executed with minimal or no additional memory access. Compared with conventional SIMD systems, the proposed SIMD architecture achieves a further speedup of 2.1X to 4.6X when implementing H.264/AVC kernel functions. Based on Amdahl’s law, the overall speedup of an H.264/AVC encoding application is projected to be 2.46X. We expect that significant improvement can also be achieved when applying the proposed architecture to other image and video processing applications.
Index Terms—Configurable SIMD, Parallel memory structure, SIMD bottlenecks, video codec processor
I. INTRODUCTION
Manuscript received September 23, 2009; revised April 27, 2010 and December 13, 2010. This work was supported in part by the Hong Kong Polytechnic University under grant no. 1-BB9B. Most of the research work and implementation development were done in the Hong Kong Applied Science and Technology Research Institute (ASTRI) and Beijing SimpLight Nanoelectronics Ltd. Wing-Yee Lo, Daniel Pak-Kong Lun and Wan-Chi Siu are with the Centre for Signal Processing of the Department of Electronic and Information Engineering of the Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong (e-mail: [email protected]; enpklun@polyu.edu.hk; [email protected]). Wendong Wang is with SimpLight Nanoelectronics Ltd., Beijing, China (e-mail: [email protected]). Jiqiang Song is with the Intel Lab, Beijing, China (e-mail: [email protected]).
With the extensive use of image and video information in modern computer applications, the development of high performance image and video processing units has attracted
much interest from both academic researchers and VLSI system designers. Among the image and video processing operations that are performed in general computer applications, video coding is the most computation intensive operation that is often used as the benchmark to measure the performance of a video processor. For the rest of this paper, we shall focus on the realization of the state-of-the-art video coding standard H.264/AVC [1] and use it as an example to illustrate the merit of the proposed video processor design. To deal with the extremely high computational complexity of video coding, one common approach is to exploit the data level parallelism (DLP) in the execution. As different from application specific ASIC designs, a general purpose video processor should provide great flexibility for programmers while exploiting the parallelism in the execution. For this reason, the Single Instruction Multiple Data (SIMD) architecture is most suitable and is widely adopted. Two popular examples are Intel’s MMX/SSE1/SSE2/SSE3 [2] and Motorola’s AltiVec [3], where multimedia SIMD instruction set extensions have been added for efficient realization of video processing applications. In recent years, many researchers studied how much performance can be gained after using SIMD instructions in modern video codec [4]-[9]. Simulation results using reference model demonstrate that there is at least 2-12X speedup. A basic requirement to employ SIMD instructions is to possibly feed multiple data elements perfectly into vector registers so that the same computation operation can be applied. Although much research effort [6] [10]-[12] has been made to address the problem, there are often overheads and performance bottlenecks when aligning the multiple data to feed into vector registers. Extra memory loads and stores, unpacking, packing and shuffling are often required that prevent SIMD execution from achieving the peak performance. 
Besides, the memory mis-alignment, stride memory access, memory latency, random register file access and branch mis-prediction also prevent the processor from fetching data in a timely fashion to achieve peak throughput [13]-[15]. To address the aforementioned problems, our team has designed and implemented a new SIMD based video processor with architecture as shown in Fig. 1. Our video processor is a 5-stage pipeline multi-threaded multi-issue semi out-of-order superscalar processor. It supports a maximum of 4 threads of execution simultaneously. A maximum of 4 instructions can be
Copyright (c) 2011 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to
[email protected].
interleaved and loaded into multiple memory modules sequentially. In [16], a modulo addressing mode was introduced to allow part of the bytes in a word to be accessed from both ends of a circular buffer to reduce external memory bandwidth. Chang et al. [17] proposed adding one extra memory module in addition to the number of ALUs in the SIMD processor to solve the possible memory module conflict problem. However, the number of memory modules must be relatively prime to the supported stride values, resulting in a larger hardware cost in the address generation and shuffling logic. In [18], a scalable data alignment scheme was proposed for rectangular block data access using simpler memory address generation. It is achieved by using a two-dimensional notation for both pixel location and memory module number. However, it is not flexible enough to support variable block sizes. Note that block based data access is used in many image and video processing applications, while the block size can differ between algorithms. A general purpose video processor should therefore allow data access of variable block size and word length without greatly increasing hardware complexity. In [19]-[20], a video signal processor with a read-permuter and a write-transposer placed, respectively, before and after the vector register file was described. They facilitate data reorganization in the SIMD registers before execution, but N cycles are still needed to perform an NxN transpose operation. Seo et al. [21], on the other hand, introduced diagonal memory organization and programmable crossbars in their SIMD architecture. The diagonal memory organization allows horizontal and vertical memory access without any conflict.
Due to data access complexity in H.264 algorithm, 3 programmable crossbar shuffle networks are added such that any data shuffle patterns required by H.264 algorithm can be supported. However, in order to accommodate complex data access patterns, only predefined fixed pattern crossbars are implemented. This limitation requires the crossbar patterns to be pre-designed based on the algorithm. They may not be flexible enough to realize future algorithm enhancement or support new video coding standards efficiently. Besides, the 3 shuffle networks make the SIMD pipeline longer which may increase the branch mis-prediction penalty and execute-to-consume latency between pipeline stages. Another deficiency of the traditional approaches is that they do not have direct support to major kernel functions in image and video processing. It will be discussed in next section. III. ANALYSIS As mentioned above, we use video coding as an example to illustrate the deficiency of the traditional SIMD architectures in supporting image and video processing kernel functions. It is well known that motion estimation is the most computation intensive function in H.264/AVC encoders. It contributes to more than 50% among all computations [5][22]. If four reference frames are used, motion estimation alone accounts for more than 70% of computation [22]. The next intensive
function is DCT/IDCT that contributes to about 10-20% of computation. For H.264/AVC decoders, the most intensive functions are interpolation and inverse transform. They contribute to about 20% and 5-10% of computation respectively [7][11]. For intra-frame coders, the most complicated functions are SATD transform for cost generation and mode selection, intra prediction and DCT/Q/IDCT [23]. They contribute to about 57%, 20% and 16% of computation respectively. Therefore, if we can enhance the SIMD execution in motion estimation, transform and mode decision, the overall performance can be improved significantly. Among these functions, SAD, SATD, DCT/IDCT, and subpel interpolation are the main targets. In this section, we analyze two video encoding kernel functions in detail in order to demonstrate where the conventional SIMD architectures can be further enhanced. Example codes from the VideoLAN X264 open source project [23] are used to illustrate our findings. The VideoLAN X264 source uses the most popular Intel MMX/SSE1-3 instructions to realize SIMD functions.
A. 4x4 Block SATD
We first analyze the 4x4 SATD function in H.264/AVC. The function comprises several smaller sub-functions: memory load, subtraction, two-dimensional (2-D) Hadamard transform, transpose and summation. We went through the source codes of SATD in the VideoLAN X264 source [23]. The numbers of instructions used to complete these sub-functions with different block sizes are listed in TABLE II. The operations under the “Others” column are improved more easily by other techniques such as enhancing the SIMD instruction set extension. Hence they are not discussed here.

TABLE II
INSTRUCTION COUNT BREAKDOWN OF SATD, IDCT AND DCT IN VIDEOLAN X264.
Block     Memory                1-D 4x4               1-D 4x4                   MMX or
Size      Load    Subtraction   Transform  Transpose  Transform  Others  Total  SSE2
SATD4x4 (pixel_satd_ functions)
4x4       8       12            12         12         12         19      75     MMX
4x8       16      24            24         24         24         38      150    MMX
8x4       8       12            12         18         12         19      81     SSE2
8x8       16      24            24         36         24         38      162    SSE2
8x16      32      48            48         72         48         76      324    SSE2
16x8      32      48            48         72         48         76      324    SSE2
16x16     64      96            96         144        96         152     648    SSE2
DCT4x4DC (idct4x4dc functions)
4x4       4       0             12         12         12         9       49     MMX
IDCT4x4DC (idct4x4dc)
4x4       4       0             12         12         12         0       40     MMX
DCT4x4Residual (sub_dct functions)
4x4       8       12            14         12         14         0       60     MMX
8x8       16      24            28         36         28         3       135    SSE2
16x16     64      96            112        144        112        9       537    SSE2
IDCT4x4Residual (add_idct functions)
4x4       8       0             15         12         15         18      68     MMX
8x8       24      0             30         36         30         38      158    SSE2
16x16     96      0             120        144        120        140     620    SSE2

It can be seen that in the VideoLAN X264 source, the SIMD instructions used to realize memory load, subtraction, the two 1-D Hadamard transforms and transpose contribute about 75% of the total instructions in
4x4 SATD function for different block sizes. In fact, as can be seen in TABLE II, these sub-functions are equally important in functions such as DCT and IDCT. Their efficient realization is obviously decisive for improving the overall performance. Although these sub-functions are very simple, conventional SIMD architectures often cannot achieve the peak throughput due to the following 4 reasons:
1. lack of memory block load with different data length support;
2. limited support for data shuffling;
3. requirement of carrying out the same operations by all ALUs in SIMD for each SIMD instruction execution; and
4. inability to support cross bank data access in a SIMD register file.
In the VideoLAN 4x4 SATD function, the MOVD instruction is used to load 4 pixel data bytes from memory to the lower double word of a 64-bit MMX register while filling the upper double word with zeros. Two PUNPCKLBW instructions are then used to unpack 8 data bytes from the two lower double words of MMX registers, 4 in each, into the destination register (see Fig. 2). The instructions convert four packed data bytes to four packed words before subtraction. The unpack instructions prevent the execution result from overflowing in subsequent operations. It can be seen that the cycles spent just on memory load and subtraction can exceed 30% of the total execution cycles of the sub-functions. This inefficient SIMD execution can be improved by loading data bytes from memory and extending them to data words before writing the packed words into the register.
Fig. 2. Packed word subtraction from packed byte.
Conventional SIMD architectures often have limited support for data shuffling. As can be seen in TABLE II, the number of instructions for the implementation of matrix transpose can be as high as 22% of the total instructions for computing SATD and other kernel functions. Note that a matrix transpose involves no arithmetic operations but only data shuffles. Most of these instructions are not required if a dedicated hardware construct is provided for data shuffling. In fact, data shuffling is required in many other parts of a video codec, which further justifies the need for an efficient data shuffling unit. Fig. 3 shows the basic operations carried out by the VideoLAN X264 source for implementing the matrix transpose of a 4x4 block. Since there is no dedicated hardware for matrix transpose in MMX, the most efficient way to perform a transpose is to use different unpack instructions, which include PUNPCKLWD, PUNPCKHWD, PUNPCKLDQ and PUNPCKHDQ. It is seen that 8 instructions are required to implement the matrix transpose, most of which are unnecessary if a dedicated data shuffling unit is available. In the actual codes of the VideoLAN X264 source, 12 instructions instead of 8 are used for each 4x4 block matrix transpose. Extra instructions are required to store the temporary results generated in the computation due to the insufficient number of registers.
Fig. 3. Basic operations of 4x4 matrix transpose.
Since most operations in H.264/AVC are performed in block mode, the efficiency of SIMD operations can be significantly improved by having all data in a block loaded into the register before the SIMD operations take place. Assume the bit-width of the registers is large enough such that all data in a block can be loaded into a register. Intuitively we expect that more data can be processed at the same time. However, this is not the case, since very often different data in a block need slightly different operations. More commonly, data of a block need to be combined with other data in the same block. Let us take the computation of the 2-D Hadamard transform in SATD as an example. The transform can be realized by applying a 1-D Hadamard transform to all columns and then all rows of a data block.
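The 8-instruction unpack-based transpose described above can be modelled in software. The following is a minimal sketch, assuming each 64-bit MMX register is represented as a Python list of four 16-bit words with index 0 as the least significant word; the helper names mirror the instruction mnemonics but are only illustrative models, and the exact instruction ordering of Fig. 3 is not reproduced here.

```python
# Software model of a 4x4 transpose built from 8 MMX-style unpack operations.
# Assumption: a register is a list of four words, index 0 = least significant.

def punpcklwd(dst, src):
    """Interleave the two low-order words of dst and src."""
    return [dst[0], src[0], dst[1], src[1]]

def punpckhwd(dst, src):
    """Interleave the two high-order words of dst and src."""
    return [dst[2], src[2], dst[3], src[3]]

def punpckldq(dst, src):
    """Low doubleword (two words) of dst followed by low doubleword of src."""
    return dst[0:2] + src[0:2]

def punpckhdq(dst, src):
    """High doubleword of dst followed by high doubleword of src."""
    return dst[2:4] + src[2:4]

def transpose4x4(rows):
    """Transpose a 4x4 block of words using 4 word unpacks and 4 doubleword unpacks."""
    r0, r1, r2, r3 = rows
    # Stage 1: interleave words of row pairs (0,1) and (2,3).
    t0 = punpcklwd(r0, r1)
    t1 = punpckhwd(r0, r1)
    t2 = punpcklwd(r2, r3)
    t3 = punpckhwd(r2, r3)
    # Stage 2: interleave doublewords to assemble the four columns.
    c0 = punpckldq(t0, t2)
    c1 = punpckhdq(t0, t2)
    c2 = punpckldq(t1, t3)
    c3 = punpckhdq(t1, t3)
    return [c0, c1, c2, c3]

# Example block: element (r, c) holds the value 10*r + c.
block = [[10 * r + c for c in range(4)] for r in range(4)]
transposed = transpose4x4(block)
```

Every one of the 8 operations only moves data, which is why a dedicated shuffling unit can absorb them.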
A length-4 1-D Hadamard transform is defined as:
$$Y = H \cdot X \tag{1}$$
where $X$ is the input 4x4 data block and $Y$ is the transformed output. $H$ is the $4 \times 4$ transform matrix, whose entries are all $\pm 1$.
Its application to the columns of a 4x4 data block can be implemented with the steps as shown in (2)-(7):
$$A(i, j) = X(i, j) + X(i+1, j), \quad i = 0, 2, \; j = 0, \ldots, 3 \tag{2}$$
$$B(i, j) = X(i, j) - X(i-1, j), \quad i = 1, 3, \; j = 0, \ldots, 3 \tag{3}$$
Fig. 4. Basic operations in 1-D Hadamard transforms.
further operations take place. However, traditional memory storages only allow sequential memory access. For this reason, multi-bank or parallel memory structures were proposed to allow multiple data access concurrently [16]-[20]. Similar to the previous approaches, the proposed SIMD architecture is equipped with a 32KB parallel memory structure that serves as a buffer between the external memory and the register file, as shown in Fig. 7. The parallel memory is divided into 16 modules, each of which has a size of 2K bytes and a separate data bus connected to one of the 16 banks of the register file. Each register bank has 32 rows and each element of a register bank can store a 16-bit word (in fact, the register file is constructed from 32 256-bit registers).
Fig. 7. Proposed parallel memory structure. (Blocks shown: External Memory; Parallel Memory Structure, 16 banks x 2048 x 8 bits; Register File & 16 ALUs, 16 banks x 32 x 16 bits.)
Fig. 8 shows the relationship between the logical offset address and the physical address we have defined in the proposed parallel memory structure. In the proposed architecture, the logical address is unique for a memory location and the physical address is the real address generated for every memory module for data access. The data from external memory are interleaved and loaded into the internal parallel memory modules such that, when accessing a data block, only one data item needs to be retrieved from each memory module, no matter where the block is located. Fig. 9 shows how the pixels of an image are stored in different memory modules to facilitate data retrieval of different block sizes. In the figure, the characters grouped inside a dashed square block refer to the 16 memory modules that can be accessed by the same physical address. The numbers 0-9 and letters a-f denote the memory bank number (a-f stand for banks 10-15 respectively). To show that there is no access conflict when loading a block of data from the parallel memory to the register, a few examples are shown in Fig. 9. The characters grouped inside a solid square block refer to the data blocks to be retrieved to the register. It can be seen that for both 4x4 (Fig. 9a) and 8x2 (Fig. 9b) block accesses, one data item is retrieved from each parallel memory module no matter where the block is located (the data loading method for a 2x8 block is similar to that in Fig. 9b except that the memory module assignment is transposed). However, it can be seen that the required data may be stored at different physical addresses (the pixels spread across different dashed boxes). An efficient address generator is needed to determine the required physical address for each memory module.
Fig. 8. Parallel memory logical offset address and physical address.
Fig. 9. Memory interleave to allow block access.
In fact, the data loading from external memory to internal memory modules is performed by a direct memory access (DMA) unit following the mapping functions described below. Let As be the starting address of the part of a video frame to be loaded into the parallel memory and Af be the address of a pixel
within that part of the video frame. Then
$$A_f = A_s + A_{off} \tag{8}$$
where $A_{off}$ is the offset of the address of that pixel from the starting address. Assume that the video frame has the size of $N_x$ columns and $N_y$ rows. Then $A_{off}$ can always be written as
$$A_{off} = y N_x + x \tag{9}$$
for $x = 0$ to $N_x - 1$ and $y = 0$ to $N_y - 1$. Alternatively, the indices $x$ and $y$ can be obtained from $A_{off}$ by:
$$y = \lfloor A_{off} / N_x \rfloor \quad \text{and} \quad x = \langle A_{off} \rangle_{N_x} \tag{10}$$
where $\lfloor \cdot \rfloor$ is the floor function and $\langle a \rangle_b$ stands for $a$ modulo $b$.
Let {m, p} be the module number and the physical address respectively of the parallel memory structure as shown in Fig. 8. By inspecting Fig. 9, the mapping functions that the DMA unit should use for loading the data to the parallel memory are:
For 4x4 block loading,
$$p = \lfloor y/4 \rfloor \cdot N_x/4 + \lfloor x/4 \rfloor \tag{11}$$
$$m = \langle x \rangle_4 + 4 \cdot \langle y \rangle_4 \tag{12}$$
For 8x2 block loading,
$$p = \lfloor y/2 \rfloor \cdot N_x/8 + \lfloor x/8 \rfloor \tag{13}$$
$$m = \langle x \rangle_8 + 8 \cdot \langle y \rangle_2 \tag{14}$$
Similarly, for 2x8 block loading,
$$p = \lfloor y/8 \rfloor \cdot N_x/2 + \lfloor x/2 \rfloor \tag{15}$$
$$m = \langle x \rangle_2 + 2 \cdot \langle y \rangle_8 \tag{16}$$
Once the block size is known, the DMA unit will load the data from external memory to the parallel memory following the respective mapping functions. Then data can be retrieved from the parallel memory to the register efficiently. Assume that a 4x4 block whose first pixel has indices {xs, ys} is to be retrieved. The pixels in the block can be described by {xs+xo, ys+yo}, where xo, yo = 0 to 3. Following from (11),
$$p = \lfloor (y_s + y_o)/4 \rfloor \cdot N_x/4 + \lfloor (x_s + x_o)/4 \rfloor \tag{17}$$
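As an illustration of the mapping functions (11)-(16), the sketch below computes the module number and physical address of a pixel for a block of width `bw` and height `bh`, and checks that a block placed at an arbitrary position touches each of the 16 modules exactly once. The function names, the parametrisation by `bw` and `bh`, and the assumption that the frame width is a multiple of the block width are choices made here for illustration; they are not taken from the hardware design.

```python
# Sketch of the DMA mapping functions for 4x4, 8x2 and 2x8 block loading.
# Assumptions: bw * bh == 16 and the frame width nx is a multiple of bw.

def map_pixel(x, y, nx, bw, bh):
    """Return (m, p): memory module number and physical address of pixel (x, y)."""
    p = (y // bh) * (nx // bw) + (x // bw)   # (11), (13) or (15)
    m = (x % bw) + bw * (y % bh)             # (12), (14) or (16)
    return m, p

def block_modules(xs, ys, nx, bw, bh):
    """Module numbers hit by the bw x bh block whose first pixel is (xs, ys)."""
    return [map_pixel(xs + xo, ys + yo, nx, bw, bh)[0]
            for yo in range(bh) for xo in range(bw)]

def is_conflict_free(xs, ys, nx, bw, bh):
    """True when the block uses every one of the 16 modules exactly once."""
    return sorted(block_modules(xs, ys, nx, bw, bh)) == list(range(16))

# Block shapes as (width, height): 4x4, 8x2 and 2x8.
SHAPES = [(4, 4), (8, 2), (2, 8)]
```

For example, `map_pixel(343, 4, 352, 4, 4)` places that pixel of a 352-pixel-wide frame in module 3 at physical address 173.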
Let $y_s = y_s' + \langle y_s \rangle_4$ and $x_s = x_s' + \langle x_s \rangle_4$ such that $y_s'$ and $x_s'$ are always divisible by 4. (17) can then be written as
$$p = \left( y_s'/4 + \lfloor (\langle y_s \rangle_4 + y_o)/4 \rfloor \right) \cdot N_x/4 + x_s'/4 + \lfloor (\langle x_s \rangle_4 + x_o)/4 \rfloor \tag{18}$$
The two floor functions in (18) can only be equal to 0 or 1. Therefore, for any 4x4 block whose first pixel has indices {xs, ys}, all data of the block are stored in at most 4 different physical addresses in the parallel memory. So if the first data item is stored at ps, the rest cannot be stored at addresses other than {ps+1, ps+Nx/4, ps+Nx/4+1}. More specifically, let
$$q_y = \lfloor (\langle y_s \rangle_4 + y_o)/4 \rfloor \quad \text{and} \quad q_x = \lfloor (\langle x_s \rangle_4 + x_o)/4 \rfloor \tag{19}$$
then
$$p = \begin{cases} p_s & q_y = 0 \text{ and } q_x = 0 \\ p_s + 1 & q_y = 0 \text{ and } q_x = 1 \\ p_s + N_x/4 & q_y = 1 \text{ and } q_x = 0 \\ p_s + N_x/4 + 1 & q_y = 1 \text{ and } q_x = 1 \end{cases} \tag{20}$$
The following further shows that, given only the indices of the first pixel {xs, ys}, we can easily determine the physical address for each module. Again we use the 4x4 block retrieval as an example. From (12), it can be seen that:
$$m = \langle x \rangle_4 + 4 \cdot \langle y \rangle_4 = \langle x_s + x_o \rangle_4 + 4 \cdot \langle y \rangle_4 \;\Rightarrow\; \langle m \rangle_4 = \langle x_s + x_o \rangle_4 \tag{21}$$
$$x_o = \langle \langle m \rangle_4 - \langle x_s \rangle_4 \rangle_4 \tag{22}$$
Note that m = 0 to 15 is the index to the 16 parallel memory modules. Substituting (22) into (21), we have
$$m = \langle m \rangle_4 + 4 \cdot \langle y \rangle_4 \tag{23}$$
$$\langle y \rangle_4 = (m - \langle m \rangle_4)/4 = \lfloor m/4 \rfloor \tag{24}$$
Hence
$$\langle y_s + y_o \rangle_4 = \lfloor m/4 \rfloor \tag{25}$$
$$y_o = \langle \lfloor m/4 \rfloor - \langle y_s \rangle_4 \rangle_4 \tag{26}$$
Substituting (26) and (22) into (19), we can express qx and qy in terms of m, xs and ys as follows:
$$q_y = \lfloor q_y'/4 \rfloor \quad \text{where} \quad q_y' = \langle y_s \rangle_4 + \langle \lfloor m/4 \rfloor - \langle y_s \rangle_4 \rangle_4 \tag{27}$$
$$q_x = \lfloor q_x'/4 \rfloor \quad \text{where} \quad q_x' = \langle x_s \rangle_4 + \langle \langle m \rangle_4 - \langle x_s \rangle_4 \rangle_4 \tag{28}$$
Since qx and qy must be equal to {0, 1}, (20) can be written as:
$$p = \begin{cases} p_s & q_y' < 4 \text{ and } q_x' < 4 \\ p_s + 1 & q_y' < 4 \text{ and } q_x' \geq 4 \\ p_s + N_x/4 & q_y' \geq 4 \text{ and } q_x' < 4 \\ p_s + N_x/4 + 1 & q_y' \geq 4 \text{ and } q_x' \geq 4 \end{cases} \tag{29}$$
The evaluation of qy’ and qx’ is very simple. The modulo 4 function $\langle \cdot \rangle_4$ can be implemented by extracting the last 2 bits of the number. $\lfloor m/4 \rfloor$ can be implemented by shifting m to the right by 2 bits. The addition and subtraction can be implemented by a small adder. Finally, the comparison of qy’ and qx’ with the constant 4 can be implemented by checking the output carry bit of the small adder. All the above can be implemented with fewer than 20 logic gates, which is negligible as far as a video processor is concerned. The above derivation can also be applied to 2x8 and 8x2 block retrieval:
a. 8x2 block access
$$q_y = \lfloor (\langle y_s \rangle_2 + \langle \lfloor m/8 \rfloor - \langle y_s \rangle_2 \rangle_2)/2 \rfloor = \lfloor q_y'/2 \rfloor \quad \text{and} \tag{30}$$
$$q_x = \lfloor (\langle x_s \rangle_8 + \langle \langle m \rangle_8 - \langle x_s \rangle_8 \rangle_8)/8 \rfloor = \lfloor q_x'/8 \rfloor \tag{31}$$
$$p = \begin{cases} p_s & q_y' < 2 \text{ and } q_x' < 8 \\ p_s + 1 & q_y' < 2 \text{ and } q_x' \geq 8 \\ p_s + N_x/8 & q_y' \geq 2 \text{ and } q_x' < 8 \\ p_s + N_x/8 + 1 & q_y' \geq 2 \text{ and } q_x' \geq 8 \end{cases} \tag{32}$$
b. 2x8 block access
$$q_y = \lfloor (\langle y_s \rangle_8 + \langle \lfloor m/2 \rfloor - \langle y_s \rangle_8 \rangle_8)/8 \rfloor = \lfloor q_y'/8 \rfloor \quad \text{and} \tag{33}$$
$$q_x = \lfloor (\langle x_s \rangle_2 + \langle \langle m \rangle_2 - \langle x_s \rangle_2 \rangle_2)/2 \rfloor = \lfloor q_x'/2 \rfloor \tag{34}$$
$$p = \begin{cases} p_s & q_y' < 8 \text{ and } q_x' < 2 \\ p_s + 1 & q_y' < 8 \text{ and } q_x' \geq 2 \\ p_s + N_x/2 & q_y' \geq 8 \text{ and } q_x' < 2 \\ p_s + N_x/2 + 1 & q_y' \geq 8 \text{ and } q_x' \geq 2 \end{cases} \tag{35}$$
For actual implementation, each parallel memory module is installed with an address generation unit as shown in Fig. 10 for the implementation of (29), (32), or (35) based on the selected block size. An additional address generation unit is responsible for pre-computing the 4 possible physical addresses in each case. A 4-to-1 multiplexer is installed in each address generation unit for selecting one of the supplied addresses based on the results of (29), (32), or (35).
Fig. 10. Address generation for parallel memory structure. (Note: bs is the block size; refer to (36) for the definition of ds and dsb.)
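The per-module address selection of (27)-(29), and its 8x2 and 2x8 counterparts, can be sketched in software as follows. This is only a behavioural model under stated assumptions: the function name and parameters are hypothetical, the circular-buffer wrap is modelled as a simple modulo on the buffer depth, and the starting coordinates used in the usage note are inferred here from the address lists quoted for Fig. 11 rather than stated in the text.

```python
# Behavioural sketch of the per-module address generation units.
# Assumptions: bw * bh == 16, nx is a multiple of bw, and wrap-around is a
# plain modulo on the number of physical addresses per module (depth).

def module_addresses(xs, ys, nx, bw=4, bh=4, depth=None):
    """Physical address for each of the 16 modules when fetching the
    bw x bh block whose first pixel is (xs, ys)."""
    stride = nx // bw                        # address step to the block row below
    ps = (ys // bh) * stride + (xs // bw)    # address of the first pixel
    addresses = []
    for m in range(bw * bh):
        qx_p = xs % bw + (m % bw - xs % bw) % bw     # q_x' as in (28)
        qy_p = ys % bh + (m // bw - ys % bh) % bh    # q_y' as in (27)
        p = ps
        if qy_p >= bh:       # pixel lies in the next block row, as in (29)
            p += stride
        if qx_p >= bw:       # pixel lies in the next block column
            p += 1
        if depth is not None:
            p %= depth       # wrap back to the beginning of the buffer
        addresses.append(p)
    return addresses
```

With `nx = 352`, calling `module_addresses(343, 2, 352)` reproduces the list 174, 174, 174, 173, ..., 86, 86, 86, 85 quoted for the upper block of Fig. 11, and `module_addresses(11, 37, 352, depth=880)` reproduces the wrapped list 3, 3, 3, 2, 795, ..., 794 quoted for the lower block.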
Fig. 11 shows an example of loading 352x40 pixels of a CIF image from external memory into 16 internal memory modules. Again we assume the 4x4 block access is selected hence the way of data loading follows (11) and (12). The numbers in the figure are the physical addresses. To load the 16 pixels of the upper 4x4 block in Fig. 11, physical addresses 174, 174, 174, 173, 174, 174, 174, 173, 86, 86, 86, 85, 86, 86, 86, 85 are generated by the address generation units installed with memory module 0 to 15, respectively, following (27), (28), and (29). The lower 4x4 box in Fig. 11 depicts the actual implementation when the block access crosses the last row (assume the last physical address in memory module is 879). It can be seen that the data access are “wrapped” back to the beginning of the buffer. The physical addresses generated for each memory module in this case are 3, 3, 3, 2, 795, 795, 795, 794, 795, 795, 795, 794, 795, 795, 795, 794. It is known that the dominant word lengths in video
look-up table named the configurable SIMD look-up table (CSLUT) is introduced. The CSLUT is made up of 5 memory modules. Their logical and physical addresses as well as their structure are shown in Fig. 13. There are three major kinds of configuration data in the table. The 80-bit row address configuration data and the 64-bit bank configuration data specify the register row addresses and register bank numbers, respectively, of the 16 operands to be retrieved. The mux control unit takes in the 64-bit bank configuration data from the CSLUT so that each operand of an ALU can be retrieved from any bank. The RF address control unit takes in the 80-bit row address configuration data from the CSLUT and generates 16 row addresses, one for each RF bank. If only the bank configuration data in the CSLUT is used, 16 operands on the same row are retrieved from the banks specified in the bank configuration data. If only the row address configuration data in the CSLUT is used, each ALU in the SIMD unit takes one operand, at the row address specified in the row configuration data, from its own RF bank. Using both row and bank configuration data, the operand at any row address can be retrieved from any bank to achieve random RF access.
Fig. 13. CSLUT for configurable SIMD instruction.
Besides near random RF access, the proposed configurable SIMD also provides MIMD-like execution support, i.e. it allows a minor difference in operation among the ALUs. To accommodate this, a 16-bit miscellaneous column is introduced into the CSLUT to indicate the slightly different operations to be performed among the ALUs. For example, we can use this column to define whether an addition or a subtraction is to be performed by each ALU in a SIMD processor. This is useful in many fast transform algorithms, including the Hadamard transform used in the 4x4 SATD function, as will be shown in the next section. To access the table, a set of so-called CSIMD (Configurable SIMD) instructions is provided in the instruction set. These instructions have a particular field to store the address for accessing a particular entry in the CSLUT. The formats of typical and CSIMD instructions are shown in Fig. 14 for comparison. In the figure, CMD is the instruction opcode. The MISC field specifies execution controls such as operand shift bits, zero or sign extension options, etc.
Fig. 14. Instruction fields of a typical and CSIMD instruction.
For typical instructions, the register row addresses of the two sources and the destination are specified by RS1, RS2 and RD respectively. All ALUs get their two operands at row addresses RS1 and RS2 from their own register banks and write the execution result to row address RD of their own register banks. That is, typical instructions allow neither cross register bank data access nor different row data access. For CSIMD instructions, the CSLUT address is specified in the CSLUT_ADDR field. Each ALU can get one of its operands at any row address of any register bank specified in the CSLUT. Furthermore, slightly different operations are allowed to be executed among the ALUs as mentioned above. These features provide great flexibility that fully addresses the problems of traditional SIMD architectures as mentioned above.
A methodology similar to configurable SIMD is also proposed in a patent application [25]. Compared to this patent, the proposed configurable SIMD has additional advantages. First of all, the configuration data in the patent is not stored in the SRAM based LUT but programmable logic array (PLA). Hence the extent of reconfiguration is limited. That is also why the patent design needs extra data called Pseudo Static Control Information (PSCI), in addition to configuration data retrieved in instruction field, to generate the reconfiguration data. The PSCI dictates the aspects of the functionality and behavior of the execution unit and crossbar interconnect. It cannot be dynamically reconfigured in cycle basis via instruction. Instead, a dedicated PSCI-setting instruction is used to update the PSCI data from time to time. On the other hand, the proposed configurable SIMD uses SRAM as configuration data storage which allows much larger extent of reconfiguration. The reconfiguration can be dynamically done in cycle basis by getting the look-up table entry address from the instruction.
Fig. 15. 4x4 1-D Hadamard transform using only two CSIMD instructions.
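To make the CSIMD execution model concrete, the following is a simplified software sketch. It assumes a register file of 32 rows by 16 banks, a CSLUT entry consisting of 16 row addresses (5 bits each, 80 bits in total), 16 bank numbers (4 bits each, 64 bits in total) and a 16-bit misc word in which bit k set to 1 selects addition and 0 selects subtraction for ALU k. The data layout (element X(i, j) in bank 4i+j), the function names, and the pairing used in the second butterfly stage are assumptions made here for illustration; they are not the contents of TABLE III.

```python
# Simplified model of a CSIMD add/sub instruction and of a 4-point Hadamard
# transform on the columns of a 4x4 block using two such instructions.
ROWS, BANKS = 32, 16

def csimd(rf, own_row, dst_row, entry):
    """Each ALU combines the operand in its own bank (row own_row) with an
    operand fetched from the (row, bank) given in the CSLUT entry."""
    rows, banks, misc = entry
    result = []
    for alu in range(BANKS):
        own = rf[own_row][alu]
        other = rf[rows[alu]][banks[alu]]      # cross-bank, any-row access
        is_add = (misc >> alu) & 1             # per-ALU operation select
        result.append(own + other if is_add else own - other)
    rf[dst_row] = result

def hadamard_stage_entry(row, step):
    """Hypothetical CSLUT entry for one butterfly stage over block rows
    that are `step` apart (assumed layout: X(i, j) in bank 4*i + j)."""
    rows = [row] * BANKS
    banks, misc = [], 0
    for alu in range(BANKS):
        i, j = divmod(alu, 4)
        adds = (i // step) % 2 == 0            # first member of the pair adds
        partner = i + step if adds else i - step
        banks.append(4 * partner + j)
        if adds:
            misc |= 1 << alu
    return rows, banks, misc

# Usage: block in row 5, stage 1 into row 6, stage 2 into row 7.
X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
rf = [[0] * BANKS for _ in range(ROWS)]
rf[5] = [X[i][j] for i in range(4) for j in range(4)]
csimd(rf, 5, 6, hadamard_stage_entry(5, 1))    # implements (2) and (3)
csimd(rf, 6, 7, hadamard_stage_entry(6, 2))    # assumed second stage
Y = [rf[7][4 * i:4 * i + 4] for i in range(4)]
```

Under these assumptions the first instruction realizes (2) and (3) exactly, and the two instructions together apply a matrix with mutually orthogonal rows of ±1 entries to every column; the row order and signs may differ from the matrix H of (1), which does not affect a sum of absolute values.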
Besides, the crossbar in the patent design is controlled totally by PSCI data. It only provides shuffling on operands read from the register file location specified by the source operand address instruction field. It cannot allow random register file access as the proposed approach does. In the following subsections, we demonstrate how the implementation of H.264/AVC kernel functions is made simple by the new CSIMD structure. We particularly use SATD and fractional motion estimation as examples, although similar improvement can also be achieved in other kernel functions such as intra prediction. To simplify our discussion, we assume that video data are accessed in the form of 4x4 blocks. Operations involving larger data blocks are composed by combining the results of the constituent 4x4 blocks.
1) SATD Computation: A SATD computation consists of data load, subtraction, 2-D 4x4 Hadamard transform, matrix transposes, taking the absolute value of the transformed data and summation. Let us first consider the realization of the 2-D 4x4 Hadamard transform. As discussed above, a 2-D 4x4 Hadamard transform can be implemented by four length-4 1-D Hadamard transforms applied to the rows, followed by another four applied to the columns. Fig. 4 shows that at least 8 instructions are needed to perform each set of four 1-D Hadamard transforms using the MMX instruction set due to insufficient register bit-width. We have also shown in Fig. 5 that even if we have the resources to install registers with sufficient bit-width such that all data of a block can be loaded into a register, we still cannot easily implement the 1-D Hadamard transforms using SIMD instructions, since different operations are performed in different register banks and they may require operands from different register banks.
For the proposed SIMD architecture, only 2 CSIMD instructions are needed to realize each set of four 1-D Hadamard transforms, as shown in Fig. 15. Before execution, the 4x4 input data is placed in a 256-bit SIMD register in, say, row 5. Each CSIMD instruction takes one operand from its own RF bank and one operand from another bank to perform either addition or subtraction. For example, in the first CSIMD instruction, ALU0 (the rightmost one) takes data X00 from its own bank and data X10 in bank 4 of row 5 to perform addition. ALU4 (the fifth one from the right) takes data X10 from its own bank and data X00 in bank 0 of row 5 to perform subtraction. All configuration information is specified in the row, bank and misc memory contents of the CSLUT. The misc configuration in the CSLUT specifies whether addition (e.g. “1”) or subtraction (e.g. “0”) is performed. The complete configuration data in the CSLUT for each set of four length-4 1-D Hadamard transforms is shown in TABLE III. This feature provides great flexibility in program design and in turn reduces the number of SIMD instructions in the program.

TABLE III
ROW, BANK AND MISC CONFIGURATION IN CSLUT FOR THE IMPLEMENTATION OF THE HADAMARD TRANSFORMS

        First CSIMD          Second CSIMD
ALU     ROW  BANK  MISC      ROW  BANK  MISC
f       5    b     0         6    7     0
e       5    a     0         6    6     0
d       5    9     0         6    5     0
c       5    8     0         6    4     0
b       5    f     1         6    3     0
a       5    e     1         6    2     0
9       5    d     1         6    1     0
8       5    c     1         6    0     0
7       5    3     0         6    f     1
6       5    2     0         6    e     1
5       5    1     0         6    d     1
4       5    0     0         6    c     1
3       5    7     1         6    b     1
2       5    6     1         6    a     1
1       5    5     1         6    9     1
0       5    4     1         6    8     1

The full crossbar switch also greatly enhances the performance of matrix transpose. Referring to TABLE II, the instruction counts to perform a 4x4 block transpose in VideoLAN X264 are 12 and 9 when using MMX and SSE2 respectively. With CSIMD, a 4x4 block transpose can be carried out in one clock cycle, since it is one of the shuffling operations supported by the full crossbar switch. In fact, in the actual implementation of the SATD function, the matrix transpose is embedded into the second set of 1-D Hadamard transforms, so no dedicated CSIMD instruction is needed for the transpose. Overall, the proposed SIMD architecture takes only 4 instructions to perform the first two steps of a 4x4 SATD function (from memory load to 2-D Hadamard transform), while VideoLAN X264 takes 56 instructions to do the same. The proposed CSIMD structure can also greatly benefit the implementation of a few other similar functions of H.264/AVC and MPEG4, such as 4x4 IDCT/DCT and 4x4 matrix multiplication. In both cases, the proposed SIMD architecture takes only 2 instructions to finish.

2) Efficient H.264/AVC fractional motion estimation: Computing the fractional motion estimation for a 4x4 block requires a maximum of 10x10 integer pixels, which can be loaded into 9 register rows, with row addresses 1 to 9 respectively, as shown in Fig. 16. The square boxes in the figure represent the integer pixels, and the number inside each square box is the register bank where the pixel data is stored. The 6-tap filtering operation for integer to half interpolation is done by six multiply-and-accumulate (MAC) instructions. In Fig. 16 (upper left-hand side), the solid line triangle half pixel c is generated by multiplying the integer pixels in row 2 of banks 4, 8 and c, and the integer pixels in row 9 of banks 0, 4 and 8, with the 6 filter taps and summing the results. It can be seen that the operations require nearly random access to different rows of different register banks. For instance, the circle quarter pixel 0 is interpolated from half pixels in row 10 of bank 0 and in row 0 of bank c, while the quarter pixel 5 is interpolated from half pixels in row 0 of bank 1 and row 10 of bank 5.

[Fig. 16. Subpel interpolation by CSIMD.]

TABLE IV
REGISTER ROW NUMBER AND BANK INFORMATION IN CSLUT FOR SUBPEL INTERPOLATION.

Integer to Half (dotted)
ALU   B R   B R   B R   B R   B R   B R
0     1 4   2 4   3 4   0 9   1 9   2 9
1     2 4   3 4   0 9   1 9   2 9   3 9
2     3 4   0 9   1 9   2 9   3 9   0 5
3     0 9   1 9   2 9   3 9   0 5   1 5
4     5 4   6 4   7 4   4 9   5 9   6 9
5     6 4   7 4   4 9   5 9   6 9   7 9
6     7 4   4 9   5 9   6 9   7 9   4 5
7     4 9   5 9   6 9   4 5   5 5   6 5
8     9 4   a 4   b 4   8 9   9 9   a 9
9     a 4   b 4   8 9   9 9   a 9   b 9
a     b 4   8 9   9 9   a 9   b 9   8 5
b     8 9   9 9   a 9   b 9   8 5   9 5
c     d 4   e 4   f 4   c 9   d 9   e 9
d     e 4   f 4   c 9   d 9   e 9   f 9
e     f 4   c 9   d 9   e 9   f 9   c 5
f     c 9   d 9   e 9   f 9   c 5   d 5

Integer to Half (solid)
ALU   B R   B R   B R   B R   B R   B R
0     8 2   c 2   0 9   4 9   8 9   c 9
1     9 2   d 2   1 9   5 9   9 9   d 9
2     a 2   e 2   2 9   6 9   a 9   e 9
3     b 2   f 2   3 9   7 9   b 9   f 9
4     c 2   0 9   4 9   8 9   c 9   0 7
5     d 2   1 9   5 9   9 9   d 9   1 7
6     e 2   2 9   6 9   a 9   e 9   2 7
7     f 2   3 9   7 9   b 9   f 9   3 7
8     0 9   4 9   8 9   c 9   0 7   4 7
9     1 9   5 9   9 9   d 9   1 7   5 7
a     2 9   6 9   a 9   e 9   2 7   6 7
b     3 9   7 9   b 9   f 9   3 7   7 7
c     4 2   8 2   c 2   0 9   4 9   8 9
d     5 2   9 2   d 2   1 9   5 9   9 9
e     6 2   a 2   e 2   2 9   6 9   a 9
f     7 2   b 2   f 2   3 9   7 9   b 9

Half to Quarter
ALU   B R   B R
0     0 0   c a
1     1 0   d a
2     2 0   e a
3     3 0   f a
4     4 0   0 a
5     5 0   1 a
6     6 0   2 a
7     7 0   3 a
8     8 0   4 a
9     9 0   5 a
a     a 0   6 a
b     b 0   7 a
c     c 0   8 a
d     d 0   9 a
e     e 0   a a
f     f 0   b a

The lower part of Fig. 16 shows how the half and/or quarter pixels are retrieved from any bank of any row before execution. Note that, for clarity, the multipliers and adders in the figure only show the operations required for the interpolation; they do not represent the real hardware. Also, the second operand of each multiplier, the FIR tap, is not shown in the figure. As mentioned, the integer to half interpolation is performed by 6 MAC instructions. Each MAC takes one operand from the location specified by the CSLUT table, multiplies it by a filter tap, and adds the product to the previous MAC results. While such random register access would introduce much difficulty in traditional SIMD execution, the proposed CSIMD structure handles it easily with the use of the CSLUT table and the crossbar switch. TABLE IV shows the related information stored in the CSLUT table for the interpolation of the solid line and dotted line triangle half pixels as well as the circle quarter pixels in Fig. 16. The “B” and “R” columns refer to the register bank and the row number of the pixels to be retrieved and sent to the ALU to perform one MAC operation. For each quarter pixel interpolation, a CSIMD instruction is issued and the required entries in the CSLUT table are retrieved. The related register access information is sent to the register file and the crossbar switch.
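The MAC sequence for one ALU can be sketched in Python as follows. The (bank, row) operands are those listed in TABLE IV for ALU c and the solid line half pixel. The register file model `rf[row][bank]`, the function names, and the pairing of tap order with entry order are assumptions for illustration. The filter taps (1, -5, 20, 20, -5, 1), the rounding and clipping of the half pixel, and the rounded average for the quarter pixel follow the H.264/AVC standard rather than anything stated in this section.

```python
# Sketch only: rf[row][bank] is a hypothetical model of the SIMD register file.
TAPS = (1, -5, 20, 20, -5, 1)  # H.264/AVC 6-tap half-pixel filter

# (bank, row) operands of ALU c for the solid line half pixel, from TABLE IV.
ALU_C_SOLID = [(0x4, 2), (0x8, 2), (0xC, 2), (0x0, 9), (0x4, 9), (0x8, 9)]


def half_pel(rf, entries):
    """Six MAC steps: each fetches one operand from the (bank, row) location
    named by a look-up entry and accumulates tap * pixel."""
    acc = 0
    for (bank, row), tap in zip(entries, TAPS):
        acc += tap * rf[row][bank]
    # Round, scale by 1/32 and clip to the 8-bit pixel range.
    return min(255, max(0, (acc + 16) >> 5))


def quarter_pel(a, b):
    """Rounded average of two neighbouring integer/half pixels."""
    return (a + b + 1) >> 1


# Example register file: 12 rows of 16 banks, all zero except the six operands.
rf = {row: [0] * 16 for row in range(12)}
rf[2][0x4], rf[2][0x8], rf[2][0xC] = 10, 20, 30
rf[9][0x0], rf[9][0x4], rf[9][0x8] = 40, 50, 60

h = half_pel(rf, ALU_C_SOLID)
print(h, quarter_pel(h, 100))
```

Because the filter is symmetric, reversing the order in which the six entries are paired with the taps does not change the result.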
With the help of the crossbar switch, one of the operands required in the interpolation can be obtained from any row of any register bank. The whole
fractional motion estimation can be evaluated efficiently without extra memory loads and stores, and without the redundant packing and unpacking operations.

V. EXPERIMENTAL RESULTS

Extensive simulations have been performed to evaluate the performance of the proposed CSIMD architecture in two aspects: memory accesses and cycle counts for computing major H.264 kernel functions. To evaluate the performance in memory accesses, two Baseline Profile C models were used in our experiments for comparison. One is the Optimized JM Encoder, which is optimized from the JM7.4 reference model [27] by removing Main Profile features, dynamic memory allocation and release, and rate-distortion optimization. The other is our CSIMD H.264 Encoder, which is based on the Optimized JM Encoder and further enhanced with all the features proposed in this paper, namely the advanced parallel memory structure with variable block size and word length support, and the CSIMD structure that allows nearly random register access. We use the number of memory accesses as a yardstick for performance evaluation because it directly affects, to a large extent, the overall computation time. The proposed CSIMD H.264 Encoder is equipped with a 16-module parallel memory structure plus efficient address generation units. The memory accesses here refer to accesses to the parallel memory. Note that for the proposed CSIMD H.264 Encoder, data access to external memory is achieved using a hardware DMA unit, as in other traditional parallel memory systems.

Based on the above, the numbers of memory accesses required by the two models for computing integer and fractional motion estimation (IME and FME) are evaluated. The motion estimation is done on a CIF resolution image. TABLE V shows the results obtained in the simulation. In the table, LS and VLS stand for the numbers of load/store and vector load/store instructions, respectively.
Since our algorithm uses a bottom-up approach, the vector LS in the CSIMD Encoder mainly refers to 4x4 block loads or stores in our simulation. Note that other block sizes, such as 2x8 or 8x2, can also be easily implemented using the proposed parallel memory structure and the address generation unit. The number of instructions required by the Optimized JM Encoder for loading or storing a 4x4 block varies from block to block, depending on whether the block is aligned in memory.

TABLE V
MEMORY ACCESS AND INSTRUCTION COUNT REDUCTION.

        Optimized JM                CSIMD Encoder
        LS           Instr. Cnt     LS        VLS       LS+(16*VLS)   Instr. Cnt
IME     10,565,010   35,446,064     476,452   507,295   8,593,172     3,668,072
FME     23,956,652   96,534,062     166,954   102,155   1,801,434       789,070

Since one vector LS instruction can replace at most 16 scalar LS instructions, if the Optimized JM Encoder and the CSIMD Encoder differed only in the parallel memory structure, the number of scalar LS instructions required by the Optimized JM Encoder (Opt.JM_LS) should be close to
the sum of the scalar LS instructions (CSIMD_LS) and 16 times the vector LS instructions (CSIMDvec_LS) required by the CSIMD Encoder. However, it can be seen in TABLE V that

\[ \text{Opt.JM\_LS} \gg \text{CSIMD\_LS} + 16 \times \text{CSIMDvec\_LS} \tag{37} \]
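The comparison in (37) can be checked directly against the figures in TABLE V. The following Python sketch recomputes the scalar-equivalent load/store count of the CSIMD Encoder and its ratio to the Optimized JM count; the dictionary keys and the function name are illustrative only.

```python
# Figures copied from TABLE V; names are illustrative.
table_v = {
    "IME": {"opt_jm_ls": 10_565_010, "csimd_ls": 476_452, "csimd_vls": 507_295},
    "FME": {"opt_jm_ls": 23_956_652, "csimd_ls": 166_954, "csimd_vls": 102_155},
}


def scalar_equivalent(row):
    """Right-hand side of (37): every vector LS counted as 16 scalar LS."""
    return row["csimd_ls"] + 16 * row["csimd_vls"]


ratios = {}
for name, row in table_v.items():
    equiv = scalar_equivalent(row)
    ratios[name] = row["opt_jm_ls"] / equiv
    print(f"{name}: Opt.JM_LS = {row['opt_jm_ls']}, "
          f"LS + 16*VLS = {equiv}, ratio = {ratios[name]:.2f}")
```

The left-hand side exceeds the right-hand side by a factor of roughly 1.2 for IME and roughly 13.3 for FME, so the gap is much wider for fractional motion estimation.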
This is particularly true in fractional motion estimation. It shows that while the parallel memory structure helps to reduce memory accesses, the other features of the CSIMD Encoder, in particular the “random” register access feature, give a further saving in memory accesses. In fact, when the proposed SIMD architecture is used for computing motion estimation, less than 10% of the SIMD instructions in integer motion estimation are CSIMD instructions, while more than 90% of the SIMD instructions in fractional ME are CSIMD instructions. This explains why the improvement for fractional ME is so significant. As a result, the total number of memory accesses for integer motion estimation is reduced by ~10.7 times, and that for fractional motion estimation is reduced by ~89.0 times. The table also shows that the total instruction counts for integer and fractional motion estimation are reduced by ~9.7 and ~122.4 times respectively, compared with the Optimized JM Encoder.

To give an idea of how the proposed SIMD architecture compares with the state-of-the-art SSE/MMX SIMD architecture, the execution cycles to perform 4x4 SATD and IDCT/DCT using the proposed CSIMD Encoder model and VideoLAN X264 are estimated. We developed a cycle-accurate performance simulator to emulate our CSIMD Encoder. Since there is no VideoLAN X264 performance simulator, we modified our performance simulator to emulate the SSE/MMX instructions in VideoLAN. TABLE VI shows the speedup of the proposed CSIMD Encoder over VideoLAN X264 for the computation of SATD and IDCT/DCT with different block sizes. An improvement of 2.1X to 4.6X can be achieved. Note that the speedup of SATD for block size 4x8 is exceptionally high. This is because Intel’s SSE/MMX does not support the stride load that would allow one row of 4 pixels from each of the upper and lower 4x4 blocks inside the 4x8 block to be loaded into one SSE register in VideoLAN. Hence the two 4x4 blocks can only be processed separately in MMX registers.

TABLE VI
EXECUTION CYCLES SPEEDUP VERSUS VIDEOLAN X264.

Function   Type       Block Size   Speedup
SATD4x4               4x4          2.9
SATD4x4               4x8          4.6
SATD4x4               8x4          2.6
SATD4x4               8x8          2.4
SATD4x4               8x16         2.5
SATD4x4               16x8         2.5
SATD4x4               16x16        2.3
DCT4x4     DC         4x4          2.7
DCT4x4     Residual   4x4          2.6
DCT4x4     Residual   8x8          2.4
DCT4x4     Residual   16x16        2.7
IDCT4x4    DC         4x4          2.1
IDCT4x4    Residual   4x4          3.5
IDCT4x4    Residual   8x8          3.3
IDCT4x4    Residual   16x16        3.7

TABLE VII further shows the simulation results when computing SATD and IDCT/DCT in an H.264 encoding process. In this simulation, one second of the video sequence Stefan (25 frames, 1I+24P) with CIF resolution was used. The table shows the cycle count reduction achieved by the proposed CSIMD Encoder model as compared with the SIMD implementation using the VideoLAN X264 source code. It can be seen that more than 60% of the execution cycles can be saved using the proposed CSIMD Encoder model. All the improvements mentioned above stem from the advanced parallel memory and CSIMD structures.

TABLE VII
CYCLE COUNT REDUCTION FOR IMPLEMENTING SOME H.264 KERNEL FUNCTIONS WHEN ENCODING 1 SECOND OF CIF SEQUENCE.

              Times / Frame
Function      I Frame   P Frame   Cycle Count Reduction / Second (percentage)
SATD          75,655    70,916    95,992,506 (65.6%)
DCT4x4RES     11,120     6,480     6,498,960 (61.9%)
IDCT4x4RES     4,952     1,694     2,645,264 (71.6%)

Based on Amdahl’s Law [28], we can project the speedup of the entire H.264/AVC encoding application from the kernel function speedups obtained by adopting the proposed parallel memory structure and configurable SIMD feature in a conventional SIMD architecture. Let \(T\) be the execution time (measured in execution cycles) of the original H.264/AVC encoding application, \(T_{ker}\) be the execution time of the kernel function, and \(T_{csimd}\) be the execution time of the kernel function performed by our CSIMD Encoder. Amdahl’s Law states that the overall speedup of the application \(S\) is:

\[ S = \frac{T}{T - T_{ker} + T_{csimd}} = \frac{1}{1 - \alpha + \alpha / s} \tag{38} \]

where \(\alpha = T_{ker} / T\) is the proportion of the kernel function in the entire application and \(s = T_{ker} / T_{csimd}\) is the speedup of the kernel function execution obtained with our proposed features. The overall application speedup is easily extended to multiple kernel functions as follows:

\[ S = \frac{1}{1 - \sum_i \alpha_i + \sum_i \alpha_i / s_i} \tag{39} \]

Several kernel functions are taken into account in our calculation. They include integer motion estimation (IME), fractional motion estimation (FME), SATD, DCT and IDCT. TABLE VIII shows the speedups of the kernel functions and their corresponding proportions in the application, based on our profiling results.
The speedups of IME and FME mainly come from the instruction count reductions shown in TABLE V, which are 9.7 and 122.4 respectively. It should be noted that SATD in TABLE VIII refers only to inter mode decision and not to motion estimation, because the SATD speedup in motion estimation is already accounted for in the FME speedup. The speedups of SATD, DCT and IDCT are taken from TABLE VI. According to equation (39), the overall speedup of the H.264/AVC encoding application is 2.46X.

TABLE VIII
PROPORTION AND SPEEDUP OF KERNEL FUNCTIONS.

Kernel   Proportion (%)   Speedup
IME      12               9.7
FME      33               122.4
SATD     7                2.9
DCT      7                2.7
IDCT     13               2.1
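The projection can be reproduced from the values in TABLE VIII with a few lines of Python. This is a minimal sketch of equation (39); the function and variable names are illustrative, and proportions are expressed as fractions of the total execution time.

```python
def amdahl_speedup(kernels):
    """Overall speedup for (proportion, speedup) pairs, following equation (39)."""
    kernels = list(kernels)
    accelerated = sum(alpha for alpha, _ in kernels)      # sum of alpha_i
    remaining = sum(alpha / s for alpha, s in kernels)    # sum of alpha_i / s_i
    return 1.0 / (1.0 - accelerated + remaining)


# (proportion, speedup) per kernel, copied from TABLE VIII.
table_viii = {
    "IME": (0.12, 9.7),
    "FME": (0.33, 122.4),
    "SATD": (0.07, 2.9),
    "DCT": (0.07, 2.7),
    "IDCT": (0.13, 2.1),
}

overall = amdahl_speedup(table_viii.values())
print(f"overall speedup = {overall:.2f}X")  # prints "overall speedup = 2.46X"
```

The five kernels cover 72% of the execution time, so the remaining 28% bounds the achievable speedup at about 3.6X no matter how fast the kernels become.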
Besides video coding functions, the new SIMD architecture is generic and flexible enough to be useful in many other image and video applications. To illustrate this, we have
applied the proposed SIMD architecture to the implementation of several general video and image processing functions (e.g. de-interlacing, scaling, transform, color space conversion, etc.). Owing to the flexibility of the proposed parallel memory structure, image and video applications of different block sizes and word lengths can be supported, and by redefining the CSLUT table entries, these applications can be realized efficiently using the CSIMD instructions. TABLE IX shows the numbers of predefined entries in the CSLUT table for the implementation of the major kernel functions in each application. For the 6 applications listed, only 689 entries are required in total. This shows that the memory required for storing the CSLUT table is insignificant as far as a general purpose video/image processor is concerned. As such, the proposed SIMD architecture can support multiple video applications well by simply using different entries of the CSLUT table for different applications. The proposed features increase the area of the video processor by no more than 5% of the total area: the CSIMD LUT contributes about 4%, while the crossbar switch and the CSIMD control contribute 0.23% and 0.6% respectively.
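The entry count and the area figure quoted above can be tallied from the values in TABLE IX and the three area contributions. The short Python sketch below does so; the variable names are illustrative, and the 4% LUT figure is treated as exact although the text gives it only approximately.

```python
# Entries per kernel function for the six applications, copied from TABLE IX
# (order: H.264/AVC Encoder, H.264/AVC Decoder, AVS-M Decoder, AVS Decoder,
#  MPEG4 Decoder, Image Processor).
table_ix = {
    "Fractional Interpolation": [147, 88, 62, 56, 32, 0],
    "Data Shuffle": [82, 52, 50, 27, 28, 21],
    "SATD": [8, 0, 0, 0, 0, 0],
    "Transform": [8, 8, 4, 4, 4, 8],
}

total_entries = sum(sum(counts) for counts in table_ix.values())
per_application = [sum(col) for col in zip(*table_ix.values())]

# Area overhead in percent: CSIMD LUT (about 4), crossbar switch, CSIMD control.
area_overhead = 4.0 + 0.23 + 0.6

print(total_entries, per_application, round(area_overhead, 2))
```

The H.264/AVC encoder accounts for the largest share of the entries (245 of 689), most of them for fractional interpolation.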
TABLE IX
NUMBER OF CSLUT CONFIGURATION ENTRIES FOR DIFFERENT IMAGE AND VIDEO APPLICATIONS.

Video Application    Fractional Interpolation   Data Shuffle   SATD   Transform
H.264/AVC Encoder    147                        82             8      8
H.264/AVC Decoder    88                         52             0      8
AVS-M Decoder        62                         50             0      4
AVS Decoder          56                         27             0      4
MPEG4 Decoder        32                         28             0      4
Image Processor      0                          21             0      8

VI. CONCLUSION

In this paper, we have proposed a novel SIMD architecture with two new features, namely a parallel memory structure with variable block size and word length support, and a configurable SIMD (CSIMD) structure using a look-up table. When applied to block-based image or video applications, the proposed parallel memory structure provides extra flexibility in supporting data access with multiple block sizes and multiple word lengths by changing only a few parameters in the address generation units. The hardware complexity of implementing these address generation units is negligible as far as a general purpose image and video processor is concerned. By using the proposed parallel memory structure, a vector of 16 bytes, words or double words can be retrieved from (or stored to) the memory in 1, 2 or 4 cycles respectively. On the other hand, the proposed CSIMD structure allows nearly “random” data access to SIMD registers by means of a crossbar switch. Programmers can specify the row number and the register bank to be accessed in the CSLUT table, which we have shown to require only a small amount of internal memory for its implementation. Programmers can also use the CSLUT table to define slightly different operations among the ALUs. With these features, the SIMD performance when implementing matrix transpose, DCT/IDCT transform and SATD can be significantly improved. The H.264/AVC fractional motion estimation can also be implemented efficiently, and the number of memory accesses can be greatly reduced. In fact, the proposed CSIMD structure can also greatly benefit the implementation of other kernel functions such as the Luma 4x4 intra prediction. Due to page limitations, this has not been explained in detail in this paper.

REFERENCES

[1] Thomas Wiegand, Gary J. Sullivan, Gisle Bjontegaard, and Ajay Luthra, “Overview of the H.264/AVC Video Coding Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, Jul. 2003.
[2] Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture [Online]. Available: http://www.intel.com/products/processor/manuals.
[3] K. Diefendorff, P. K. Dubey, R. Hochsprung, and H. Scales, “AltiVec Extension to PowerPC Accelerates Media Processing,” IEEE Micro, vol. 20, no. 2, pp. 85-95, Mar.-Apr. 2000.
[4] Yong-Hwan Kim, Jin-Woo Yoo, Seong-Won Lee, Joonki Paik, and Byeongho Choi, “Optimization of H.264 Encoder Using Adaptive Mode Decision and SIMD Instructions,” Proc., International Conference on Consumer Electronics, pp. 289-290, Jan. 2005.
[5] Yu Shengfa, Chen Zhenping, and Zhuang Zhaowen, “Instruction-Level Optimization of H.264 Encoder Using SIMD Instructions,” Proc., International Conference on Communications, Circuits and Systems Proceedings, vol. 1, pp. 126-129, Jun. 2006.
[6] Marco Raggio, Massimo Bariani, Ivano Barbieri, and Davide Brizzolara, “H.264 Implementation on SIMD VLIW Cores,” STreaming Day 07, Genova, September 2007.
[7] Juyup Lee, Sungkun Moon, and Wonyong Sung, “H.264 Decoder Optimization Exploiting SIMD Instructions,” Proc., IEEE Asia-Pacific Conference on Circuits and Systems, vol. 2, pp. 1149-1152, Dec. 2004.
[8] Lv Huayi, Ma Lini, and Liu Hai, “Analysis and Optimization of the UMHexagonsS Algorithm in H.264 based on SIMD,” Communication Systems, Networks and Applications, pp. 239-244, Jun.-Jul. 2010.
[9] Ali R. Iranpour and Krzysztof Kuchcinski, “Evaluation of SIMD Architecture Enhancement in Embedded Processors for MPEG-4,” Proc., Euromicro Symposium on Digital System Design, pp. 262-269, Aug. 2004.
[10] Ye Jianhong and Liu Jilin, “Fast Parallel Implementation of H.264/AVC Transform Exploiting SIMD Instructions,” Proc., International Symposium on Intelligent Signal Processing and Communication Systems, pp. 870-873, Nov. 2007.
[11] Joohyun Lee, Gwanggil Jeon, Sangjun Park, Taeyoung Jung, and Jechang Jeong, “SIMD Optimization of the H.264/SVC Decoder with Efficient Data Structure,” Proc., IEEE International Conference on Multimedia and Expo, pp. 69-72, 2008.
[12] Stephen Warrington, Hassan Shojania, Subramania Sudharsanan, and Wai-Yip Chan, “Performance Improvement of the H.264/AVC Deblocking Filter Using SIMD Instructions,” Proc., IEEE International Symposium on Circuits and Systems, pp. 21-24, May 2006.
[13] Deepu Talla, Lizy Kurian John, and Doug Burger, “Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements,” IEEE Transactions on Computers, vol. 52, no. 8, pp. 1015-1031, Aug. 2003.
[14] Mauricio Alvarez, Esther Salami, Alex Ramirez, and Mateo Valero, “Performance Impact of Unaligned Memory Operations in SIMD Extensions for Video Codec Application,” Proc., IEEE International Symposium on Performance Analysis of Systems and Software, pp. 62-71, Apr. 2007.
[15] Deependra Talla, Architectural Techniques to Accelerate Multimedia Applications on General-Purpose Processors, Ph.D. dissertation, University of Texas at Austin, 2001.
[16] Jarno K. Tanskanen, Tero Sihvo, and Jarkko Niittylahti, “Byte and Modulo Addressable Parallel Memory Architecture for Video Coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 14,
no. 11, pp. 1270-1276, Nov. 2004.
[17] Hoseok Chang, Junho Cho, and Wonyong Sung, “Performance Evaluation of an SIMD Architecture with a Multi-bank Vector Memory Unit,” Proc., IEEE Workshop on Signal Processing Systems Design and Implementation, pp. 71-76, Oct. 2006.
[18] Georgi Kuzmanov, Georgi Gaydadjiev, and Stamatis Vassiliadis, “Multimedia Rectangularly Addressable Memory,” IEEE Transactions on Multimedia, vol. 8, no. 2, pp. 315-322, Apr. 2006.
[19] Zhi Zhang, Xiaolang Yan, and Xing Qin, “An Efficient Programmable Engine for Interpolation of Multi-Standard Video Coding,” Proc., IEEE International Conference on ASIC, pp. 750-753, Oct. 2007.
[20] Kunjie Liu, Xing Qin, Xiaolang Yan, and Li Quan, “A SIMD Video Signal Processor with Efficient Data Organization,” IEEE Asian Solid-State Circuits Conference, pp. 115-118, 2006.
[21] S. Seo, M. Woh, S. Mahlke, T. Mudge, S. Vijay, and C. Chakrabarti, “Customizing Wide-SIMD Architecture for H.264,” IEEE International Symposium on Systems, Architecture, Modeling and Simulation, pp. 172-179, Jul. 2009.
[22] Yu-Wen Huang, Bing-Yu Hsieh, Tung-Chien Chen, and Liang-Gee Chen, “Analysis, Fast Algorithm, and VLSI Architecture Design for H.264/AVC Intra Frame Coder,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 3, pp. 378-401, Mar. 2005.
[23] X264 Free H.264/AVC Encoder [Online]. Available: http://www.videolan.org/developers/x264.html.
[24] Wing-Yee Lo and Simon Moy, “Configurable SIMD Processor Instruction Specifying Index to LUT Storing Information for Different Operation and Memory Location for Each Processing Unit,” U.S. Patent 7,441,099 B2, filed October 2006, granted October 21, 2008.
[25] Simon Knowles, “Apparatus and Method for Configurable Processing,” US 2006/0253689 A1, published November 9, 2006.
[26] Keith Diefendorff and Pradeep K. Dubey, “How Multimedia Workloads Will Change Processor Design,” Computer, vol. 30, iss. 9, pp. 43-45, Sep. 1997.
[27] H.264/AVC JM Software Reference Model [Online]. Available: http://iphome.hhi.de/suehring/html.
[28] D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., 1996.
Wing-Yee Lo received the B.Eng. (hons) degree from Northumbria University, UK, and the M.Phil. degree from the Chinese University of Hong Kong, both in Electronics Engineering. She has more than 10 years of ASIC design experience at Motorola Semiconductors Hong Kong Ltd., VTech Communications Ltd., Hong Kong Applied Science and Technology Institute, and Beijing SimpLight Nanoelectronics Ltd. She is very familiar with the ASIC design flow and has worked on various SoC chips for mobile and consumer products, video processor architectural analysis and parallel processor designs. She has been with a Shenzhen startup company as Director of ASIC Engineering since 2009. She is currently a doctoral candidate at the Hong Kong Polytechnic University.
Daniel Pak-Kong Lun (M’91) received the B.Sc. (hons.) degree from the University of Essex, Essex, U.K., and the Ph.D. degree from the Hong Kong Polytechnic University, Hong Kong, in 1988 and 1991, respectively. He is now an Associate Professor and Associate Head of the Department of Electronic and Information Engineering, the Hong Kong Polytechnic University. His research interests include digital signal processing, wavelets, and Multimedia Technology. Dr. Lun is a Chartered Engineer and corporate member of the IET and HKIE. (Home Page : http://www.eie.polyu.edu.hk/~enpklun)
Wan-Chi Siu (M’77, SM’90) received the MPhil degree from The Chinese University of Hong Kong and the PhD Degree from Imperial College of Science, Technology & Medicine in October in 1977 and 1984 respectively. He joined The Hong Kong Polytechnic University as a Lecturer in 1980 and has become Chair Professor in the Department of Electronic and Information Engineering since 1992. He was Head of the same department and subsequently Dean of Engineering Faculty between 1994 and 2002. He is now Director of Centre for Signal Processing of the same university. He is an expert in Digital Signal Processing, specializing in fast algorithms and video coding, and has published 380 research papers, over 160 of which appeared in international journals, such as IEEE Transactions on CSVT. His research interests also include transforms, image coding, wavelets, and computational aspects of pattern recognition. Professor Siu has been/was Guest Editor, Associate Editor and Member of editorial board of a number of journals, including IEEE Transactions on Circuits and Systems, Pattern Recognition, Journal of VLSI Signal Processing Systems for Signal, Image, Video Technology, and the EURASIP Journal on Applied Signal Processing. He is a very popular lecturing staff member within the University, while outside the University he has been a keynote speaker of over 10 international/national conferences in the recent 10 years, and an invited speaker of numerous professional events, such as IEEE CPM’2002 (keynote speaker, Taipei, Taiwan), IEEE ISIMP’2004 (keynote speaker, Hong Kong), and IEEE ICICS’07 (invited speaker, Singapore) and IEEE ICNNSP’2008 (keynote speaker, Zhenjiang). 
He is the organizer of many international conferences, including MMSP’08 (Australia) as General Co-Chair, and three IEEE Society sponsored flagship conferences: ISCAS’1997 as Technical Program Chair; ICASSP’2003 as General Chair; and recently ICIP’2010 as General Chair (2010 IEEE International Conference on Image Processing, which was held in Hong Kong, 26-29 September 2010). Prof. Siu is also the President Elect (2011-13) of a new professional association, the “Asia-Pacific Signal and Information Processing Association”, APSIPA. He is a member (2010-2012) of the Engineering Panel and was also a member of the Physical Sciences and Engineering Panel (1991-1995) of the Research Grants Council (RGC), Hong Kong Government. In 1994, he chaired the first Engineering and Information Technology Panel of the Research Assessment Exercise (RAE) to assess the research quality of 19 departments from all universities in Hong Kong. (Home Page : http://www.eie.polyu.edu.hk/~wcsiu/mypage.htm)

Wendong Wang received the B.S. degree in electrical engineering from Shandong University, China, and the M.S. degree in computer science from Beijing University of Technology, China, in 1997 and 2004, respectively. He is a senior software engineer at SimpLight Nanoelectronics Ltd., Beijing, and focuses on computer architecture analysis and video processing algorithm development.
Jiqiang Song (M’01, SM’07) received the B.Sc. and Ph.D. degrees from Nanjing University, China, in 1996 and 2001, respectively, both in Computer Science and Application. He worked in the Department of Computer Science and Engineering of the Chinese University of Hong Kong as Postdoctoral Fellow from 2001 to 2004. After that, he joined Hong Kong Applied Science and Technology Institute as Algorithm Lead in a video processor project. In 2006, he worked in Simplight Nanoelectronics Ltd., Beijing, as R&D Director of Multimedia and engaged in multimedia SIMD processor development. He joined Intel Labs China as Staff Research Scientist in 2008. His research interests include graphics recognition, video encoding, image and video processing. He has published over 30 research papers in international journals and conferences.