S.-H. Lee et al.: Lossless Frame Memory Recompression for VideoCodec Preserving Random Accessibility of Coding Unit
Lossless Frame Memory Recompression for Video Codec Preserving Random Accessibility of Coding Unit

Sang-Heon Lee, Moo-Kyoung Chung, Sung-Mo Park, and Chong-Min Kyung, Fellow, IEEE

Abstract — In recent video applications such as MPEG or H.264/AVC, the bandwidth requirement of the frame memory has become one of the most critical problems. Compressing pixel data before storing them in the off-chip frame memory is required to alleviate this problem. In this paper, we propose a lossless frame memory recompression scheme including 1) a lossless pixel compression algorithm, 2) an efficient address table organization method for random accessibility, and 3) a frame memory placement scheme for compressed data that reduces the effective access time of SDRAM by suppressing row switching. Experimental results show that the proposed method reduces the frame data to 48% of the uncompressed size in an H.264/AVC high profile encoder system, where 6.1 kB of SRAM is required for the address table of full HD video.

Index Terms — Frame memory recompression, lossless compression, address table, SDRAM.
I. INTRODUCTION

Multimedia applications such as MPEG or H.264/AVC inherently require heavy data traffic between the video processing core and the frame memory to read the original or reference frame and to write the reconstructed frame. As the size of a frame memory that stores the original, reference, and reconstructed frames approaches tens of megabytes, the frame memory is usually placed outside the video processing chip, and off-chip SDRAM is mainly used as the frame memory for its integration density. The bandwidth requirement between a video processing chip and the off-chip memory steadily increases, as the demand for higher resolution persists and the complexity of video standards grows for higher coding efficiency. For example, the H.264/AVC high profile adopted data transfer-intensive coding tools such as bidirectional prediction, interlaced video, and multiple reference frames, which induce heavy data references to the off-chip memory. Data transfer has thus become one of the most serious obstacles in designing a high performance multimedia system. The bandwidth requirement for an H.264/AVC high profile encoder with full HD video is about 5 Gb/s-22.4 Gb/s depending on the encoding options, such as the motion estimation (ME) algorithm, bidirectional prediction, interlaced/progressive scanning, and the number of reference frames. On the other hand, a state-of-the-art DDR3 SDRAM [15] operating at an 800 MHz clock with a 16-bit organization achieves only 12.8 Gb/s of bandwidth, even ignoring the inherent latency of SDRAM due to row activation, refresh, etc.

Recently, high performance multimedia features have been widely adopted by mobile consumer products such as camcorders, video cell phones, digital cameras, and notebooks. These devices are very sensitive to power consumption and package cost. The high memory bandwidth requirement induces a high clock frequency and an expensive memory system, and thus should be minimized. Frame memory recompression (FMR) compresses the pixel data before they are written to the frame memory so that the data communication bandwidth to/from the frame memory decreases. Fig. 1 is a conceptual block diagram describing a video encoder system with FMR.

Sang-Heon Lee is with the Electrical Engineering Department, Korea Advanced Institute of Science and Technology, Daejeon, 305-701, Korea (e-mail: [email protected]).
Moo-Kyoung Chung is with the Electronics and Telecommunications Research Institute, Daejeon, 305-700, Korea (e-mail: [email protected]).
Sung-Mo Park is with the Electronics and Telecommunications Research Institute, Daejeon, 305-700, Korea (e-mail: [email protected]).
Chong-Min Kyung is with the Electrical Engineering Department, Korea Advanced Institute of Science and Technology, Daejeon, 305-701, Korea (e-mail: [email protected]).
Contributed Paper
Manuscript received August 26, 2009
Fig. 1. A video encoder system with FMR
There are five main desirable features in FMR [2]. First, compression performance is to be maximized to minimize the off-chip memory access bandwidth. Data compression is performed on each block, called the basic compression unit or accessing unit, and the compression ratio (CR) is defined as

CR = (compressed frame size) / (original frame size).

Second, a bigger block generally results in a better CR, but with a higher probability of reading unnecessary data; the optimal block size must be determined at the trade-off point between the two. Third, it is important to minimize the distortion or loss of data. Fourth, to reduce the power consumption, design cost, and system operating cost, it is critical to minimize the compression/decompression hardware. Finally, to reduce the effective bandwidth, the latency of compression as well as decompression must be minimized. In this paper we propose a lossless compression methodology satisfying these features, i.e., minimizing the data rate, data loss/signal distortion, hardware complexity, and latency for compression/decompression.

0098-3063/09/$20.00 © 2009 IEEE
Authorized licensed use limited to: Korea Advanced Institute of Science and Technology. Downloaded on March 15,2010 at 04:55:01 EDT from IEEE Xplore. Restrictions apply.

There are various compression methods such as adaptive vector quantization (AVQ) [1], adaptive dynamic range coding (ADRC) [2], adaptive differential pulse code modulation (ADPCM) [3], 2-D ADPCM [4], Golomb-Rice (GR) coding with the modified Hadamard transform (MHT) [5], GR coding with DPCM [6], predictive pattern decision [7], and adjusted binary code (ABC) [8]. These are basically lossy compression methods with an incurred loss of about 0.5 dB-3 dB, which often deters their adoption in practical applications.

Recently, some lossless compression methods have been presented [9]-[11]. The differential of adjacent pixel (DAP) method with Huffman coding shows about 40% data reduction [9]. The dictionary-based algorithm proposed in [10] results in about 30% reduction. In [11], MHT and adaptive Golomb-Rice (AGR) coding are used, resulting in about 40% reduction. However, these schemes do not provide a concrete method for accessing the compressed data. To achieve a real bandwidth reduction, the data must be accessible on demand with their exact length, without wasting bandwidth on unrequested data. Because lossless compression cannot guarantee that the length of the compressed data is smaller than a certain value, a data access method without excessive overhead is strongly required.

There are other approaches based on data reordering to reduce the SDRAM access time. The row activation performed before accessing data stored in SDRAM is a time- and power-consuming process. In [12]-[14], architectural approaches were proposed to minimize the number of row changes or to hide the row change latency. However, all these works are applicable only to raw, i.e., uncompressed, pixel data.

This paper proposes a low-power lossless FMR scheme.
The contribution of this paper is three-fold: 1) a simple lossless compression method, 2) an efficient address table organization scheme allowing data access at an arbitrary position, and 3) a memory placement scheme for compressed data to reduce the memory access latency and power consumption. The proposed scheme is scalable with respect to compression unit size, bus width, and SDRAM size. For a consistent explanation, we adopt a specific example throughout: an 8x8 compression unit, a 64-bit bus width, and a 2048-byte page size with 1088 pages per bank, which allows one full HD frame to be stored in a bank. The rest of this paper is organized as follows. Section II describes the proposed lossless compression method, and Section III describes an efficient address table organization scheme. A low-latency memory placement scheme is described in Section IV. Experimental results are shown in Section V, and Section VI concludes the paper.
IEEE Transactions on Consumer Electronics, Vol. 55, No. 4, NOVEMBER 2009
II. PIXEL DATA COMPRESSION

The ultimate purpose of the proposed FMR scheme is the reduction of memory bandwidth while sustaining the video quality; the compression process thus needs to have small latency and to be lossless. Traditionally, for video or image data compression, a combination of transform and quantization is widely used to reduce redundancy. A transform such as the discrete cosine transform (DCT) or the Hadamard transform (HT) converts data into the frequency domain, and by eliminating high-frequency components with quantization, a good CR can be achieved at some sacrifice of video quality. However, to keep the video quality, quantization is not included in the proposed compression process; a transform is a complex process needed only as a preprocessing step for quantization, and by itself it brings little merit.

To limit the decompression latency to within a few cycles, we propose the so-called hierarchical minimum and difference (HMD) method, which uses only add operations in the decompression process. First, the minimum pixel value within each 2x2 block, called 'min2x2', is found. Then the difference between each pixel value and the relevant min2x2 is computed for each pixel, called 'diffpixel'. Next, for each 4x4 block, the minimum among its four min2x2's is found, called 'min4x4', and the difference between each min2x2 and the relevant min4x4 is computed, called 'diff2x2'. This process is repeated for bigger blocks. Fig. 2 describes this process.
Fig. 2. HMD algorithm example for a 8x8 block
Decompression, i.e., the calculation of the value of each pixel, is performed by adding up the differences over all levels as described in (1), where a, b, and c are the indices of the 4x4 block, 2x2 block, and pixel, respectively.

pixel[a][b][c] = min8x8 + diff4x4[a] + diff2x2[a][b] + diffpixel[a][b][c]    (1)
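The HMD decomposition and its addition-only reconstruction can be sketched as follows. This is a minimal Python illustration of the steps above; the function names are ours and not from the paper's hardware implementation.

```python
def hmd_decompose(block):
    """block: 8x8 list of pixel values (0-255).
    Returns (min8x8, diff4x4, diff2x2, diffpixel) per the HMD method."""
    # Level 0: minimum of each 2x2 block, and per-pixel differences.
    min2x2 = [[min(block[2*r][2*c], block[2*r][2*c+1],
                   block[2*r+1][2*c], block[2*r+1][2*c+1])
               for c in range(4)] for r in range(4)]
    diffpixel = [[block[r][c] - min2x2[r//2][c//2] for c in range(8)]
                 for r in range(8)]
    # Level 1: minimum of each 4x4 block (four min2x2's), and diff2x2.
    min4x4 = [[min(min2x2[2*r][2*c], min2x2[2*r][2*c+1],
                   min2x2[2*r+1][2*c], min2x2[2*r+1][2*c+1])
               for c in range(2)] for r in range(2)]
    diff2x2 = [[min2x2[r][c] - min4x4[r//2][c//2] for c in range(4)]
               for r in range(4)]
    # Level 2: the single 8x8 minimum, and diff4x4.
    min8x8 = min(min4x4[0][0], min4x4[0][1], min4x4[1][0], min4x4[1][1])
    diff4x4 = [[min4x4[r][c] - min8x8 for c in range(2)] for r in range(2)]
    return min8x8, diff4x4, diff2x2, diffpixel

def hmd_reconstruct(min8x8, diff4x4, diff2x2, diffpixel):
    """Decompression per (1): only additions are needed."""
    return [[min8x8 + diff4x4[r//4][c//4] + diff2x2[r//2][c//2]
             + diffpixel[r][c] for c in range(8)] for r in range(8)]
```

The round trip is exactly lossless, and every stored difference is non-negative, which is what makes the entropy coding of the next step effective.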
Fig. 3. The cumulative distribution of difference values
Fig. 3 shows the cumulative distribution of difference values for the 8x8 block size, over the range 0 to 15 out of (0, 255), extracted from the reference and reconstructed frames generated during the encoding of seven full HD videos. The frequency of difference values is very high near zero and decreases rapidly. In the next step, variable length coding (VLC) is used to reduce the bit length. As shown in Fig. 3, the frequency of occurrence of the difference value 0 ranges from 40% to 71%, and in five of the seven videos more than 83% of the differences are covered by the values 0 to 3. Based on these observations, we can expect the VLC to compress the differences effectively. We adopted Exp-Golomb coding, an entropy code that assigns short codewords to small values. If the length of the compressed data exceeds that of the original data, the block is stored uncompressed. One bit is assigned as a mode indicator: 1 for compression and 0 for no compression. Fig. 4 shows the data packing method in the case of the 8x8 block size. A compressed data pack consists of a 1-bit mode indicator, an 8-bit minimum value, the differences between each minimum and the next lower-level minimum, and the differences between the lowest-level minimum and the pixels. Some latency in the compression process due to the minimum value search, Exp-Golomb coding, and packing is acceptable because the compressed data are used as a reference for the
next incoming frame. The decompression process requires two steps: Exp-Golomb decoding, and adding the minimum and difference values. In [5], a low-latency GR decoding hardware implementation is presented in which seven code elements are decoded per cycle; the architecture is applicable to an Exp-Golomb decoder. Because there are 84 code elements in an 8x8 compressed block, 12 cycles are needed. Assuming a pipelined architecture and one cycle for the additions, a 13-cycle latency is required to decompress an 8x8 block.

The proposed compression method is scalable to bigger blocks. The compression unit can be determined according to the characteristics of the video and the features of the encoder algorithm, such as motion estimation and intra prediction. The address table size is another factor in deciding the compression unit, because a smaller compression unit requires more address table entries and, therefore, a larger address table. This is examined in detail in Section III.

III. ADDRESS TABLE COMPRESSION

The bit length after the lossless compression is data-dependent and, therefore, differs for each video. To be able to access data of arbitrary length on demand, the starting address and length must be pre-stored in an address table. The size of the address table can be calculated as follows:

table size = (# of blocks) × (# of bits per address) / 8
# of blocks = (frame size) / (compression unit size)        (2)
# of bits per address = ⌈log2(frame size (bytes))⌉
In this calculation, only the starting address is considered, assuming that the bit length of a block can be obtained as the difference between the starting address of the current block and that of the next block. Table I shows the required address table size according to frame resolution and compression unit size. If a full HD frame is compressed in units of 8x8 blocks, the address table amounts to 85 kB. Internal SRAM is preferred to external SDRAM to store the
Fig. 4. Compressed pixel data packing scheme
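The entropy coding and packing of Fig. 4 can be sketched as follows, assuming order-0 Exp-Golomb codewords (the paper does not state the order; `exp_golomb` and `pack_block` are illustrative names). For an all-equal block the packed length comes out to 1 + 8 + 84 = 93 bits, the minimum derived later in Section III-A, while the uncompressed fallback is 1 + 512 = 513 bits.

```python
def exp_golomb(n):
    """Order-0 Exp-Golomb codeword for a non-negative integer n,
    as a bit string: (len-1) zeros followed by the binary of n + 1."""
    b = bin(n + 1)[2:]
    return '0' * (len(b) - 1) + b

def pack_block(pixels, min8x8, diffs):
    """Pack one 8x8 block in the spirit of Fig. 4: a 1-bit mode
    indicator, then either the 8-bit minimum followed by Exp-Golomb
    codes of the 84 differences (4 diff4x4's, 16 diff2x2's,
    64 diffpixels), or the 64 raw pixels when that is shorter."""
    assert len(pixels) == 64 and len(diffs) == 84
    compressed = '1' + format(min8x8, '08b') \
                     + ''.join(exp_golomb(d) for d in diffs)
    raw = '0' + ''.join(format(p, '08b') for p in pixels)
    return compressed if len(compressed) < len(raw) else raw
```

The mode decision guarantees that a packed block never exceeds the 513-bit raw size, which bounds the maximum quantized length used in Section III.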
TABLE I
ADDRESS TABLE SIZE ACCORDING TO RESOLUTION AND COMPRESSION UNIT

                      # of bits    | 4x4                  | 8x8                  | 16x16
Resolution            per address  | # blocks  size (kB)  | # blocks  size (kB)  | # blocks  size (kB)
Full HD (1920x1080)   21           | 129600    340.2      | 32400     85         | 8100      21.3
HD720 (1280x720)      20           | 57600     144        | 14400     36         | 3600      9
D1 (720x480)          19           | 21600     51.3       | 5400      12.8       | 1350      3.2
VGA (640x480)         19           | 19200     45.6       | 4800      11.4       | 1200      2.8
CIF (352x288)         17           | 6336      13.5       | 1584      3.4        | 396       0.8
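Equation (2) can be checked directly; the sketch below takes the frame size as the luminance plane in bytes, which reproduces the Table I entries (e.g., 21 address bits and 85 kB for full HD with an 8x8 unit). The helper name is ours.

```python
from math import ceil, log2

def address_table_size(width, height, unit):
    """Uncompressed address table size in bytes per (2); the frame
    size is taken as the luminance plane of width*height bytes."""
    n_blocks = (width * height) // (unit * unit)
    bits_per_address = ceil(log2(width * height))
    return n_blocks * bits_per_address / 8
```

For full HD with an 8x8 unit this gives 32400 blocks × 21 bits / 8 = 85050 bytes, i.e., the 85 kB cited in the text.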
Fig. 5. Compressed length distribution of frames of full HD videos: (a) blue sky, and (b) rush hour, in case of 8x8 compression unit and 64-bit bus width
address table to avoid off-chip traffic overhead and communication latency. However, because the table is too large to fit into the internal SRAM, it needs to be compressed in a lossless way with a good enough CR.

A. Length

First, we quantize the length of the compressed data in units of a step size equal to the off-chip bus width between the SDRAM and the controller. For example, when a compressed bit length is 100 bits and the bus width is 64 bits, two transfer cycles are required, so the length can be regarded as 128 bits, or simply represented as 2 units when normalized by 64 bits. The maximum size of a compressed 8x8 block is 1 + 512 = 513 bits, as described in Section II. When all the pixels have the same value, all the differences become zero, and the compressed data become minimal, which is calculated to be 93 bits as shown in (3):

Min_Length = 1 bit (mode indicator) + 8 bits (min value) + 4 × 1 bit (4 diff4x4's) + 16 × 1 bit (16 diff2x2's) + 64 × 1 bit (64 diffpixels) = 93 bits    (3)

Because the quantized value of the minimum size is ⌈93/64⌉ = 2 and that of the maximum is ⌈513/64⌉ = 9, there are in total eight steps, which can be expressed by a 3-bit index by mapping (2, 9) into (0, 7), as shown in Fig. 6. Further compression is possible due to the locality, i.e., the similarity between neighbors, in the compressed length as well as in the raw pixel data. Fig. 5 shows the contour of equal quantized bit length of compressed blocks of full HD frames. The compressed length changes smoothly, which allows us to compress it based on this locality property. For fixed-length compression, and thus a fixed address table size, we adopted a pattern matching method. The proposed compression process is described in Fig. 7. In this example, the quantized sizes of the compression unit blocks are (n1, n2, n3, n4) = (4, 5, 7, 8). Each component of the (n1, n2, n3, n4) quadruplet is coded as a 3-bit code. Each pattern (n1, n2, n3, n4) is matched to and denoted by (p1, p2, p3, p4) from the template of limited
number of patterns, such that each ni is minimally covered by pi, i.e., ni ≤ pi with minimal Σ(pi − ni) among all patterns in the template. As the storage and access of the compressed data occur according to the normalized size of (p1, p2, p3, p4), not (n1, n2, n3, n4), the difference between ni and pi corresponds to a loss of bandwidth and storage, which is minimized by selecting the pattern with minimal distance. The total difference in this example is 2 (due to
Fig. 6. Quantization and 3-bit coding for compressed length
2→3 and 5→6 in the approximation of (2, 3, 5, 6) by (3, 3, 6, 6)). This mapping induces two additional clock cycles to access that block and two extra storage units. Each pattern is represented by three parameters: base level, step size, and pattern index, as shown in Fig. 7(c).
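The pattern matching step can be sketched as an exhaustive search over a hypothetical template. A pattern here is generated from a base level, a step size, and a binary shape; the shape set below is illustrative and does not reproduce the paper's exact pattern sets, but it contains the Fig. 7 pattern, so the example (2, 3, 5, 6) → (3, 3, 6, 6) with total waste 2 is recovered.

```python
from itertools import product

# Illustrative (not the paper's) template: each pattern is
# base + step * shape[i], with a binary shape per quadruplet position.
SHAPES = [(0, 0, 0, 0), (0, 0, 1, 1), (0, 1, 0, 1), (0, 1, 1, 0),
          (1, 0, 0, 1), (1, 0, 1, 0), (1, 1, 0, 0), (1, 1, 1, 1)]
STEPS = (1, 3)  # the two step-size options of Table II

def best_pattern(n):
    """Find (p1..p4) covering the quantized sizes n (n_i <= p_i)
    with minimal total waste sum(p_i - n_i)."""
    best = None
    for base, step, shape in product(range(8), STEPS, SHAPES):
        p = tuple(base + step * s for s in shape)
        if all(pi >= ni for pi, ni in zip(p, n)):
            waste = sum(p) - sum(n)
            if best is None or waste < best[0]:
                best = (waste, p, base, step)
    return best
```

The waste returned is exactly the number of extra transfer cycles and storage units incurred by the approximation, i.e., 2 in the Fig. 7 example.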
TABLE II
BIT LENGTH OF EACH ADDRESS COMPONENT

Component          | Bit width expression                                                  | Bit width value for the example case
Row offset address | ⌈log2((# of blocks in a width × max. block size)/(SDRAM page size))⌉  | ⌈log2(((1920/16)×4)×513/(2048×8))⌉ = 4
Column address     | ⌈log2((SDRAM page size)/(bus width))⌉                                 | ⌈log2((2048×8)/64)⌉ = 8
Base level         | ⌈log2((max. block size − min. block size)/(bus width))⌉               | ⌈log2((513−93)/64)⌉ = 3
Step size          | ⌈log2(# of options for step size)⌉                                    | 1; '0' for step size 1, and '1' for step size 3
Pattern index      | ⌈log2(# of patterns)⌉                                                 | 2, 3, and 4 for 4, 8, and 16 patterns, respectively
Fig. 7. Conceptual illustration of pattern based compression: (a) four compression unit blocks, (b) coding and approximation, and (c) code components
Fig. 9 shows three different pattern sets of 4, 8, and 16 patterns, which require 2-, 3-, and 4-bit indices, respectively. The pattern set, as well as the pattern within each set, is selected according to the locality property. Which pattern set size to use is determined at design time based on the trade-off between address table size and bandwidth reduction.
Fig. 9. Pattern set
For better address compression we group several adjacent patterns. The patterns in a pattern group share the same base level and step size. If the pattern group becomes too large, enforcing the same base level and step size can result in excessive losses of storage and bandwidth; there is thus a trade-off between the bandwidth reduction and the address table size reduction.

Fig. 8. Start address: (a) original frame, and (b) compressed block placement in SDRAM

B. Start address

For random access of each coding unit, such as a macroblock (MB) in H.264, we need to store the start address
of each pattern group, which includes one or more patterns. The start address consists of a row address and a column address. To reduce the row address overhead, the row address is divided into a base and an offset address. For the starting block of every MB row in the uncompressed frame, i.e., MB(0, y) in Fig. 8(a), the SDRAM row address is stored in a table as the 'base address'. For the other blocks in the same row of the uncompressed frame, only the offset address relative to the base address is stored. The column address of the first block of each pattern group is stored as in Fig. 8(b). Within each pattern group, the position of a block relative to the first block can be calculated by summing the sizes of the blocks in between.

C. Packing

Fig. 10 shows the address packing scheme. An address entry representing a pattern group consists of the row offset address, column address, base level, step size, and pattern indices. The bit length of each component can be calculated as in Table II.
Fig. 10. Address packing scheme
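The packing of one address entry with the Table II field widths can be sketched as follows; the bit order within the entry is our assumption, and `pack_entry` is an illustrative name.

```python
def pack_entry(row_offset, column, base_level, step_sel, pattern_ids,
               index_bits=2):
    """Pack one address entry in the spirit of Fig. 10 / Table II:
    4-bit row offset, 8-bit column address, 3-bit base level, 1-bit
    step-size selector, then one index per pattern in the group.
    Returns (packed word, total bit width)."""
    fields = ((row_offset, 4), (column, 8), (base_level, 3),
              (step_sel, 1)) + tuple((p, index_bits) for p in pattern_ids)
    word, width = 0, 0
    for value, bits in fields:
        assert 0 <= value < (1 << bits)   # each field must fit
        word = (word << bits) | value     # pack MSB-first
        width += bits
    return word, width
```

With four 2-bit pattern indices, the entry is 4 + 8 + 3 + 1 + 4×2 = 24 bits, matching the corresponding row of Table III.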
Table III shows the address table size according to the number of patterns in a group and the number of pattern index bits for full HD video. The necessary SRAM size varies from 3 kB to 20 kB. Note that the portion taken by the start address is large when the number of patterns in a group is small.

TABLE III
ADDRESS TABLE SIZE ACCORDING TO THE NUMBER OF PATTERNS IN A GROUP AND PATTERN INDEX BITS FOR FULL HD VIDEO

# of patterns | # of address | # of bits for | Total address      | Table size
in a group    | entries      | pattern index | bits               | (kB)
1             | 8160         | 2 / 3 / 4     | 18 / 19 / 20       | 18.4 / 19.4 / 20.4
2             | 4080         | 2 / 3 / 4     | 20 / 22 / 24       | 10.2 / 11.2 / 12.2
3             | 2720         | 2 / 3 / 4     | 22 / 25 / 28       | 7.5 / 8.5 / 9.5
4             | 2040         | 2 / 3 / 4     | 24 / 28 / 32       | 6.1 / 7.1 / 8.1
8             | 1020         | 2 / 3 / 4     | 32 / 40 / 48       | 4.1 / 5.1 / 6.1
16            | 510          | 2 / 3 / 4     | 48 / 64 / 80       | 3.1 / 4.1 / 5.1
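The Table III and (4) arithmetic can be reproduced as follows; `table_iii_size` and `row_base_table_bytes` are our helper names, and the 8160 entry count corresponds to one pattern (MB) per entry for a 120×68-MB full HD frame stored with 1088 lines.

```python
from math import ceil, log2

def table_iii_size(group, index_bits, mb_count=8160):
    """Address table size for full HD: one entry per group of `group`
    patterns; each entry carries a 4-bit row offset, 8-bit column,
    3-bit base level, 1-bit step size, and `group` pattern indices of
    `index_bits` bits each (Table II).
    Returns (entries, bits per entry, size in kB)."""
    entries = mb_count // group
    bits = 4 + 8 + 3 + 1 + group * index_bits
    return entries, bits, entries * bits / 8 / 1000

def row_base_table_bytes(frame_height=1088, mb=16, sdram_rows=1088):
    """Row base address table size in bytes per (4)."""
    return (frame_height // mb) * ceil(log2(sdram_rows)) / 8
```

For example, four patterns per group with 2-bit indices gives 2040 entries of 24 bits, i.e., the 6.1 kB configuration used in Section V, and the row base table is 68 × 11 / 8 = 93.5 bytes.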
On the other hand, the row base address table size is determined by (4):

Row_base_size = (# of blocks in a height) × ⌈log2(# of SDRAM rows)⌉ = (1088/16) × ⌈log2(1088)⌉ = 748 bits = 93.5 bytes    (4)

Note that the address table needs to be accessed once per pattern group, not once per compression unit. For a burst access within a pattern group, the influence of the table access latency on the bandwidth is negligible.

IV. REFERENCE MEMORY PLACEMENT SCHEME

SDRAM is usually used as a large-capacity off-chip frame memory with inherent latency. The allocation of the compressed data in SDRAM must be performed carefully, because it affects the memory access latency. Several works address this for uncompressed data [12]-[14]. With raw, i.e., uncompressed, pixel data, allocating block data such as MBs in a regular form is possible and natural. A regular allocation method could also be used for compressed data by placing data at fixed positions, leaving empty space between blocks. However, if the compressed data of arbitrary bit length are allocated compactly, forming an irregular data placement, one can save memory space and bandwidth. Another benefit of the irregular placement is the reduction of power consumption and latency, as the number of row changes decreases because fewer rows are needed to store a frame. Third, when consecutive MBs are read with a burst operation, the burst start address must be changed for every MB in the regular placement case, whereas in the irregular placement case the SDRAM controller need not drive the address signals many times, which saves power.

Generally, an SDRAM consists of 4 or 8 banks sharing an address buffer and an I/O buffer. Each bank has its own 2-D memory array, row decoder, column decoder, and sense amplifier. Accessing a memory cell requires two steps: first, a row is activated, meaning that the entire row is loaded into an internal buffer; then a column in the row is selected to read or write data. Once a row is activated, columns in the row can be accessed repeatedly without additional delay.
But to access data in another row, the activated row data in the buffer must first be written back into the memory array; this is the precharge process. After the precharge, a new row needs to be activated. Changing rows is a time- and power-consuming process and needs to be minimized. However, because each bank has its own sense amplifier and buffer, changing banks does not require a row change; thus, banks are usually utilized to avoid row changes.

The memory access style of each function module, such as motion estimation (ME), motion compensation (MC), intra prediction (IP), and the deblocking filter (DB), differs from the others. For example, the ME module requires the MBs of reference frames belonging to a search window which moves along the search center. When the search center moves horizontally, many of the MBs in the present search window overlap with those of the previous one. The MBs needed to be
accessed are located at the boundary of the search window, demanding vertical-direction access as shown in Fig. 11(b). On the other hand, the IP module requires the upper MBs in the horizontal direction, as shown in Fig. 11(a). Although each function module has its own access style, we can simply classify the accesses into vertical- and horizontal-direction MBs. The proposed memory placement scheme should therefore minimize row changes for these access styles.
Fig. 11. Reference memory access style of (a) IP module, and (b) ME module

Fig. 12 shows the proposed frame memory placement scheme for compressed data, assuming a four-bank SDRAM. In the proposed scheme, every fourth MB row of the original, uncompressed frame is mapped to the same bank, i.e., consecutive MB rows go to consecutive banks. Through this, horizontal-direction access requires no row change by definition, and vertical-direction access proceeds along adjacent banks without causing any row change. However, when the number of MBs to be accessed in the vertical direction is larger than the number of banks (four in this case), the fifth MB, which belongs to the same bank as the first MB, causes a row change. In this case, we can at least hide the row change latency by predicting the change, although the power consumption due to the row change is inevitable. To store an MB row, which comprises 1920/16 = 120 MBs for full HD video, several rows (pages) are required, as shown in Fig. 12(b). Some MBs may extend over two pages at the end of a page; accessing these MBs induces a row change, which, however, rarely occurs.

Fig. 12. Frame memory placement scheme for four-bank SDRAM: (a) original frame, and (b) placement in SDRAM

V. EXPERIMENTAL RESULTS

To evaluate the efficiency of the proposed compression scheme, experiments were performed with the reference software of the H.264/AVC high profile encoder for full HD video with the following options: the frame sequence is IBPBP, the QP value is 28, and the ME search range is [V:±16, H:±24]. The following results were obtained by encoding several 4:2:0 full HD videos for 30 frames.

TABLE IV
CR WITH THE PROPOSED HMD COMPRESSION METHODOLOGY FOR FULL HD VIDEOS

Video            | Y       | U       | V       | Total
blue_sky         | 34.19%  | 30.22%  | 26.59%  | 31.04%
pedestrian_area  | 34.52%  | 22.44%  | 21.61%  | 28.97%
riverbed         | 49.8%   | 25.91%  | 20.48%  | 39.53%
rush_hour        | 33.26%  | 22.7%   | 21.73%  | 28.17%
station2         | 40.3%   | 22.64%  | 22.96%  | 33.06%
sunflower        | 37.61%  | 36.78%  | 30.32%  | 34.79%
tractor          | 48.92%  | 36.64%  | 33.91%  | 42.98%
Average          | 39.79%  | 28.19%  | 25.37%  | 34.08%

Table IV shows the CR obtained by applying the proposed HMD pixel compression method to each YUV component. The U and V components, which usually have a high locality property, show a better CR. The CR varies from 28.19% to 42.98%, and the average is 34.08%. Table V describes the variation of the CR according to the compression block size, with and without 64-bit quantization. A bigger block shows a better CR. The performance degradation due to quantization is larger for smaller blocks, because there are more chances to access garbage bit area. Note that the accessibility is inversely proportional to the compression block size; thus the choice of the size depends on the features of the encoder/decoder system.
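The bank-interleaved placement and its effect on row changes can be modeled with a toy sketch, under the simplifying assumption that each MB row occupies one SDRAM row region per bank; the helper names are ours.

```python
NUM_BANKS = 4

def location(mb_x, mb_y):
    """Proposed placement for a four-bank SDRAM: consecutive MB rows
    go to consecutive banks, so MB row y lands in bank y % 4; the
    SDRAM row index y // 4 is a simplification (one row region per
    MB row per bank)."""
    return mb_y % NUM_BANKS, mb_y // NUM_BANKS

def row_changes(accesses):
    """Count row switches: a bank asked for a different row than the
    one it currently has activated must precharge and re-activate."""
    active, changes = {}, 0
    for bank, row in accesses:
        if bank in active and active[bank] != row:
            changes += 1
        active[bank] = row
    return changes
```

Horizontal access stays within one bank's activated row, and vertical access of up to four consecutive MB rows hits four distinct banks; only from the fifth consecutive MB row onward does a same-bank revisit force a row change, as described above.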
TABLE V
CR WITH HMD METHODOLOGY AND 64-BIT QUANTIZATION ACCORDING TO THE COMPRESSION BLOCK SIZE FOR LUMINANCE COMPONENT OF FULL HD VIDEOS

Video           | Quantization   | 4x4     | 8x8     | 16x16
blue_sky        | without quant. | 36.21%  | 34.19%  | 33.83%
                | with quant.    | 59.29%  | 39.66%  | 35.22%
pedestrian_area | without quant. | 36.51%  | 34.52%  | 34.15%
                | with quant.    | 56.68%  | 39.98%  | 35.53%
riverbed        | without quant. | 50.85%  | 49.8%   | 49.6%
                | with quant.    | 71.93%  | 55.25%  | 50.96%
rush_hour       | without quant. | 35.36%  | 33.26%  | 32.82%
                | with quant.    | 55.15%  | 38.61%  | 34.18%
station2        | without quant. | 41.98%  | 40.3%   | 39.97%
                | with quant.    | 61.69%  | 45.72%  | 41.33%
sunflower       | without quant. | 39.21%  | 37.51%  | 37.18%
                | with quant.    | 57.0%   | 42.93%  | 38.54%
tractor         | without quant. | 50.03%  | 48.92%  | 48.72%
                | with quant.    | 71.99%  | 54.39%  | 50.08%
Table VI shows the final CR including the address compression with the 2-bit pattern index. As a pattern group shares the same base level and step size, assigning many patterns to a group leads to a poorer CR. The CR varies from 45% to 55% according to the number of patterns in a group. Fig. 13 compares the CR according to the size of the pattern set, where '0' on the x-axis means no address compression. Naturally, more patterns yield a better performance, but the difference is not significant.
Fig. 13. CR variation according to number of patterns in a pattern group and the size of pattern set
Two previous works presenting lossless compression schemes [9], [11] were implemented in our simulation environment. Whereas [11] introduced line-based compression for display devices, we applied it to block-based compression for comparison. The CRs of [9] and [11] are 37.5% and 44.5%, respectively, whereas the proposed method results in a 47.6% CR with an address table size of 6.1 kB. The proposed scheme has two prominent aspects: parallelism and accessibility. The proposed prediction method, HMD, is more suitable for parallel decompression than [9], in which a block must be decompressed pixel by pixel because each prediction value depends on the adjacent pixel. This work is the first attempt to address the random accessibility problem, which is essential for lossless FMR. For that, an address table is required, which amounts to 85 kB for full HD video. As this is too large to be implemented on chip, it is compressed to 6.1 kB by the proposed table organization method. Because of the trade-off between pixel compression and address table compression, the pixel CR is slightly degraded in order to organize the address table with marginal resources.

VI. CONCLUSION

Reducing the memory bandwidth requirement of high resolution video systems is essential to decrease the clock frequency and memory system cost of mobile consumer products. To that end, we have proposed an FMR scheme that is lossless, low power, and randomly accessible. For simple and parallel decompression we proposed the HMD method. An address table organization scheme is presented for random accessibility with low internal SRAM overhead. To reduce the power and delay due to SDRAM row changes, a data placement method suitable for compressed data is also proposed. Experimental results on various video streams demonstrate that the proposed scheme achieves a good CR while keeping random accessibility and low latency, thereby reducing the bandwidth requirement without any loss of video quality.

Lossless compression is an important requirement for an FMR scheme to be applied to consumer products, and random accessibility to frame data must be guaranteed during the video coding process. As the proposed scheme is the first work satisfying these two necessities, it is applicable to practical applications.
TABLE VI
CR WITH ADDRESS TABLE COMPRESSION, ACCORDING TO THE NUMBER OF PATTERNS IN A PATTERN GROUP

                     Number of patterns in a pattern group          Without address
Stream              1        2        3        4        8       16     compression
blue_sky          37.76%   41.44%   42.91%   43.96%   44.82%   47.66%    52.97%
pedestrian_area   35.80%   39.17%   40.36%   41.39%   42.35%   45.23%    49.79%
riverbed          46.30%   50.00%   51.00%   51.78%   52.44%   54.64%    58.94%
rush_hour         34.92%   38.05%   39.13%   40.12%   40.93%   43.59%    47.79%
station2          39.86%   43.59%   44.79%   45.77%   46.62%   49.05%    53.15%
sunflower         41.58%   45.62%   46.77%   47.69%   48.47%   51.04%    54.85%
tractor           49.79%   54.44%   55.90%   57.03%   57.97%   60.93%    65.76%
Average           40.85%   44.61%   45.83%   46.82%   47.66%   50.30%    54.75%
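The trade-off in Table VI can be read off numerically. The short script below simply copies the Average row from the table and reports, for each pattern-group configuration, how far its average pixel CR falls short of the 54.75% obtained without address table compression; the 8-pattern configuration matches the 47.6% CR quoted in the text.

```python
# Average-row data copied from Table VI: average pixel CR (%) for each
# number of patterns in a pattern group, and without address compression.
patterns_per_group = [1, 2, 3, 4, 8, 16]
avg_cr = [40.85, 44.61, 45.83, 46.82, 47.66, 50.30]
cr_without_table_compression = 54.75

for n, cr in zip(patterns_per_group, avg_cr):
    loss = cr_without_table_compression - cr
    print(f"{n:2d} patterns/group: avg CR {cr:.2f}% "
          f"({loss:.2f} points below uncompressed-table CR)")
```

As the table shows, the CR loss shrinks monotonically as more patterns are allowed per group, so the configuration reported in the text (47.66% average CR with a 6.1kB table) sits partway along this curve rather than at either extreme.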
REFERENCES
[1] R. Bruni, A. Chimienti, M. Lucenteforte, D. Pau, and R. Sannino, "A novel adaptive vector quantization method for memory reduction in MPEG-2 HDTV decoders," IEEE Trans. Consum. Electron., vol. 44, no. 3, pp. 537-544, Aug. 1998.
[2] P. H. N. de With, P. H. Frencken, and M. van der Schaar-Mitrea, "An MPEG decoder with embedded compression for memory reduction," IEEE Trans. Consum. Electron., vol. 44, no. 3, pp. 545-555, Aug. 1998.
[3] D. Pau and R. Sannino, "MPEG-2 decoding with a reduced RAM requisite by ADPCM recompression before storing MPEG-2 decompressed data," U.S. Patent 5,838,597, Nov. 1998.
[4] J. Tajime, T. Takizawa, S. Nogaki, and H. Harasaki, "Memory compression method considering memory bandwidth for HDTV decoder LSIs," Proc. Int. Conf. on Image Process., pp. 779-782, Oct. 1999.
[5] T. Y. Lee, "A new frame-recompression algorithm and its hardware design for MPEG-2 video decoders," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 6, pp. 529-534, June 2003.
[6] Y. Lee, C. Rhee, and H. Lee, "A new frame recompression algorithm integrated with H.264 video compression," Proc. Int. Symp. on Circuits and Syst., pp. 1621-1624, May 2007.
[7] Y. V. Ivanov and D. Moloney, "Reference frame compression using embedded reconstruction patterns for H.264/AVC decoder," Proc. Int. Conf. on Digital Telecommun., pp. 168-173, July 2008.
[8] Y. Lee and T. Tsai, "An efficient embedded compression algorithm using adjusted binary code method," Proc. Int. Symp. on Circuits and Syst., pp. 2586-2589, May 2008.
[9] T. Song and T. Shimamoto, "Reference frame data compression method for H.264/AVC," IEICE Electron. Express, vol. 4, no. 3, pp. 121-126, Feb. 2007.
[10] H. Gao, F. Qiao, and H. Yang, "Lossless memory reduction and efficient frame storage architecture for HDTV video decoder," Proc. Int. Conf. on Audio, Lang. and Image Process., pp. 593-598, July 2008.
[11] T. Yng, B. Lee, and H. Yoo, "A low complexity and lossless frame memory compression for display devices," IEEE Trans. Consum. Electron., vol. 54, no. 3, pp. 1453-1458, Aug. 2008.
[12] S. Park, Y. Yi, and I. Park, "High performance memory mode control for HDTV decoders," IEEE Trans. Consum. Electron., vol. 49, no. 4, pp. 1348-1353, Nov. 2003.
[13] P. Zhang, W. Gao, D. Wu, and D. Xie, "An efficient reference frame storage scheme for H.264 HDTV decoder," Proc. Int. Conf. on Multimedia and Expo, pp. 361-364, July 2006.
[14] T. Song, T. Kishida, and T. Shimamoto, "Fast frame memory access method for H.264/AVC," IEICE Electron. Express, vol. 5, no. 9, pp. 344-348, May 2008.
[15] http://www.samsung.com/global/business/semiconductor/products/dram/Products_DRAM.html

Sang-Heon Lee received the BS and MS degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Korea, in 2001 and 2003, respectively. He is currently pursuing the PhD degree in the Department of Electrical Engineering and Computer Science at KAIST. His research interests include VLSI design, hardware/software cosimulation, on-chip communication, and multimedia system architecture.
Moo-Kyoung Chung received the BS degree in Electrical Engineering from Korea University, Korea, in 1999, and the MS and PhD degrees in Electrical Engineering and Computer Science from the Korea Advanced Institute of Science and Technology (KAIST), Korea, in 2001 and 2006, respectively. After graduating from KAIST, he worked at Dynalith Systems, Korea, from January 2006 to June 2007. In July 2007, he joined the System Semiconductor Research Department at the Electronics and Telecommunications Research Institute (ETRI), where he is now a senior researcher. His research interests include VLSI design, multiprocessors, multimedia, and design automation.
Sung-Mo Park received the BS, MS, and PhD degrees in electronics engineering from Kyungpook National University, Taegu, Korea, in 1985, 1987, and 2006, respectively. From 1987 to 1992, he was with LG Semiconductor, Gumi, Korea, where he worked on ASIC design and mask ROM design. In 1992, he joined ETRI. He is now the leader of the multimedia processor design team and a professor at the University of Science and Technology. His main research interests are video coding, image compression, multiprocessor design, and low-power SoC architecture design.
Chong-Min Kyung received the B.S. degree in electronic engineering from Seoul National University, Korea, in 1975, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 1977 and 1981, respectively. After graduating from KAIST, he worked at AT&T Bell Laboratories, Murray Hill, NJ, from April 1981 to January 1983 in the area of semiconductor device and process simulation. In February 1983, he joined the Department of Electrical Engineering at KAIST, where he is now a Professor. His current research interests include microprocessor/DSP architecture, chip design, and verification methodology. He received the Most Excellent Design Award and the Special Feature Award in the University Design Contest at ASP-DAC 1997 and 1998, respectively. He received the Best Paper Award at the 36th Design Automation Conference (DAC), New Orleans, LA, in June 1999; the 10th International Conference on Signal Processing Application and Technology (ICSPAT), Orlando, FL, in November 1999; and the International Conference on Computer Design (ICCD), Austin, TX, in October 1999. He is a Fellow of the IEEE.