
Efficient reconfigurable entropy coder for embedded multi-standards video adaptation

N. Marques, H. Rabah, E. Dabellani, S. Weber
LIEN Laboratory of Instrumentation and Electronics of Nancy, University Henri Poincaré Nancy 1, BP 70239, 54506 Vandoeuvre les Nancy Cedex, France
{Nicolas.Marques, Hassan.Rabah, Eric.Dabellani, Serge.Weber}@lien.uhp-nancy.fr

Abstract

This paper presents an efficient reconfigurable entropy encoder for real-time adaptation of multi-standard compressed video streams. The proposed embedded architecture is based on partial and dynamic reconfiguration. Static and dynamic wrappers are defined to encapsulate different types of entropy coders, with efficient swapping capabilities for context saving and restoring. The wrappers are designed to reduce the area overhead. A partitioning of the reconfigurable area is also presented, aiming at optimizing the reconfiguration overhead. The results show a gain of up to 40% in silicon area compared to a solution without reconfiguration, and a gain of 23% in reconfiguration time. Thanks to the reconfigurable entropy coder, MPEG-2 and H.264 video streams can be processed within their real-time constraints using dynamic reconfiguration.

1. Introduction

The diversity of video compression standards and terminals requires the investigation of efficient and flexible architectural solutions for video adaptation. This adaptation can be performed in a terminal equipped with a universal decoder, or in a home gateway by adequate transcoding. Video bitstream decoding and reconstruction involve entropy decoding and encoding, which differ from one standard to another. For example, MPEG-2 uses Variable Length Coding (VLC), whereas H.264 can use Context Adaptive Variable Length Coding (CAVLC) or Context Adaptive Binary Arithmetic Coding (CABAC). The simultaneous presence of all these coders and decoders on a chip leads to significant costs in silicon area and power consumption. This issue is drawing more and more researchers in academia and industry.

The decoder is the aspect most often dealt with, aiming particularly at a universal decoder. Chang et al. [1] have proposed a solution based on the analysis of the video stream and its dispatching to the adequate entropy decoder. However, the decoders are instantiated simultaneously on chip as accelerators, which is inefficient in terms of area and flexibility. Peng et al. proposed in [2] a VLSI architecture for a programmable parser. In this architecture, a RISC processor is used as an accelerator for functions shared between the MPEG-2 and H.264 standards. This software solution is also used by Wu et al. in [3] to analyze the bitstream and dispatch data to the entropy decoders, which are used as hardware accelerators. An interesting method to adapt the decoder is proposed by Bystrom et al. in [4]. This method consists in replacing decoding functions by transmitting the adequate configuration to the decoder, which must be reconfigurable. Lo et al. in [5] have proposed a reconfigurable VLSI architecture for an H.264 decoder in order to cover all possible profiles of this standard, mainly allowing the implementation of CAVLC or CABAC. This architecture is based on a coarse-grained reconfigurable area in which similar and shared functions between CABAC and CAVLC are reconfigured, particularly Exp-Golomb. Reconfiguration yields a 25.4% area gain for CAVLC, but only a modest overall gain of 6% when CAVLC and CABAC are considered together. The existing solutions aim at bringing flexibility to the decoder by simultaneous implementation of entropy decoders, software programmability or hardware reconfiguration. The gain obtained in terms of area and flexibility remains limited due to the complexity and irregularity of these decoders.

In this paper we focus on the encoder part. Our contribution relies on the partial reconfiguration of a Field Programmable Gate Array (FPGA). Partially reconfigurable systems are a promising alternative to address this problem thanks to their performance and flexibility. However, due to the diversity of computation models, several issues such as dynamic switching or relocation of hardware tasks remain open research topics. We propose a solution allowing a significant reduction in area, reconfiguration time and configuration data size, together with improved flexibility and context switching.

The rest of the paper is structured as follows. Section 2 gives an overview of transcoding with a focus on encoding for the MPEG-2 and H.264 standards. The proposed architecture, a detailed study and the results obtained from the different implementations are presented in sections 3 and 4. Finally, section 5 gives the concluding remarks.

2. Overview

2.1. Video transcoder

Numerous studies have been published on video transcoding and adaptation. In [6], transcoding is classified into two categories. Heterogeneous transcoding converts between two different standards, such as MPEG-2 to H.264. Homogeneous transcoding makes adjustments within the same standard. In both cases, the conversion may involve bit rate, resolution or frame rate. The transcoding chain can be divided into three main parts: bitstream analyzer, pixel level adaptation and bitstream generation (figure 1). To be relevant for real-time applications, the transcoding procedure should operate on the compressed bitstream and generate a new bitstream. The generation of the adapted bitstream requires information from the target decoder and a reconfigurable entropy encoder. The proposed bitstream generator is designed to support VLC for the MPEG-2 standard or CAVLC for the H.264 standard, and extends the work presented in [7]. Data flow graphs for the CAVLC and VLC IP cores are shown in figure 2.

Figure 1: Transcoder block diagram

2.2. CAVLC

Figure 2.a shows the data flow graph of a CAVLC encoder, which is used in the H.264 standard for encoding transformed and quantized residual coefficients of 4x4 blocks scanned in zigzag order. The CAVLC encoder can be partitioned into three phases: preprocessing, syntax element encoding and bitstream formation. The preprocessing extracts the information required for encoding the syntax elements. Five different types of syntax elements are encoded using look-up tables. The encoded syntax elements are concatenated and transmitted in byte format. The main complexity of CAVLC comes from the context-adaptive encoding of coeff_token and levels. Five different VLC tables are available for coeff_token encoding, and the choice of table depends on the number of nonzero coefficients in the neighboring left and top blocks. This data dependency requires a large memory to store the number of nonzero coefficients for high quality video encoding. The bitstream formation consists of assembling and concatenating the variable size code words before organizing them in byte format.

2.3. VLC

Figure 2.b shows the data flow graph of a VLC encoder, which is used in the MPEG-2 standard for encoding DCT transformed and quantized residual coefficients of 8x8 blocks scanned in zigzag order. Run-Level coding and Huffman coding are applied to the quantized DCT coefficients. Run-Level refers to a run-length of zeros followed by a non-zero level. The Huffman coder is an entropy coder that is optimum in the sense that it achieves the shortest possible average code word length for a source [9]. The entropy coder uses a value conversion table to generate the variable length code word, and a length conversion table to generate the length of that code word. To generate the final bitstream, the code words are organized in byte format.
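The Run-Level step used by the VLC encoder (section 2.3) can be illustrated with a short Python sketch. It assumes the coefficients have already been zigzag-scanned into a flat list; the function name `run_level_pairs` is hypothetical, and the Huffman table lookup that follows in a real encoder is not shown.

```python
def run_level_pairs(scanned):
    """Turn zigzag-scanned coefficients into (run, level) pairs.

    run   = number of zeros preceding a nonzero coefficient
    level = value of that nonzero coefficient
    """
    pairs = []
    run = 0
    for coeff in scanned:
        if coeff == 0:
            run += 1                    # extend the current run of zeros
        else:
            pairs.append((run, coeff))  # close the run with its level
            run = 0
    # Trailing zeros produce no pair; an encoder signals them
    # with an end-of-block code instead.
    return pairs
```

For the scanned sequence [5, 0, 0, -2, 1, 0, 0, 0] the sketch yields the pairs (0, 5), (2, -2) and (0, 1). Each pair would then be looked up in the code word and length tables described above.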

Figure 2: Entropy coders: CAVLC (a), VLC (b); Influence of the swap on the outgoing bitstream (c).
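The context-adaptive choice of the coeff_token table described in section 2.2 can be sketched as follows. The thresholds follow the selection rule of the H.264 standard as it is commonly described; they are not given in this paper, so they should be read as an assumption, and the function names are hypothetical.

```python
def predicted_nc(n_left, n_top):
    """Predict nC from the nonzero-coefficient counts of the neighbors.

    n_left / n_top are the counts of the left and top blocks,
    or None when the neighbor is not available.
    """
    if n_left is not None and n_top is not None:
        return (n_left + n_top + 1) >> 1   # rounded average
    if n_left is not None:
        return n_left
    if n_top is not None:
        return n_top
    return 0


def coeff_token_table(nc):
    """Return the index of the coeff_token table selected by nC.

    Tables 0..2 are variable length, table 3 is a fixed-length code.
    The fifth table (chroma DC) is selected by block type rather than
    by the neighbors and is not modelled in this sketch.
    """
    if nc < 2:
        return 0
    if nc < 4:
        return 1
    if nc < 8:
        return 2
    return 3
```

With 3 and 4 nonzero coefficients in the left and top neighbors, the predicted value is 4 and table 2 is selected. The per-block counts that feed `predicted_nc` are the data that the text says must be kept in memory.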

2.4. Context data

In both CAVLC and VLC, a variable length code word is generated for each coefficient block. During an interrupt, this part of the encoder can be managed with no particular complexity in context switching. However, generating the bitstream in a regular byte format requires special circuitry to handle the break caused by the interruption: the variable length code words must be manipulated carefully to avoid any corruption of the generated bitstream. Figure 2.c shows the complexity of context saving and restoring when switching between the two encoders is required.
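The reason byte formation needs saved context can be illustrated with a minimal byte packer. This is a sketch under assumptions: the class `BytePacker` and its methods are hypothetical and do not describe the actual tobyte IP. The point is that, at any interruption, a few bits that do not yet fill a byte are pending and must be saved and restored.

```python
class BytePacker:
    """Packs variable-length code words (MSB first) into bytes."""

    def __init__(self, acc=0, nbits=0):
        self.acc = acc      # bits not yet emitted as a complete byte
        self.nbits = nbits  # number of valid bits in acc (< 8 between calls)

    def put(self, code, length):
        """Append the `length` low bits of `code`; return the bytes completed."""
        self.acc = (self.acc << length) | (code & ((1 << length) - 1))
        self.nbits += length
        out = bytearray()
        while self.nbits >= 8:
            self.nbits -= 8
            out.append((self.acc >> self.nbits) & 0xFF)
        self.acc &= (1 << self.nbits) - 1   # keep only the pending bits
        return bytes(out)

    def save(self):
        """Context to store before a swap: the pending bits and their count."""
        return (self.acc, self.nbits)

    @classmethod
    def restore(cls, context):
        """Rebuild a packer from a saved context after the IP is reloaded."""
        return cls(*context)
```

If the packer is interrupted after a code word that leaves, say, one pending bit, restoring `(acc, nbits)` lets the next code word continue the same byte, so the resulting stream is identical to the uninterrupted one.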


3. Proposed reconfigurable entropy coder

In this section we give the details of the proposed reconfigurable architecture for the entropy coder. A dynamic wrapper is used to encapsulate the IP cores that will be placed in a reconfigurable area. A static wrapper is defined to allow the control of the IP cores, the context management when swapping IP cores, and the communication with the rest of the system architecture.

3.1. Generic Wrapper

The functional blocks composing the video transcoder process data in a streaming way. Therefore, the communication between IP cores can be performed through FIFOs for both the MPEG-2 and H.264 codecs. Thus, for the entropy coders, a generic and static wrapper can be used to encapsulate the VLC and CAVLC IP cores. Any difference between the two IP cores is included in a dynamic wrapper specific to each IP, which adapts it to the static wrapper.

An in-depth analysis of the CAVLC and VLC IP cores shows important differences in terms of processing and some similarities in data manipulation and access (table 1). These IPs process data blocks (4x4 for CAVLC and 8x8 for VLC) and output compressed data in byte format. Control signals are also used, particularly to manage the incoming data and the output bitstream. The similarities between the two IP cores are exploited, and a generic static wrapper was designed to encapsulate the two IPs. The static wrapper is composed of an adaptable buffer capable of handling one 8x8 coefficient block or a set of 4x4 coefficient blocks. A controller manages the data transfer between the FIFO, the IP core and the rest of the system. The controller generates and manages the handshake signals for data communication and the configuration state of the reconfigurable area.

The control module is also in charge of context saving and restoration. Before any allowable reconfiguration, the context, which corresponds to several registers, is saved in the local context memory and transmitted to the external system memory if necessary. The interruption signal starts the swapping procedure. In this procedure the interruption state of the IP is checked; if the IP is not interruptible, the controller waits for the processing of the current block to terminate. The saved context consists of the index of the last block processed and the data necessary to reconstruct the compressed bitstream. Once this is completed, the different areas are reconfigured. Then the controller restores the context of the loaded IP and resumes execution. The static wrapper communicates with the IP cores through a dynamic wrapper specific to each IP core. The dynamic wrapper includes an interface whose complexity depends on the complexity of the data transfer. The goal is to simplify and generalize the design of the static wrapper. Figure 3 shows the dynamic wrappers for CAVLC and VLC, depicting the specific signals and the common signals.

Codec | H264 | Mpeg2
Ip encoder | CAVLC | VLC
Input | 4*4 | 8*8
Output | Assembled Bitstream | Assembled Bitstream
Main tables | Coefftoken 261 values | DC 13 values
 | Ct_length 261 values | AC 13 values
 | Totalzeros 144 values | DC_chro 13 values
 | Ct_Tz 144 values | AC_chro 13 values
 | Runbefore 27 values |
Fmax – F used | 160 MHz - 100 MHz | 150 MHz - 100 MHz
Execution time/block | 190 ns / 4*4 block | 1.17 µs / 8*8 block
Ip Generator | Tobyte_cavlc | Tobyte_vlc
Input | Assembled Bitstream | Assembled Bitstream
Output | H264 variable bitstream | Mpeg2 variable bitstream

Table 1: CAVLC and VLC characteristics

Figure 3: Simplified view of IP cores wrapper for partial reconfiguration showing static and dynamic wrapper.
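The swapping procedure carried out by the control module of the static wrapper (section 3.1) can be sketched in software as follows. This is a behavioral model under assumptions, not the hardware design: the classes `Context`, `EntropyIP` and `SwapController`, and the layout of the saved context, are hypothetical, and the reconfiguration itself is reduced to a trace entry.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """State saved across a swap (hypothetical layout)."""
    last_block: int = -1     # index of the last block fully processed
    pending_bits: str = ""   # bits not yet emitted as complete bytes


@dataclass
class EntropyIP:
    """Stand-in for a VLC or CAVLC core loaded in the reconfigurable area."""
    name: str
    ctx: Context = field(default_factory=Context)
    busy: bool = False       # True while a block is being processed

    def finish_current_block(self):
        # A real core would complete the block; the sketch only updates state.
        self.ctx.last_block += 1
        self.busy = False


class SwapController:
    """Hypothetical model of the control module of the static wrapper."""

    def __init__(self, ip_names):
        # Local context memory: one saved context per IP core.
        self.context_memory = {name: Context() for name in ip_names}
        self.trace = []      # records the order of the swap steps

    def swap(self, current, next_name):
        if current is not None:
            if current.busy:  # not interruptible: wait for the block to end
                current.finish_current_block()
                self.trace.append("wait")
            self.context_memory[current.name] = current.ctx
            self.trace.append("save " + current.name)
        self.trace.append("reconfigure " + next_name)
        loaded = EntropyIP(next_name, ctx=self.context_memory[next_name])
        self.trace.append("restore " + next_name)
        return loaded
```

Swapping from a busy CAVLC core to VLC and back records the steps wait, save, reconfigure and restore in that order, and the CAVLC core resumes with the block index and pending bits it had when it was interrupted.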

3.2. Reconfigurable area

In this study we target Xilinx FPGAs because of their partial reconfiguration capability. To achieve partial reconfiguration, a reconfigurable area must be defined in the design phase. The shape and location of the reconfigurable area depend on the resources required by the IP cores to be placed in it. The size of the reconfigurable area is chosen based on the largest IP core, so that every IP core has enough space (slices). For this study, the partitioning of the reconfigurable area is performed using the Xilinx design flow for dynamic and partial reconfiguration based on PlanAhead. The partitioning problem is outside the scope of this paper. However, in order to find an optimal partitioning and avoid area over-sizing, we explored rectangular and polygonal shaped reconfigurable areas. Detailed studies and a generalized method under investigation will allow the automatic search for an optimal reconfigurable area presented in [8] to be extended to take different shapes into account.

3.3. System architecture

The target system architecture for the validation of the reconfigurable entropy coder is shown in figure 4. This architecture is based on a PLB bus from Xilinx running at 100 MHz. The components connected to this bus are mainly used to control and manage the reconfigurable area. Bitstreams for CAVLC and VLC are generated in the design phase and stored in flash memory managed by the System ACE. The MicroBlaze processor reads the adequate bitstream from the compact flash and sends it to the HW_ICAP (Hardware Internal Configuration Access Port), which configures the reconfigurable area.

Figure 4: System Architecture

4. Test, implementation and discussion

In this section the implementation results are presented and the different implementations are compared. The implementation presented here uses only one reconfigurable area. This implementation is bound to evolve to obtain a system identical to the architecture in figure 4. The CAVLC and tobyte IPs form a single IP; the same holds for VLC and tobyte. Finally, we present a multi-stream video application.

4.1. Implementation results

To implement the proposed architecture, we targeted the Virtex-5 FPGA using the ISE 12.1 tool chain from Xilinx. The partitioning of the reconfigurable area was performed with the PlanAhead tool, which allows the automatic insertion of slice macros to interface static and dynamic regions. The first implementation aims at evaluating the impact of the shape of the reconfigurable area on the efficiency of resource utilization. For this purpose, rectangular (figure 5.a) and polygonal (figure 5.b) partitions are investigated. The results in terms of resource utilization (LUT, Slice L, Slice M and BRAM) are shown in table 2 and compared to a static implementation.

Figure 5: Implementation of IP cores with PlanAhead: rectangular partition (a), polygonal partition (b). (Virtex 5 – xc5vlx50t)

The partitioning was performed so that the reconfigurable area holds the necessary resources to implement the VLC or CAVLC IP cores. Table 2 shows the gain in resource utilization (25% to 40%) compared to a static implementation where the two IP cores are present in the circuit. Table 2 also shows that the rectangular shape is not optimal in terms of BRAM utilization (16% of BRAM are not utilized).

Resources | Static | Reconfigurable, rectangle | Reconfigurable, polygon | Optimization, rectangle | Optimization, polygon
LUT | 7398 | 5598 | 5598 | 25% | 25%
Slice L | 1608 | 973 | 973 | 40% | 40%
Slice M | 772 | 502 | 502 | 35% | 35%
Bram36 | 25 | 29 | 25 | -16% | 0%

Table 2: Resource utilization

The area efficiency is defined as the ratio between the resources used by an IP core and the total available resources of the reconfigurable area. Table 3 shows the area efficiency for the CAVLC and VLC cores implemented in the same reconfigurable area. The maximum efficiency is obtained for the polygon shaped area implementing the VLC core (94%). However, as the partitioning was performed so that enough resources are available for the biggest IP core, a small IP core implemented in this area will imply a lower area efficiency. This is the case when CAVLC is implemented in either the rectangular or the polygonal shape; the area efficiency is 32%.

Resources | % use CAVLC, rectangle | % use CAVLC, polygon | % use VLC, rectangle | % use VLC, polygon
LUT | 47% | 47% | 91% | 91%
Slice L&M | 49% | 49% | 91% | 91%
BRAM | 0% | 0% | 50% | 100%
Total | 32% | 32% | 78% | 94%

Table 3: Area efficiency of reconfigurable region

Adding static and dynamic wrappers to allow reconfiguration may cause a significant overhead on the physical resources of the FPGA. We have optimized the wrapper to reduce this overhead. Table 4 shows the overhead of the wrapper compared to the VLC and CAVLC IPs. The average overhead of the wrapper on the two IPs is 7.5%. This overhead is offset by the fact that only one IP is instantiated in the system.

Resources | H264 Cavlc | H264 ToByte | Mpeg2 Vlc | Mpeg2 ToByte | Wrapper (Control/FIFO/Swap) | Overhead Cavlc/wrapper | Overhead Vlc/wrapper
LUT | 1043 | 147 | 2219 | 78 | 90 | 8.7% | 4%
Slice | 262 | 38 | 556 | 33 | 23 | 8.5% | 5.2%
Bram36 | 0 | 0 | 4 | 0 | 0 | 0% | 0%

Table 4: Wrapper overhead

The reconfiguration time is also impacted by the size and shape of the partition. The Virtex FPGAs from Xilinx are reconfigured by frames. A CLB frame (160 LUTs and 80 slices) is equivalent to a bitstream of 5904 bytes. Knowing that a Xilinx FPGA can be reconfigured at up to 50 MB/s, it takes 118 µs to reconfigure a CLB frame. For a BRAM frame the size is 24912 bytes, requiring 863 µs to configure. Thus, for the rectangular partition, the reconfiguration time is 3.85 ms. For the polygonal partition, a gain of 23% is obtained (table 5).

 | Rectangular partition | Polygonal partition
Frame | 20 (18 CLB; 2 Bram) | 19 (18 CLB; 1 Bram)
Reconfiguration time | 3.85 ms (18*118 µs + 2*863 µs) | 2.98 ms (18*118 µs + 1*863 µs)

Table 5: Reconfiguration time evaluation

4.2. Real Time Multi Stream Application

A video stream can be composed of several videos that use different codecs. The entropy encoder must be able to handle multiple streams in real time. For this, the encoder must attain a throughput of 30 frames per second for high-definition (HD) video and 25 frames per second for standard-definition (SD) video. Therefore an image must be processed in less than 33.3 ms for HD and in less than 40 ms for SD. We must therefore determine the processing time for an image encoded with VLC and CAVLC (table 6). To determine the processing time, it is necessary to know the size of the image. Take the example of an HD stream at 1440x1080 @ 30 fps: each image consists of 97,200 blocks (1440 * 1080 / (4 * 4)), so the execution time for an image is 97200 * 190 ns = 18.47 ms (190 ns is the execution time per block of CAVLC, table 1). Table 6 presents the most common cases and gives their processing times.

Video format | Codec | Number of blocks | Processing time for one frame (ms)
HD_1 (max) 1920x1080@30fps | H.264 - CAVLC | 129600 blocks 4*4 | 24.63 ms
HD_2 1440x1080@30fps | H.264 - CAVLC | 97200 blocks 4*4 | 18.47 ms
SD_1 (max) 720x576@25fps | H.264 - CAVLC | 25920 blocks 4*4 | 4.93 ms
SD_2 720x576@25fps | MPEG-2 - VLC | 6480 blocks 8*8 | 7.6 ms

Table 6: CAVLC and VLC treatment time

For a multi-stream video, the encoder must be capable of processing an image of each stream while respecting the constraint of 40 ms or 33.3 ms to meet real time. Take the example of a flow composed of two videos. The first video is of type HD_2 and uses CAVLC; the second is of type SD_2 and uses VLC. The most critical constraint is that of HD_2: the encoder must process one image of each stream in less than 33.3 ms. The constraint is given in (1), with respect to the reconfiguration time ($t_r$) and execution time ($t_e$) of each IP. We can see that this constraint is met (2).

$(t_r + t_e)_{VLC} + (t_r + t_e)_{CAVLC} < \text{constraint}$ (1)

$2.98 + 18.47 + 2.98 + 7.6 = 32.03\ \text{ms}$ (2)

Table 7 gives the execution times of different scenarios using equation (1). The column Multi streams gives the different scenarios, which refer to Table 6. Note that for the moment the system fails to process two streams of HD type. However, it is possible to process a multi-stream composed of two videos (HD and SD with maximum resolution). We also note that it is possible to process eight identical SD streams of type H.264 or five MPEG-2 streams. The system is also capable of processing four different SD streams. These results demonstrate the correct functionality of the CAVLC and VLC IP cores. The wrappers and the dynamic reconfiguration of the proposed reconfigurable entropy coder allow compliance with the real-time constraints of multi-stream video transcoding. For the four scenarios (table 7) that do not meet the time constraint, there are several solutions. One can increase the frequency of the VLC and CAVLC IPs to reduce the execution time for an image. Another solution would be to use bitstream compression techniques to reduce the reconfiguration time.

Multi streams: HD_1 HD_1 | HD_1 HD_2 | HD_1 SD_1 | HD_1 SD_2 | HD_2 HD_2 | HD_2 SD_1 | HD_2 SD_2 | SD_1 *8 streams | SD_2 *5 streams | SD_1 SD_2 SD_1 SD_2
Constraint: 30 fps
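The timing budget of section 4.2 can be reproduced with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, the per-frame reconfiguration times (118 µs and 863 µs) and per-block execution times (190 ns and 1.17 µs) are taken as stated in the text, and one reconfiguration per stream is counted, as in equation (1). Because the text rounds intermediate values, results may differ from the printed figures in the last digit.

```python
CLB_FRAME_US = 118    # time to reconfigure one CLB frame, as stated in the text
BRAM_FRAME_US = 863   # time to reconfigure one BRAM frame, as stated in the text


def reconfig_time_ms(clb_frames, bram_frames):
    """Reconfiguration time t_r of a partition, in milliseconds."""
    return (clb_frames * CLB_FRAME_US + bram_frames * BRAM_FRAME_US) / 1000.0


def frame_time_ms(width, height, block_size, ns_per_block):
    """Execution time t_e for one image, in milliseconds."""
    blocks = (width * height) // (block_size * block_size)
    return blocks * ns_per_block / 1e6


def total_time_ms(exec_times_ms, t_r_ms):
    """Left-hand side of (1): one (t_r + t_e) term per stream."""
    return sum(t_r_ms + t_e for t_e in exec_times_ms)


t_r = reconfig_time_ms(18, 1)                # polygonal partition
t_hd2 = frame_time_ms(1440, 1080, 4, 190)    # HD_2 encoded with CAVLC
t_sd2 = frame_time_ms(720, 576, 8, 1170)     # SD_2 encoded with VLC
total = total_time_ms([t_hd2, t_sd2], t_r)
print(f"t_r = {t_r:.2f} ms, total = {total:.2f} ms, real time: {total < 33.3}")
```

The same three functions can be applied to the other scenarios of the Multi streams column by changing the list passed to `total_time_ms` and the constraint (33.3 ms at 30 fps, 40 ms at 25 fps).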
