Dynamically Reconfigurable Entropy Coder for Multi-Standard Video Adaptation Using FaRM

F. Duhem (b,*), N. Marques (a,*), F. Muller (b), H. Rabah (a), S. Weber (a), P. Lorenzini (b)

(a) University Henri Poincaré Nancy 1, LIEN, BP 239 - 54506 Vandoeuvre-lès-Nancy Cedex, France
(b) University of Nice-Sophia Antipolis - LEAT/CNRS, Bât. 4, 250 rue Albert Einstein, 06560 Valbonne, France
Abstract

Dynamic and Partial Reconfiguration (DPR) is a feature of modern Xilinx FPGAs that brings flexibility to a whole new level. However, it is not yet widespread in industry because of poor performance and the lack of a cost model for estimating a solution early in the design process. In this paper, we present our methodology for developing systems capable of dynamic and partial reconfiguration under strict real-time constraints. Our approach is based on FaRM (Fast Reconfiguration Manager), a high-speed controller reaching the theoretical throughput of the configuration port in Xilinx FPGAs. FaRM performance is estimated using a cost model, which allows us to determine the optimum FIFO size to satisfy timing constraints with the best resource trade-off. We validate our approach with a video application that should be able to encode an H.264 (HD) and an MPEG-2 (SD) stream at the same time. To this end, we use two entropy encoders on the same reconfigurable zone while satisfying the constraints set by the video streams. This is the first step towards a fully reconfigurable video adaptation system. We also present our unified reconfigurable zone interfaces, specific to video adaptation.

Keywords: Reconfiguration Manager, FPGA, Dynamic and Partial Reconfiguration, Reconfigurable entropy coder
This document is a collaborative effort.
(*) Corresponding author. Email addresses: [email protected] (F. Duhem), [email protected] (N. Marques)
Preprint submitted to Microprocessors and Microsystems, July 13, 2012
1. Introduction

Given the diversity of video compression standards and terminals, there is a need for efficient and flexible architectural solutions for video adaptation. Video adaptation generally entails decoding followed by encoding. Heterogeneous MPEG-2/H.264 transcoding involves several complex functions such as spatial transforms (DCT, IDCT, IT), motion estimation and compensation, and entropy coding and decoding (VLC, Variable Length Coding; CAVLC, Context Adaptive Variable Length Coding) [1]. The complexity of these functions and the real-time constraints require hardware implementations. Moreover, the diversity of transcoding scenarios requires implementing multiple data paths, each corresponding to a combination of the above functional units, which leads to additional hardware costs. These costs can be reduced considerably by using the Field Programmable Gate Array (FPGA) feature called Dynamic and Partial Reconfiguration (DPR) [2]. Introduced in the late 90's with the XC6200 series from Xilinx, DPR allows the behaviour of Reconfigurable Zones (RZ) in the FPGA to change while the remaining logic is still running. Therefore, DPR can be used to change the functionality of a system or to save FPGA resources and power [3, 4].

In a dynamically reconfigurable architecture, aspects such as hardware resource sharing, hardware task swapping, context saving and restoration must be taken into account to obtain a flexible system. However, this flexibility comes at the expense of performance degradation caused by reconfiguration overhead [5, 6]. An efficient and flexible architecture handling multiple transcoding scenarios must therefore involve minimal reconfiguration overhead. However, Xilinx's controller, xps_hwicap, performs well below the configuration port theoretical throughput of 400 MB/s and is not suitable for such applications.
Therefore, reducing the reconfiguration overhead becomes one of the main concerns in partially reconfigurable systems. FaRM (Fast Reconfiguration Manager) [7] overcomes this issue. Indeed, its architecture allows the theoretical throughput of 400 MB/s to be reached, making the configuration time almost invisible to the user. Moreover, FaRM comes with an accurate cost model, providing the designer with reconfiguration overhead estimations for the different operating modes. The designer can therefore study the feasibility of a DPR-based solution during the early stages of development.
In this paper, we present an adaptive entropy encoder for a multi-standard video adaptation platform that takes full advantage of the FaRM features and cost model. Starting from an HDL implementation of both entropy coders (VLC and CAVLC), we determine an architecture with the best trade-off between performance and resource overhead that meets high-quality video streaming requirements. We also introduce a generic wrapper for these IPs, unifying the interfaces that enable partial reconfigurability.

The paper is structured as follows: in Section 2, we discuss works related to partial reconfiguration and video adaptation. Section 3 presents some basic reconfiguration concepts and introduces FaRM. Section 4 presents the application, while Section 5 explains the system architecture and its design. Our results are presented in Section 6 and Section 7 outlines future improvements. Finally, we conclude in Section 8.

2. Related works

2.1. Partial reconfiguration

Much work has been done to improve DPR performance, either by using bitstream compression or by optimizing the architecture. In [8], the authors discuss different compression techniques, from their own algorithm to state-of-the-art algorithms such as ZIP. The work presented in [9] is more exhaustive, also taking resource overhead and throughput into account when choosing compression algorithms. Indeed, throughput is not the only important metric: we consider that area overhead has to be minimal for the reconfiguration controller not to be invasive. Another approach to bitstream compression is introduced in [10], where the authors rely on their own placement method and on a careful study of the physical configuration level in order to reduce their bitstream size by up to 30% on a Virtex-II Pro. A similar approach, composed of two steps, is introduced in [11]. The first step removes data redundancy in the bitstream, which is then compressed using an arithmetic coder during the second step.
With this algorithm, bitstream size may be reduced by a factor of 4.26, while a ZIP algorithm only reaches a factor of 3.3. However, this approach is not suitable for high-performance systems, as the decompression algorithm takes about 1.9 seconds on an ARM7 processor running at 206 MHz. An interesting approach is described in [12], where the authors not only compress the bitstreams but also use similarities between the current configuration and the new one to reduce reconfiguration overhead. This technique is known as inter-configuration compression. When maximizing reuse between IPs, the authors improved compression ratios up to 75%.

In [13], the authors present a bitstream repository hierarchy for partially reconfigurable systems that reduces the need for local memory. Considering only the first level of the hierarchy, i.e. when the bitstreams are stored locally, they introduce an architecture based on a DMA writing bitstreams to the configuration port through an OPB bus. A similar approach is presented in [14]. In this work, the authors compare several architectures, from Xilinx controllers to their own high-speed controller, which uses BRAM as a local cache memory. Even though they come close to the maximum theoretical throughput, the IP requires a large amount of BRAM (nearly half the BRAM resources of a Virtex-4 FX20 FPGA).

Another complete dynamically reconfigurable platform, targeting streaming applications, is described in [15]. This work makes use of a Predictive Configuration Manager (PCM) [16], which predicts upcoming reconfigurations in order to prefetch configuration data. Part of the reconfiguration process is done in the background while the current task is still running, thus reducing the overall reconfiguration time overhead. The authors of [17] use virtual configurations to decrease reconfiguration time. This method consists in implementing a background context in addition to the running foreground context. DPR may modify the background context while the foreground context is still running, thus reducing reconfiguration overhead. The major drawback of this method appears when working with a mono-context FPGA: there have to be as many RZs as configuration contexts, resulting in a significant size overhead.

Some of the methods described here have been used inside FaRM to reduce reconfiguration time overhead [7], reaching a throughput of 800 MB/s.
However, FaRM is not the fastest reconfiguration controller, even though it surpasses existing controllers in terms of functionality (e.g. an easy-to-use and efficient readback capability, a standalone API and a Linux driver [18]). For instance, the authors of [19] reach a throughput of 1200 MB/s by overclocking the ICAP to a frequency of 300 MHz. UPaRC [20], with an approach very similar to FaRM, reaches a throughput of 1433 MB/s and is, as far as we know, the fastest controller to date.

2.2. Video adaptation

Video adaptation can be performed by transcoding a stream. Transcoding is the conversion of data from one encoding to another [21]. This is usually done when a target device does not support the format, or to convert incompatible or obsolete data to a better-supported or more modern format. Transcoding refers to a two-step process in which the original stream is decoded to an uncompressed format, which is then encoded into the target format. It is generally classified into two categories. The first one is heterogeneous transcoding, which deals with two different standards, for example transcoding between H.264 and MPEG-2. The second one is homogeneous transcoding, which performs adaptations within the same standard, such as resolution reduction or bit rate and frame rate modification [22]. The decoder and the encoder are the parts most often addressed, particularly with the aim of building a universal transcoder.

A video encoder is composed of several functions such as quantization, transform, their inverses, intra and inter prediction, and the entropy coder. Among all these functions, entropy encoding is one of the largest in size and complexity. Many research works aim at improving the entropy coder (VLC and CAVLC). Chang et al. [23] have proposed a solution based on the analysis of the video stream and its dispatching to an adequate entropy decoder. However, this solution is parallel, i.e. the entropy decoders are instantiated simultaneously on chip as accelerators, which is inefficient in terms of area and flexibility. An interesting method to adapt the decoder is proposed by Bystrom et al. in [24]. This method consists in replacing decoding functions by transmitting the adequate configuration to the decoder, which must be reconfigurable. In [25], Lo et al. have proposed a reconfigurable VLSI (Very Large Scale Integration) architecture for an H.264 decoder in order to cover all the possible profiles of this standard, mainly allowing the implementation of CAVLC or CABAC (Context-Adaptive Binary Arithmetic Coding). This architecture is based on a coarse-grained reconfigurable area, in which functions similar in or shared between CABAC and CAVLC are reconfigured, particularly the Exp-Golomb coding. The optimization brought by reconfiguration is a 25.4% gain in CAVLC area, but only a modest overall gain of 6% when both CAVLC and CABAC are considered. In [26], the authors present an architecture for a programmable encoder. In this architecture, a Reduced Instruction Set Computer (RISC) processor is used as an accelerator for functions shared between the MPEG-2 and H.264 standards.
3. FaRM overview

In Xilinx FPGAs, DPR is performed by writing configuration files, called bitstreams, into the Internal Configuration Access Port (ICAP), whose theoretical upper bound is 400 MB/s (32 bits @ 100 MHz). Therefore, reconfiguration time overhead may be improved either by reducing bitstream size or by optimizing the ICAP controller architecture. FaRM combines these methods in order to provide an efficient DPR feature.

FaRM is built upon a smart master/slave interface that can be plugged into a wrapper (either a PLB or an AXI wrapper), separating control accesses, issued by the scheduler on the slave side, from data accesses on the master side. Master accesses provide FaRM with direct access to the bitstream repository: no external action is required to perform reconfiguration aside from control accesses, performed either by a microprocessor (e.g. MicroBlaze) or by a dedicated hardware IP. FaRM also integrates FIFOs, used to prefetch data when the next configuration is known in advance. This fast memory placed near the ICAP hides most of the reconfiguration process behind the execution of the previous task, greatly reducing configuration overhead. This operating mode is referred to as preload mode. We were able to reach the ICAP theoretical throughput of 400 MB/s when working at a frequency of 100 MHz and even to reach a throughput of 800 MB/s with the ICAP frequency set to 200 MHz. The work reported in [20] showed that it is possible to go further in this direction, reaching a throughput of 1433 MB/s with the ICAP frequency set to 362.5 MHz.
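The throughput figures quoted above follow directly from the port width and the clock frequency. As a minimal sketch (the function names are illustrative and not part of FaRM), the upper bound and the matching best-case write time can be computed as follows:

```python
def icap_peak_mb_per_s(width_bits: int, freq_mhz: float) -> float:
    """Peak ICAP throughput in MB/s: one word of width_bits per clock cycle."""
    return (width_bits / 8) * freq_mhz


def min_write_time_ms(bitstream_bytes: int, width_bits: int = 32,
                      freq_mhz: float = 100.0) -> float:
    """Best-case time (ms) to push a bitstream through the ICAP at peak rate."""
    bytes_per_ms = icap_peak_mb_per_s(width_bits, freq_mhz) * 1000.0
    return bitstream_bytes / bytes_per_ms


if __name__ == "__main__":
    for freq in (100.0, 200.0, 362.5):
        print(f"{freq} MHz: {icap_peak_mb_per_s(32, freq)} MB/s")
```

For a 32-bit port this gives 400 MB/s at 100 MHz and 800 MB/s at 200 MHz. At 362.5 MHz the bound is 1450 MB/s, so the 1433 MB/s reported in [20] sits just below it.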
Figure 1: FaRM architecture. [Block diagram; labels: wrapper with PLB slave (CTRL bus) and PLB master (DATA bus), smart interface, scheduling algorithm, register bank, control FSM, write FSM + RLE decoding, read FSM, FIFO write, FIFO read, ICAP, bitstream storage, irq; FIFO sizes 16-400k x 32 bits and 512 x 32 bits; clocks CTRL_Clk, WRITE_Clk, READBACK_Clk.]
FaRM also handles compressed bitstreams, reducing the time spent on data transfers on the bus. The compression algorithm used is an improvement of the well-known Run-Length Encoding (RLE) scheme, called Offset-RLE (O-RLE). This technique reduces bitstream size by 26.7% on average for representative partial bitstreams using the different kinds of resources available on the FPGA. Bitstream compression is done off-line during the design phase, while decompression is done on-line and on the fly, with a throughput of one word per cycle. We chose to implement O-RLE rather than state-of-the-art and more efficient algorithms like ZIP because we wanted a lightweight solution that would not require too many FPGA resources. The decompression algorithm also has to be simple enough to allow ICAP writing at maximum throughput.

FaRM also provides an accurate cost model for estimating reconfiguration time overhead. It is configurable to accommodate the configuration subsystem architecture and the FaRM operating modes (canonical write operation, preload mode, with or without compression). The cost model provides the designer with an estimate of the reconfiguration time, which is useful for determining key design parameters, as we will show in this paper.

Finally, we mentioned in the related works section that some controllers like UPaRC reach higher throughputs than FaRM. However, the work carried out within the framework of our project targets an industrial use of our solution. From the industrial point of view, it is preferable to respect the datasheet specifications. Therefore, we will not use ICAP overclocking to enhance performance and we will set the ICAP frequency to 100 MHz, even though our controller is able to run at 200 MHz. For any further information on FaRM, please refer to [7].

4. Application

To validate our approach, we chose a video transcoding chain application, targeting a Xilinx xc5vlx50t FPGA. Within the framework of our project, we aim to develop an adaptive system that can handle several standards using partial reconfiguration.
For instance, it should be possible to handle an MPEG-2 and an H.264 stream simultaneously. Let us consider the entropy coder of both standards. The existing solutions in the related works aim at bringing flexibility to the decoder either by implementing the entropy decoders simultaneously, with a static implementation and software programmability, or by hardware reconfiguration. The gain obtained in terms of area and flexibility remains insignificant, due to the complexity and irregularity of the coders. Our contribution relies on the use of the partial reconfiguration of the FPGA. Partially reconfigurable systems are promising alternatives to address this problem, thanks to their performance and flexibility. Reconfiguring between VLC and CAVLC allows a significant reduction in area while improving transcoding flexibility.
Figure 2: Entropy coders: CAVLC (a), VLC (b). [Data flow graphs; panel (a) CAVLC labels: block 4*4, Pre Processing, Coeff Token (table[nC]), Trailing Ones (0=- / 1=+), Level, Total Zero (table), Run Before (table), nC, Ncoeff < maxNcoeff, Ncoeff > 1, more, toBytes, H.264 bitstream; panel (b) VLC labels: block 8*8, Pre Processing, RLE, Huffman TABLE, toBytes, MPEG-2 bitstream.]
To be relevant in a real-time application, the transcoding procedure should take a compressed bitstream as input and generate a new bitstream. The proposed bitstream generator is designed to support VLC for the MPEG-2 standard or CAVLC for the H.264 standard. Fig. 2.a shows the data flow graph of a CAVLC encoder, used in the H.264 standard for the encoding of 4x4 blocks scanned in zigzag order. The CAVLC encoder can be partitioned into three phases: preprocessing, syntax element encoding and bitstream formation. Fig. 2.b shows the data flow graph of a VLC encoder, which is used in the MPEG-2 standard for encoding the DCT-transformed and quantized residual coefficients of 8x8 blocks scanned in zigzag order.

In order to save area resources on the FPGA, we want to handle both standards alternately using only one reconfigurable zone. The system should be able to meet real-time constraints even when treating two streams at the highest resolution, e.g. an H.264 video with a resolution of 1920x1080 @ 30 fps and an MPEG-2 video with a resolution of 720x576 @ 25 fps (see Table 1).

Table 1: CAVLC and VLC information

| Codec - Entropy encoder | H.264 - CAVLC | MPEG-2 - VLC |
|---|---|---|
| Frequency (MHz) | 165 | 150 |
| Frame max resolution | 1920x1080 | 720x576 |
| Number of blocks | 129600 (4x4) | 6480 (8x8) |
| Execution time per block / per frame | 193 ns / 24.7 ms | 1.17 µs / 7.6 ms |

For a multi-stream video, an encoder must be capable of processing an image from each stream within the frame period of that stream (40 ms at 25 fps or 33.3 ms at 30 fps) to meet real time. Let us take the example of a flow composed of two videos. The first video uses CAVLC and the second video uses VLC. The most critical constraint is that of the CAVLC stream. Therefore, the encoder must be capable of processing one image of each stream in less than 33.3 ms. The constraint is given in (1), with respect to the reconfiguration time (t_r) and execution time (t_e) of each IP. Substituting the per-frame execution times of Table 1 leads to (2): the sum of both reconfiguration times should not exceed one millisecond.
\[ (t_r + t_e)_{VLC} + (t_r + t_e)_{CAVLC} < 33.3\ \text{ms} \tag{1} \]

\[ (t_r)_{VLC} + (t_r)_{CAVLC} < 1\ \text{ms} \tag{2} \]
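The step from (1) to (2) is plain arithmetic on the figures of Table 1. The short sketch below reproduces it; the function and constant names are illustrative only, and the frame period and execution times are the values quoted in the text.

```python
def blocks_per_frame(width: int, height: int, block: int) -> int:
    """Number of block x block coefficient blocks in one frame."""
    return (width * height) // (block * block)


FRAME_PERIOD_MS = 33.3                     # most critical stream: 30 fps
EXEC_MS = {"CAVLC": 24.7, "VLC": 7.6}      # per-frame execution times, Table 1


def reconfiguration_budget_ms(period_ms: float, exec_ms: dict) -> float:
    """Time left for both reconfigurations once every frame is encoded."""
    return period_ms - sum(exec_ms.values())


if __name__ == "__main__":
    print(blocks_per_frame(1920, 1080, 4))   # 4x4 blocks in an HD frame
    print(blocks_per_frame(720, 576, 8))     # 8x8 blocks in an SD frame
    print(round(reconfiguration_budget_ms(FRAME_PERIOD_MS, EXEC_MS), 3))
```

The block counts match Table 1 (129600 and 6480), and the remaining budget is the one millisecond that appears on the right-hand side of (2).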
5. System architecture

In this section, we detail the proposed architecture for a reconfigurable entropy coder (CAVLC for H.264 or VLC for MPEG-2) on a single reconfigurable zone using FaRM. In order to unify the RZ interface, a necessary step for achieving partial reconfiguration, we also present our IP wrappers.

5.1. Global architecture

The target system architecture is shown in Fig. 3. This architecture is based on a PLB bus from Xilinx running at 100 MHz. The components connected to this bus are mainly used to control and manage the reconfigurable area. Bitstreams for CAVLC and VLC, compressed or not, are generated during the design phase and stored in a flash memory managed by the SystemACE. They are copied into the DDR memory at the beginning of the execution. The processor (MicroBlaze) configures FaRM, which directly retrieves the bitstream from the DDR and transfers it into the ICAP macro. When using preload mode, the bitstream, or at least the start of it, is transferred into the local FIFO. Once the FIFO is full, the transfer to the ICAP is ready to start.
Figure 3: System architecture and wrapping IP cores for partial reconfiguration. [Block diagram; labels: MicroBlaze, FaRM (Ficap = 100 MHz), UART and terminal, PLB (FPLB = 100 MHz), MPMC, DDR, System ACE, Compact Flash (cavlc.z32, vlc.z32, blank.z32), Chipscope, test coefficients, zigzag scan, static wrapper (control, FIFO), interface to wrapper, reconfigurable area hosting the VLC or CAVLC IP core with its dynamic wrapper.]
5.2. Generic wrapper

The communication between IP cores can be performed through FIFOs for both an MPEG-2 and an H.264 codec. Thus, for the entropy coders, a generic and static wrapper can be used to encapsulate the VLC and CAVLC IP cores (Fig. 3). An in-depth analysis of the CAVLC and VLC IP cores shows important differences in terms of processing, and some similarities in data manipulation and access. These IPs process data blocks (4x4 for CAVLC and 8x8 for VLC) and provide, at their output, compressed data in byte format. Any difference between the two IP cores is absorbed by a generic static wrapper designed to encapsulate the two IPs. The static wrapper is composed of an adaptable buffer capable of handling an 8x8 coefficient block (for VLC) or a set of 4x4 coefficient blocks (for CAVLC). A controller manages the data transfer between the FIFO, the IP core and the rest of the system. The controller generates and manages the handshake signals for data communication and the configuration state of the reconfigurable area. The control module is also in charge of context saving and restoration. The saved context comprises the index of the last processed block and the data necessary to reconstruct the compressed bitstream. Once the context is saved, the reconfigurable area is reconfigured. Then the controller restores the context of the loaded IP and resumes execution.

The static wrapper communicates with the IP cores through a dynamic wrapper specific to each IP core. The dynamic wrapper includes an interface whose complexity depends on the complexity of the data transfer. The goal is to simplify and generalize the design of common signals. The output of both IP cores is displayed using Chipscope, a logic analyzer instantiated in the design, and is compared to ModelSim simulation results. These results demonstrate the correct operation of the CAVLC and VLC IP cores, of their wrappers and of the dynamic reconfiguration of the proposed reconfigurable entropy coder.

5.3. Methodology for determining optimum FIFO size

Figure 4 presents our methodology for determining the optimum FIFO size for FaRM with respect to the application constraints.
Figure 4: Methodology for determining reconfiguration overhead. [Flow diagram; labels: HDL sources, (1) synthesis tool (e.g. Xilinx ISE), synthesis reports, (2) parser, (3) RZ generator, constraint file (.ucf), (4) PlanAhead PR project, bitstream size, compression ratio, (5) cost model, reconfiguration overhead.]
The cost model should be fed with information about the partial bitstream, namely its size and compression ratio. Bitstream size is directly related to the reconfigurable zone hosting the IPs. The size of the reconfigurable area is chosen so that the largest IP core fits, which guarantees enough slices for every IP core. This piece of information is given by the synthesis of each reconfigurable task (step 1 in Fig. 4). Indeed, synthesis is necessary to generate the netlists used during the implementation phase and also provides an estimation of the resources needed by each task. In particular, it generates synthesis reports gathering every important piece of information. These reports are parsed during step 2 to feed the RZ generator (step 3), a tool using task information to determine the optimum RZ size and shape. Using [27, chap. 6], we know that each slice contains four LUTs and four flip-flops, with 40 slices (20 CLBs) per column per clock domain (which is the smallest reconfigurable partition containing CLBs) for Virtex-5 technology. However, using the exact optimum area for the reconfigurable zone causes problems: extra routing may be needed, in which case the router does not succeed and the implementation fails. To avoid this issue, we have chosen to add an extra 5% of resources to the reconfigurable zone, which, in our experience, always solves the problem.

Once the optimum partition size is determined, the partition should be placed on the FPGA. Placement can be very tricky, particularly when handling several partitions. In such cases, manual placement takes a lot of time and should be replaced by a placement algorithm to minimize development effort [28]. In our case, there is only one partition to place, so this can be done manually in order to find a good placement for a reconfigurable zone with the optimum area. Nevertheless, we plan to develop a tool able to perform placement for several partitions.

Now that the partition is defined, we can calculate the length of the resulting bitstream. For each resource, [27, chap. 6] gives us the number of frames actually present in the column per clock domain, a frame containing 41 x 32-bit words in a Virtex-5. For instance, a CLB column is composed of 36 frames, a DSP column of 30 frames and a BRAM column of 30 interconnect frames and 128 content frames. The bitstream also contains a command section approximately 50 words long, which may be neglected most of the time (for reconfigurable partitions larger than one or two columns).

One last piece of information to retrieve before using our cost model is compression. Unfortunately, there is no way to estimate compression with enough accuracy without performing place & route. Indeed, we tried to correlate compression with internal fragmentation (i.e. how the partition is used, given the estimated resources of the task and the resources of the partition). The results were not convincing enough since routing cannot be taken into account. Therefore, the only way to obtain a proper estimation is to generate the partial bitstream from a dummy project composed of the reconfigurable IP only (step 4). Since the routing effort will be concentrated in the reconfigurable partition, this partial bitstream will be very close to the actual one, generated from the entire project, and will provide a good approximation for compression. Finally, the cost model can use the bitstream size and its compression ratio to estimate the reconfiguration overhead in our architecture during step 5.

6. Results and discussions

6.1. Reconfigurable zone definition

We use the methodology defined in Sect. 5.3 in order to estimate the partition size and the reconfiguration overhead. Synthesis of both tasks leads to the resource estimation detailed in Table 2. In order to host both tasks on the same RZ, we tailor the partition for VLC, the larger of the two. Therefore, considering an extra 5% of resources for the routing process, we should use a partition containing at least 16 CLB columns and one BRAM column. As previously mentioned, placement of one task can be done manually without too much effort. Therefore, we defined a zone fitting the tasks, resulting in a bitstream approximately 132 kB long.
Table 2: Resources estimation for CAVLC and VLC IPs

| IP | VLC | CAVLC |
|---|---|---|
| Number of slice registers | 548 | 277 |
| Number of slice LUTs | 2323 | 1203 |
| Number of Block RAM/FIFO | 2 | 0 |
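The sizing rule of Sect. 5.3 can be sketched from the figures quoted in the text: 160 LUTs per CLB column per clock domain, a 5% routing margin, 41-word frames, and the per-column frame counts. The code below is only an illustration under those figures; the names are hypothetical, the partition is assumed to be one clock domain high, and the roughly 50-word command section is ignored.

```python
import math

LUTS_PER_CLB_COLUMN = 160   # per clock domain, from the figures in Sect. 5.3
WORDS_PER_FRAME = 41        # 32-bit words per Virtex-5 configuration frame
BYTES_PER_WORD = 4
# Frames per column and per clock domain, as quoted in Sect. 5.3.
FRAMES_PER_COLUMN = {"CLB": 36, "DSP": 30, "BRAM": 30 + 128}


def clb_columns_needed(luts: int, margin: float = 0.05) -> int:
    """CLB columns required for `luts` LUTs plus a routing margin."""
    return math.ceil(luts * (1.0 + margin) / LUTS_PER_CLB_COLUMN)


def bitstream_payload_bytes(columns: dict, clock_domains: int = 1) -> int:
    """Configuration payload in bytes for the given column counts."""
    frames = sum(FRAMES_PER_COLUMN[kind] * n for kind, n in columns.items())
    return frames * clock_domains * WORDS_PER_FRAME * BYTES_PER_WORD


if __name__ == "__main__":
    cols = clb_columns_needed(2323)          # VLC, the larger IP in Table 2
    print(cols, bitstream_payload_bytes({"CLB": cols, "BRAM": 1}))
```

With the 2323 LUTs of VLC this yields 16 CLB columns, in line with the lower bound given in the text. The payload of that minimal zone comes to about 120 kB, somewhat below the approximately 132 kB reported for the zone actually drawn, which is plausible since 16 columns is stated only as a minimum.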
After creating a PlanAhead project containing only one reconfigurable task at a time, our tool runs our compression algorithm on the generated partial bitstream. Table 3 presents the compression ratios obtained in our case. We achieve an average compression ratio of 39.5% for actual configuration bitstreams (CAVLC and VLC), with a peak value of 93.4% for the blank bitstream. In addition to reducing overhead, bitstream compression also lowers the memory requirements. Without compression, 390 kB of memory are needed to store the three bitstreams, whereas with compression the memory requirement drops to 184 kB, i.e. about 47% of the uncompressed footprint.

Table 3: Bitstream compression ratios
| Bitstream | Original length (bytes) | Compressed length (bytes) | Compression rate (%) |
|---|---|---|---|
| CAVLC | 130k | 62k | 52.1 |
| VLC | 130k | 95k | 26.9 |
| Blank | 130k | 8.6k | 93.4 |
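The compression rate in Table 3 reads as the fraction of the original length that is saved. A small check under that assumption (the definition is inferred from the table values, and the lengths are the rounded figures shown above, so the last digit can differ):

```python
def compression_rate_percent(original: float, compressed: float) -> float:
    """Share of the original length removed by compression, in percent."""
    return 100.0 * (1.0 - compressed / original)


# (original, compressed) lengths in kB, rounded as in Table 3.
SIZES_KB = {"CAVLC": (130.0, 62.0), "VLC": (130.0, 95.0), "Blank": (130.0, 8.6)}
# Rates as reported in Table 3.
REPORTED = {"CAVLC": 52.1, "VLC": 26.9, "Blank": 93.4}


def mean_rate(names: list) -> float:
    """Mean of the reported rates over the selected bitstreams."""
    return sum(REPORTED[n] for n in names) / len(names)


if __name__ == "__main__":
    for name, (orig, comp) in SIZES_KB.items():
        print(name, round(compression_rate_percent(orig, comp), 1))
    print(round(mean_rate(["CAVLC", "VLC"]), 1))
```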
6.2. Cost model utilization

Table 4 summarizes the estimations made with the FaRM cost model in different operating modes (basic write operation, compressed bitstream, preload only and preload with a compressed bitstream). For each estimation involving compression, the cost model gives a lower and an upper bound, because compression is heterogeneous inside a bitstream. We took the upper bound to ensure a successful implementation.
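Table 4: Reconfiguration overhead estimation in ms

| Operating mode | FIFO depth (words) | CAVLC | VLC | Total overhead |
|---|---|---|---|---|
| Basic | - | 0.99 | 0.99 | 1.98 |
| Basic with compression | - | 0.64 | 0.81 | 1.45 |
| Preload only | 2k | 0.92 | 0.92 | 1.84 |
| Preload only | 4k | 0.86 | 0.86 | 1.72 |
| Preload only | 8k | 0.74 | 0.74 | 1.48 |
| Preload only | 16k | 0.49 | 0.49 | 0.98 |
| Preload only | 32k+ | 0.33 | 0.33 | 0.66 |
| Preload & compression | 2k | 0.57 | 0.74 | 1.31 |
| Preload & compression | 4k | 0.44 | 0.68 | 1.12 |
| Preload & compression | 8k | 0.33 | 0.54 | 0.87 |
| Preload & compression | 16k+ | 0.33 | 0.33 | 0.66 |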
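The preload-only rows of Table 4 can be approximated by a simple two-rate picture: the part of the bitstream already sitting in the FIFO costs nothing extra, the remainder arrives at the bus rate, and the whole transfer can never beat the ICAP rate. This is a deliberately simplified sketch, not the cost model of [7]; the bus rate is calibrated on the "Basic" row, "k" is assumed to mean 1024 words, and all names are illustrative.

```python
BITSTREAM_BYTES = 130_000                    # original length, Table 3
ICAP_BYTES_PER_MS = 400_000.0                # 400 MB/s at 100 MHz
BUS_BYTES_PER_MS = BITSTREAM_BYTES / 0.99    # calibrated on the "Basic" row
K = 1024                                     # assumed meaning of "k" (words)


def preload_only_ms(fifo_words: int) -> float:
    """Approximate per-IP overhead when fifo_words are preloaded."""
    remaining = max(BITSTREAM_BYTES - 4 * fifo_words, 0)
    return max(BITSTREAM_BYTES / ICAP_BYTES_PER_MS,
               remaining / BUS_BYTES_PER_MS)


def smallest_depth(totals_ms: dict, budget_ms: float = 1.0):
    """Smallest FIFO depth whose total overhead is below the budget."""
    fitting = [depth for depth, total in sorted(totals_ms.items())
               if total < budget_ms]
    return fitting[0] if fitting else None


# Total overheads (ms) copied from Table 4, keyed by FIFO depth in words.
PRELOAD_ONLY = {2 * K: 1.84, 4 * K: 1.72, 8 * K: 1.48, 16 * K: 0.98, 32 * K: 0.66}
PRELOAD_COMPRESSED = {2 * K: 1.31, 4 * K: 1.12, 8 * K: 0.87, 16 * K: 0.66}

if __name__ == "__main__":
    for depth in sorted(PRELOAD_ONLY):
        print(depth // K, "k:", round(preload_only_ms(depth), 2), "ms per IP")
    print(smallest_depth(PRELOAD_ONLY) // K, "k words (preload only)")
    print(smallest_depth(PRELOAD_COMPRESSED) // K, "k words (with compression)")
```

Under these assumptions the approximation stays within about 0.01 ms of the preload-only estimates in Table 4. Applying the one-millisecond budget of (2) to the table totals selects a 16k-word FIFO without compression and an 8k-word FIFO with compression.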
We can see that there are two cases in which the total reconfiguration overhead is lower than one millisecond: preload mode alone with a FIFO at least 16k words deep, and preload mode with a FIFO at least 8k words deep coupled with bitstream compression. We therefore prefer an 8k FIFO together with compression, since BRAM is a scarce resource in the targeted FPGA.

6.3. Implementation results

Table 5 compares FaRM to Xilinx's solution in terms of reconfiguration time. These results validate our architecture: indeed, the total time spent reconfiguring CAVLC and VLC does not exceed one millisecond (Sect. 4) when combining preload mode with an 8k words deep FIFO and bitstream compression.
Table 5: FaRM vs. xps_hwicap reconfiguration times in ms for an 8k deep FIFO

| Controller and operating mode | CAVLC | VLC |
|---|---|---|
| Xilinx's xps_hwicap | 2.75 | 2.75 |
| FaRM, basic | 1 | 1 |
| FaRM, with compression | 0.66 | 0.74 |
| FaRM, with preload | 0.738 | 0.738 |
| FaRM, preload & compression | 0.32 | 0.32 |
Reconfiguration allows for optimizing the silicon area of the FPGA. Table 6 compares our reconfigurable implementation with a static implementation where both IP cores (CAVLC and VLC) are present in the circuit. The average gain obtained with a reconfigurable implementation is 10.5%.
Table 6: Resource utilization for static implementation or reconfigurable implementation

| Implementation | Static | Reconfigurable | Optimization |
|---|---|---|---|
| Number of slice registers | 12000 | 9450 | 21.3% |
| Number of slice LUTs | 3900 | 3220 | 17.5% |
| Number of Block RAM/FIFO | 28 | 30 | -7.1% |
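The optimization column of Table 6 is the relative saving of the reconfigurable implementation over the static one, and the average gain quoted in the text is the mean over the three resource types. A brief sketch of that arithmetic (names are illustrative; the counts are those of Table 6):

```python
def gain_percent(static: int, reconfigurable: int) -> float:
    """Relative resource saving of the reconfigurable design, in percent."""
    return 100.0 * (static - reconfigurable) / static


# (static, reconfigurable) resource counts from Table 6.
USAGE = {
    "slice registers": (12000, 9450),
    "slice LUTs": (3900, 3220),
    "Block RAM/FIFO": (28, 30),
}


def average_gain(usage: dict) -> float:
    """Unweighted mean of the per-resource gains."""
    gains = [gain_percent(s, r) for s, r in usage.values()]
    return sum(gains) / len(gains)


if __name__ == "__main__":
    for name, (static, reconf) in USAGE.items():
        print(name, round(gain_percent(static, reconf), 1))
    print("average", round(average_gain(USAGE), 1))
```

The negative Block RAM entry reflects the two extra BRAMs of the reconfigurable version, which the savings in registers and LUTs more than offset on average.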
The area efficiency is defined as the ratio between the resources used by an IP core and the total available resources of the reconfigurable area. Table 7 shows the area efficiency for the CAVLC and VLC cores implemented in the same reconfigurable area. The maximum efficiency is obtained for the VLC core implementation (80%). However, as the partitioning was performed so that enough resources are available to the biggest IP core, a smaller IP core implemented in this area yields a lower area efficiency. This is the case when CAVLC is implemented: the area efficiency is 32%. Also, we only obtain a 50% efficiency for Block RAM with VLC, despite the zone being defined to fit VLC. This is due to the partial reconfiguration granularity in Virtex-5 FPGAs, which is the height of a column: a column contains four BRAMs, so even if only two BRAMs are needed, the zone has to include four.

Table 7: Reconfigurable area efficiency
| IP implemented on reconfigurable area | CAVLC | VLC |
|---|---|---|
| Number of slice registers | 50.1% | 96.8% |
| Number of slice LUTs | 46.2% | 91.3% |
| Number of Block RAM/FIFO | 0% | 50% |
Adding a static and dynamic wrapper may cause a significant overhead on the physical resources of FPGA. We have optimized this wrapper to reduce this overhead. Table 8 shows the overhead added by the wrapper, compared to VLC and CAVLC IP cores. The average overhead of the wrapper on both IPs is less than 5%. This overhead will be offset by the fact that only one IP is instantiated on the system. Table 8: Wrapper overhead

IP                          CAVLC    VLC     Wrapper    Overhead (CAVLC)    Overhead (VLC)
Number of slice registers   1203     2323    90         7.5%                3.9%
Number of slice LUTs        277      548     34         12.3%               6.2%
Number of BRAM/FIFO         0        2       0          0%                  0%
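The overhead columns of Table 8 can be reproduced with the sketch below. It assumes that the overhead is the ratio of wrapper resources to IP core resources, which matches the published percentages, and that a resource the wrapper does not use counts as 0% overhead. The names are illustrative only.

```python
# Resource counts taken from Table 8.
WRAPPER = {"registers": 90, "luts": 34, "bram": 0}
IP_USAGE = {
    "CAVLC": {"registers": 1203, "luts": 277, "bram": 0},
    "VLC": {"registers": 2323, "luts": 548, "bram": 2},
}


def overhead_percent(wrapper_count: int, ip_count: int) -> float:
    """Wrapper resources as a percentage of the IP core resources.

    ASSUMPTION: a resource the wrapper does not use is reported as 0%,
    even when the IP core does not use it either (CAVLC and BRAM).
    A nonzero wrapper count with a zero IP count raises ZeroDivisionError.
    """
    if wrapper_count == 0:
        return 0.0
    return 100.0 * wrapper_count / ip_count


overheads = {
    ip: {res: overhead_percent(WRAPPER[res], usage[res]) for res in WRAPPER}
    for ip, usage in IP_USAGE.items()
}

# Unweighted mean over both IP cores and all three resource types.
all_values = [v for per_ip in overheads.values() for v in per_ip.values()]
average_overhead = sum(all_values) / len(all_values)

for ip, per_resource in overheads.items():
    details = ", ".join(f"{res} {val:.1f}%" for res, val in per_resource.items())
    print(f"{ip}: {details}")
print(f"average overhead: {average_overhead:.2f}%")
```

With this definition, the mean over the six table entries stays below the 5% bound stated in the text.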
7. Future works

Our reconfigurable partition selector still has several limitations. It only handles rectangular zones, which may waste resources; this is particularly costly for scarce resources such as BRAM. One solution is to handle L-shaped partitions, separating the CLB needs from the BRAM needs. If this is not possible, or if it implies a significant wire length overhead, another solution would be to synthesize the IP again while forcing the tool to use only CLB resources instead of Block RAM and DSP48, e.g. using LUT-RAM instead of Block RAM. An IP like VLC or CAVLC would then no longer need BRAM to be included in the partition. However, this solution comes at the expense of CLB resources and should be studied in order to obtain a good trade-off.

Another issue to investigate is the development and evaluation of a complete transcoding chain based on our reconfigurable architecture. This will enable us to build an adaptive system for multiple video standards. The proposed wrapper is already designed to be used with other IP cores of the transcoder, such as quantization, transforms and prediction.

8. Conclusion

In this paper, we presented a methodology for using dynamic and partial reconfiguration in applications with severe real-time constraints. This methodology uses FaRM, a high-speed reconfiguration controller coupled with its cost model. Our approach was validated using a reconfigurable entropy coder, where DPR is used to switch from one standard (MPEG-2 VLC) to another (H.264 CAVLC). We presented a generic wrapper unifying interfaces for both adaptation standards. The results obtained are promising for the implementation of the fully reconfigurable transcoder. In our ongoing work, we are applying the same methodology to all IP cores composing the video encoder and the video decoder.
FaRM was parameterized to satisfy the overhead constraint (reconfiguration time under one millisecond) and to give the best area vs performance trade-off. We verified our assumption with a Virtex-5 device, reaching the configuration port theoretical throughput under some conditions.
Acknowledgements

This work was carried out within the framework of project ARDMAHN [29] sponsored by the ANR, which aims at developing methodologies for home gateways that integrate partial and dynamic reconfiguration.

References

[1] M. Guarisco, H. Rabah, A. Amira, Dynamically reconfigurable architecture for real time adaptation of H264/AVC-SVC video streams, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, 2010, pp. 39–44.
[2] N. Marques, H. Rabah, E. Dabellani, S. Weber, Partially reconfigurable entropy encoder for multi standards video adaptation, Consumer Electronics (ISCE), 2011 IEEE 15th International Symposium on (2011) 492–496.
[3] C. Kao, Benefits of Partial Reconfiguration, Xcell Journal 55 (2005) 65–67.
[4] K. Paulsson, M. Hübner, S. Bayar, J. Becker, Exploitation of Run-Time Partial Reconfiguration for Dynamic Power Management in Xilinx Spartan III-based Systems, in: Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, 2008, pp. 699–700.
[5] K. Compton, S. Hauck, Reconfigurable computing: a survey of systems and software, ACM Comput. Surv. 34 (2002) 171–210.
[6] R. Tessier, W. Burleson, Reconfigurable Computing for Digital Signal Processing: A Survey, J. VLSI Signal Process. Syst. 28 (2001) 7–27. URL http://dl.acm.org/citation.cfm?id=598544.598597
[7] F. Duhem, F. Muller, P. Lorenzini, Reconfiguration time overhead on field programmable gate arrays: reduction and cost model, IET Computers & Digital Techniques 6 (2) (2012) 105–113. doi:10.1049/iet-cdt.2011.0033. URL http://link.aip.org/link/?CDT/6/105/1
[8] K. Siozios, G. Koutroumpezis, K. Tatas, D. Soudris, A. Thanailakis, DAGGER: A Novel Generic Methodology for FPGA Bitstream Generation and Its Software Tool Implementation, in: Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, 2005, pp. 165b–165b.
[9] D. Koch, C. Beckhoff, J. Teich, Bitstream Decompression for High Speed FPGA Configuration from Slow Memories, in: Field-Programmable Technology, 2007. ICFPT 2007. International Conference on, 2007, pp. 161–168.
[10] H. Tan, R. DeMara, A Physical Resource Management Approach to Minimizing FPGA Partial Reconfiguration Overhead, Reconfigurable Computing and FPGAs, International Conference on 0 (2006) 1–5.
[11] M. Martina, G. Masera, A. Molino, F. Vacca, L. Sterpone, M. Violante, A new approach to compress the configuration information of programmable devices, in: Proceedings of the conference on Design, automation and test in Europe: Designers’ forum, DATE ’06, European Design and Automation Association, 3001 Leuven, Belgium, 2006, pp. 48–51. URL http://dl.acm.org/citation.cfm?id=1131355.1131366
[12] L. He, T. Mitra, W.-F. Wong, Configuration bitstream compression for dynamically reconfigurable FPGAs, in: Proceedings of the 2004 IEEE/ACM International conference on Computer-aided design, ICCAD ’04, IEEE Computer Society, Washington, DC, USA, 2004, pp. 766–773. doi:10.1109/ICCAD.2004.1382679. URL http://dx.doi.org/10.1109/ICCAD.2004.1382679
[13] P. Bomel, J. Crenne, L. Ye, J.-P. Diguet, G. Gogniat, Ultra-Fast Downloading of Partial Bitstreams through Ethernet, in: Proceedings of the 22nd International Conference on Architecture of Computing Systems, ARCS ’09, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 72–83.
[14] M. Liu, W. Kuehn, Z. Lu, A. Jantsch, Run-time Partial Reconfiguration Speed Investigation and Architectural Design Space Exploration, in: Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, 2009.
[15] A. Grasset, P. Millet, P. Bonnot, S. Yehia, W. Putzke-Roeming, F. Campi, A. Rosti, M. Huebner, N. Voros, D. Rossi, H. Sahlbach, R. Ernst, The MORPHEUS heterogeneous dynamically reconfigurable platform, International Journal of Parallel Programming 39 (2011) 328–356. doi:10.1007/s10766-010-0160-3. URL http://dx.doi.org/10.1007/s10766-010-0160-3
[16] S. Chevobbe, S. Guyetant, Reducing reconfiguration overheads in heterogeneous multicore RSoCs with predictive configuration management, Int. J. Reconfig. Comput. 2009 (2009) 8:4–8:4. doi:10.1155/2009/390167. URL http://dx.doi.org/10.1155/2009/390167
[17] M. Liu, Z. Lu, W. Kuehn, A. Jantsch, Reducing FPGA Reconfiguration Time Overhead using Virtual Configurations, ReCoSoC.
[18] C. Foucher, F. Muller, A. Giulieri, Fast Integration of Hardware Accelerators for Dynamically Reconfigurable Architecture.
[19] C. Claus, R. Ahmed, F. Altenried, W. Stechele, Towards Rapid Dynamic Partial Reconfiguration in Video-Based Driver Assistance Systems, in: P. Sirisuk, F. Morgan, T. El-Ghazawi, H. Amano (Eds.), Reconfigurable Computing: Architectures, Tools and Applications, Vol. 5992 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2010, pp. 55–67. doi:10.1007/978-3-642-12133-3_8. URL http://dx.doi.org/10.1007/978-3-642-12133-3_8
[20] R. Bonamy, H.-M. Pham, S. Pillement, D. Chillet, UPaRC - ultra-fast power-aware reconfiguration controller, in: W. Rosenstiel, L. Thiele (Eds.), DATE, IEEE, 2012, pp. 1373–1378.
[21] A. Vetro, C. Christopoulos, H. Sun, Video transcoding architectures and techniques: an overview, Signal Processing Magazine, IEEE 20 (2) (2003) 18–29.
[22] J. Zhang, A. Perkis, N. D. Georganas, H.264/AVC and transcoding for multimedia adaptation, Proceedings of the 6th COST 276 Workshop (2004) 1–6.
[23] Y.-C. Chang, R.-C. Chang, L.-G. Chen, Design and implementation of a bitstream parsing coprocessor for MPEG-4 video system-on-chip solution, VLSI Technology, Systems, and Applications, 2001. Proceedings of Technical Papers. 2001 International Symposium on (2001) 188–191.
[24] M. Bystrom, I. Richardson, S. Kannangara, M. de Frutos-Lopez, Dynamic replacement of video coding elements, Image Commun. 25 (2010) 303–313.
[25] C.-C. Lo, S.-T. Tsai, M.-D. Shieh, A reconfigurable architecture for entropy decoding and IDCT in H.264, VLSI Design, Automation and Test, 2009. VLSI-DAT ’09. International Symposium on (2009) 279–282.
[26] J. Peng, X. Qin, J. Yang, X. Yan, X. Chen, A programmable bitstream parser for multiple video coding standards, Innovative Computing, Information and Control, 2006. ICICIC ’06. First International Conference on 3 (2006) 609–612.
[27] Xilinx Inc., Virtex-5 Configuration User Guide (2010).
[28] B. Ouni, I. Belaid, F. Muller, M. Benjemaa, Placement of Hardware Tasks on FPGA using the BEE Algorithm, in: International Conference on Pervasive and Embedded Computing and Communication Systems (PECCS11), 2011.
[29] ARDMAHN consortium, ARDMAHN project, http://ARDMAHN.org/.